The shift from keyword search to a queryless way to get information has arrived
Google Discover is an AI-driven content recommendation tool included with the Google Search app. Here is what we learned from the data available in the Google Search Console.
Google introduced Discover in 2017 and claims that 800M active users are already consuming content through this new application. Google recently added statistical data on the traffic generated by Discover to Google Search Console. This is meant to help webmasters, and publishers in general, understand what content ranks best on this new platform and how it might differ from the content ranking on Google Search.
What was most striking for me to see, on some of the large websites we work for with our SEO management service, is that between 25% and 42% of the total organic clicks are already generated by this new recommendation tool. I did expect Discover to drive a significant amount of organic traffic, but I totally underestimated its true potential.
A snapshot from GSC on a news and media site
In Google’s AI-first approach, organic traffic is no longer solely dependent on queries typed by users in the search bar.
This has a tremendous impact on content publishers, business owners, and the SEO industry as a whole.
Machine learning is working behind the scenes to harvest data about users’ behaviors, to learn from this data and to suggest what is relevant for them at a specific point in time and space.
Let’s have a look at how Google explains how Discover works.
[…] We’ve taken our existing Knowledge Graph—which understands connections between people, places, things and facts about them—and added a new layer, called the Topic Layer, engineered to deeply understand a topic space and how interests can develop over time as familiarity and expertise grow. The Topic Layer is built by analyzing all the content that exists on the web for a given topic and develops hundreds and thousands of subtopics. For these subtopics, we can identify the most relevant articles and videos—the ones that have shown themselves to be evergreen and continually useful, as well as fresh content on the topic. We then look at patterns to understand how these subtopics relate to each other, so we can more intelligently surface the type of content you might want to explore next.
Embrace Semantics and publish data that can help machines be trained.
Once again, the data that we produce sustains and nurtures this entire process. Here is an overview of the contextual data, besides the Knowledge Graph and the Topic Layer, that Google uses to train the system:
This research is limited to the data gathered from three websites only; while the sample was small, a few patterns emerged:
Google tends to distribute content between Google Search and Google Discover (the highest overlap I found was 13.5% – these are pages that, since Discover data has been collected on GSC, have received traffic from both channels)
Pages in Discover do not have the highest engagement in terms of bounce rate or average time on page when compared to all other pages on a website. They are relevant for a specific intent and well curated, but I didn’t see any correlation with social metrics.
Traffic seems to arrive in a 48- or 72-hour spike, as already seen for Top Stories.
To optimize your content for Google Discover, here is what you should do.
1. Make sure you have an entity in the Google Knowledge Graph or an account on Google My Business
Either your business or product is already in the Google Knowledge Graph, or it is not. If it is not, there is little chance that the content you write about your company or product will appear in Discover (unless this content is bound to other, broader topics). I am able to read articles about WordLift in my Discover stream because WordLift has an entity in the Google Knowledge Graph. From the configuration screenshot above we can see there are indeed several entities when I search for “WordLift”:
one related to Google My Business (WordLift Software Company in Rome is the label we use on GMB),
one from the Google Knowledge Graph (WordLift Company)
one presumably about the product (without any tagline)
one about myself as CEO of the company
So, get into the graph and make sure to curate your presence on Google My Business. Interestingly, the relationship between myself and WordLift is such that when looking for WordLift, Google also shows Andrea Volpini as a potential topic of interest.
In these examples, we see that from Google Search I can start following people that are already in the Google Knowledge Graph, and what the user experience in Discover looks like for content related to the entity WordLift.
2. Focus on high-quality content and a great user experience
It is also good to remember that quality, in terms of both the content you write (alignment with Google’s content quality policies) and the user experience on your website, is essential. A website that takes 10 seconds or more to load on a mobile connection is not going to be featured in Discover. A clickbait article, with more ads than content, is not going to be featured in Discover. An article written by copying other websites and patently infringing copyright law is not likely to be featured in Discover.
3. Be relevant and write content that truly helps people by responding to their specific information need
Recommendation tools like Discover only succeed when they are capable of enticing the user to click on the suggested content. To do so effectively, they need to work with content designed to answer a specific request. Let’s see a few examples: “I am interested in SEO” (entity “Search Engine Optimization”), or “I want to learn more about business models” (entity “Business Model”).
The more we can match the intent of the user, in a specific context (or micro-moment if you like), the more we are likely to be chosen by a recommendation tool like Discover.
4. Always use an appealing hi-res image and a great title
Images play a very important role in Google’s card-based UI as well as in Discover. Whether you are presenting a cookie recipe or an article, the image you choose will be presented to the user and will play its part in enticing the click. Besides the editorial quality of the image, I also suggest you follow the AMP requirements for images (the smallest side of the featured image should be at least 1,200 px). Similarly, a good title, much like in the traditional SERP, is super helpful in driving clicks.
5. Organize your content semantically
Much like Google does, using tools like WordLift, you can organize content with semantic networks and entities. This allows you to: a) help Google (and other search engines) gather more data about “your” entities; b) organize your content the same way Google does (and therefore measure its performance by looking at topics rather than at pages and keywords); c) train your own ML models to help you make better decisions for your business.
Let me give you a few examples. If I provide, let’s say, information about our company and the industry we work in using entities that Google can crawl, Google’s AI will be able to connect content related to our business with people interested in “startups”, “seo” and “artificial intelligence”. Machine learning, as we usually say, is hungry for data, and semantically rich data is what platforms like Discover use to learn how to be relevant.
If I look at the traffic I generate on my website not only in terms of pages and keywords but also in terms of entities (as we do with our new search rankings dashboard or the Google Analytics integration), I can quickly see what content is relevant for a given topic and improve it.
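As a sketch of what this entity-centric reading of traffic looks like, the snippet below rolls page-level clicks up into entity-level totals. The URLs, annotations and click counts are made-up sample data, not output from WordLift or Search Console.

```python
# Roll up page-level organic clicks into entity-level totals.
from collections import defaultdict

# Hypothetical annotations: each URL is tagged with one or more entities.
page_entities = {
    "/blog/semantic-seo": ["Search Engine Optimization", "Artificial Intelligence"],
    "/blog/knowledge-graphs": ["Artificial Intelligence"],
    "/blog/business-models": ["Business Model"],
}

# Hypothetical click data exported from Google Search Console.
page_clicks = {
    "/blog/semantic-seo": 120,
    "/blog/knowledge-graphs": 80,
    "/blog/business-models": 45,
}

def clicks_by_entity(page_entities, page_clicks):
    """Sum organic clicks per entity across all annotated pages."""
    totals = defaultdict(int)
    for page, entities in page_entities.items():
        for entity in entities:
            totals[entity] += page_clicks.get(page, 0)
    return dict(totals)

# "Artificial Intelligence" aggregates two pages: 120 + 80 = 200 clicks.
print(clicks_by_entity(page_entities, page_clicks))
```

Grouping by entity instead of by page makes it immediately visible which topics, rather than which individual URLs, are earning the traffic.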
Use entities to analyze how your content is performing on organic search
Below is a list of pages we have annotated with the entity “Artificial Intelligence”. Are these pages relevant for someone interested in AI? Can we do a better job of helping these people learn more about this topic?
A few of the articles tagged with the entity “Artificial Intelligence” and their respective query
Learn more about Google Discover – Questions & Answers
Below is a list of questions that I have answered over the past days as data from Discover was made available in GSC. I hope you’ll find it useful too.
How does Discover work from the end-user perspective?
The suggestions in Discover are entity-based. Google groups content that it believes is relevant using entities in its Knowledge Graph (e.g. “WordLift”, “Andrea Volpini”, “Business” or “Search Engine Optimization”). Entities are called topics. The content-based filtering algorithm behind Discover can be configured from a menu in the application (“Customize Discover”) and fine-tuned over time by providing direct feedback on the recommended content in the form of “Yes, I want more of this” or “No, I am not interested”. Using Reinforcement Learning (a specific branch of Machine Learning) and Neural Matching (different ways of understanding what the content is about), the algorithm is capable of creating a personalized feed of information from the web. New topics can be followed by clicking on the “+” sign.
Topics are organized in a hierarchy of categories and subcategories (such as “Sport”, “Technology”). Read more here on how to customize Google Discover.
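As a toy illustration of the entity-based grouping described above, the sketch below ranks articles by how well their entities match the topics a user follows; the topic weights stand in for accumulated “more of this / not interested” feedback. It is only a sketch of the idea, not Google’s actual algorithm.

```python
# Toy content-side filtering: articles are scored by the overlap between
# their entities and the user's followed topics. Names and weights are
# illustrative, not real Discover data.
followed_topics = {
    "Search Engine Optimization": 1.0,  # followed via the "+" sign
    "Business Model": 0.5,              # down-weighted by past feedback
}

articles = [
    {"title": "Entity SEO basics", "entities": {"Search Engine Optimization"}},
    {"title": "Subscription business models", "entities": {"Business Model"}},
    {"title": "Best pasta recipes", "entities": {"Cooking"}},
]

def score(article, topics):
    """Sum the weights of the followed topics the article is about."""
    return sum(topics.get(entity, 0.0) for entity in article["entities"])

def feed(articles, topics):
    """Order articles by topic match and drop the ones with no overlap."""
    ranked = sorted(articles, key=lambda a: score(a, topics), reverse=True)
    return [a["title"] for a in ranked if score(a, topics) > 0]

print(feed(articles, followed_topics))
# → ['Entity SEO basics', 'Subscription business models']
```

The cooking article never enters the feed because it shares no entity with the followed topics, which is exactly why being in the graph matters.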
How can I access Discover?
On most Android devices, accessing Discover is as simple as swiping right from the home screen.
Is Google Discover available only in the US?
No, Google Discover is already available worldwide and in multiple languages. It is part of the core search experience on all Android devices and on any iOS device with the Google Search app installed. Discover is also available in Google Chrome.
Do I have to be on Google News to be featured in Discover?
No, Google Discover also uses content that is not published on Google News. It is more likely that a news site will appear on Google Discover due to the amount of content published every day and the different topics that a news site usually covers.
Is evergreen content eligible for Discover, or only freshly updated articles?
Evergreen content that fits a specific information need is as important as newsworthy content. I spotted an article from FourWeekMBA.com (Gennaro’s blog on business administration and management) that was published 9 months ago under the entity “business”.
Does a page need to rank high on Google Search to be featured in Discover?
Quite interestingly, on a news website where I analyzed the GSC data, only 13.5% of the pages featured in Discover had also received traffic on Google Search. Pages that received traffic on both channels had a position on Google Search <= 8.
Correlation of Google Discover Clicks and Google Search Position
How can I measure the impact of Discover from Google Analytics?
A simple way is to download the .csv file containing all the pages listed in the Discover report in GSC and create an advanced filter in Google Analytics under Behaviour > Site Content > All pages with the following combination of parameters:
Filtering all pages that have received traffic from Discover in Google Analytics
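If the export contains many pages, building that filter by hand is tedious; a small script can assemble a “matches regex” expression from the CSV. The column name `Page` and the sample rows are assumptions about the export layout, so adjust them to match your download.

```python
# Build a GA "matches regex" filter value from the GSC Discover export.
import csv
import io
import re
from urllib.parse import urlparse

# Stand-in for the downloaded .csv file (assumed layout).
sample_csv = """Page,Clicks,Impressions
https://example.com/blog/google-discover,500,12000
https://example.com/blog/entity-seo,230,8000
"""

def discover_filter_regex(csv_text):
    """Return an anchored regex matching the paths of all Discover pages."""
    reader = csv.DictReader(io.StringIO(csv_text))
    paths = [re.escape(urlparse(row["Page"]).path) for row in reader]
    return "^(" + "|".join(paths) + ")$"

print(discover_filter_regex(sample_csv))
```

Paste the resulting expression into an advanced filter on the Page dimension under Behaviour > Site Content > All Pages; very long page lists may need to be split into several shorter expressions.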
Discover is yet another important step in the evolution of search engines into answer and discovery machines that help us sift through today’s content multiverse.
Keep following us, and give WordLift a spin with our free trial!
One of the most fascinating features of deep neural networks applied to NLP is that, provided with enough examples of human language, they can generate text and help us discover many of the subtle variations in meaning. In a recent blog post by Google research scientist Brian Strope and engineering director Ray Kurzweil, we read:
“The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc.”
Following this hierarchical structure, new computational language models aim at simplifying the way we communicate and have silently entered our daily lives; from Gmail’s “Smart Reply” feature to the keyboards on our smartphones, recurrent neural networks and character/word-level prediction using LSTMs (Long Short-Term Memory networks) have paved the way for a new generation of agentive applications.
From keyword research to keyword generation
As usual with my AI-powered SEO experiments, I started with a concrete use case. One of our strongest publishers in the tech sector was asking us for new, unexplored search intents to invest in with articles and how-to guides. Search marketers, copywriters and SEOs have been scouting for the right keyword to connect with their audience for the last 20 years. While there is a large number of tools available for keyword research, I thought: wouldn’t it be better if our client could have a smart auto-complete to generate any number of keywords in their semantic domain, instead of keyword data curated by us? The way a search intent (or query) can be generated, I also thought, is quite similar to the way a title could be suggested during the editing phase of an article. And titles (or SEO titles), with a trained language model that takes into account what people search for, could help us find the audience we’re looking for in a simpler way.
What makes RNNs “more intelligent” compared to feed-forward networks is that, rather than working on a fixed number of steps, they compute sequences of vectors. They process not only the current input, but also everything that they have perceived previously in time.
This characteristic makes them particularly efficient in processing human language (a sequence of letters, words, sentences, and paragraphs) as well as music (a sequence of notes, measures, and phrases) or videos (a sequence of images).
Here above you can see the difference between a recurrent neural network and a feed-forward neural network. Basically, RNNs have a short-term memory that allows them to store the information processed at previous steps: the hidden state is looped back as part of the input. LSTMs are an extension of RNNs whose goal is to “prolong” or “extend” this internal memory, allowing them to remember previous words, previous sentences or any other value from the beginning of a long sequence.
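The loop can be illustrated in a few lines: the hidden state computed at one step is fed back in at the next, so even a zero input leaves a trace of what came before. The one-dimensional weights here are arbitrary toy values.

```python
# Minimal illustration of recurrence: the hidden state produced at step
# t-1 is fed back in at step t, so each input is processed together with
# a summary of everything seen before. Weights are arbitrary toy values.
import math

W_x, W_h = 0.5, 0.9   # input weight and recurrent weight

def rnn_step(x, h_prev):
    """One recurrent step: mix the current input with the looped-back state."""
    return math.tanh(W_x * x + W_h * h_prev)

h = 0.0                      # initial hidden state
for x in [1.0, 0.0, 0.0]:    # a short input sequence
    h = rnn_step(x, h)
    print(round(h, 3))
# Even after two zero inputs, h stays non-zero: the first input is
# still "remembered" through the recurrent connection.
```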
The LSTM cell where each gate works like a perceptron.
Imagine a long article where I explain at the beginning that I am Italian, and this information is then followed by, let’s say, 2,000 other words. An LSTM is designed in such a way that it can “recall” that piece of information while processing the last sentence of the article and use it to infer, for example, that I speak Italian. A common LSTM cell is made of an input gate, an output gate and a forget gate. The cell remembers values over a time interval, and the three gates regulate the flow of information into and out of the cell, each working much like a mini neural network. In this way, LSTMs can overcome the vanishing gradient problem of traditional RNNs.
If you want to go deeper into the mathematics behind recurrent neural networks and LSTMs, go ahead and read this article by Christopher Olah.
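For a concrete picture of the gates, here is a scalar LSTM step with illustrative weights; each gate is a weighted sum squashed by a sigmoid, exactly the “mini neural network” behavior described above.

```python
# A scalar LSTM cell: input, forget, and output gates each act like a
# tiny perceptron. All weights are illustrative toy values.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step on scalars; w holds (input, recurrent) weights per gate."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev)    # input gate: admit new info
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev)    # forget gate: keep old state?
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev)    # output gate: expose the cell
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev)  # candidate cell value
    c = f * c_prev + i * g                             # cell state: the "long" memory
    h = o * math.tanh(c)                               # hidden state: the output
    return h, c

w = {"i": (1.0, 0.5), "f": (1.0, 0.5), "o": (1.0, 0.5), "g": (1.0, 0.5)}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, w)
print(round(c, 3))  # the cell state decays slowly, carrying the early signal forward
```

Because the forget gate multiplies rather than repeatedly squashes the cell state, the early signal fades much more gently than in the plain RNN above.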
Let’s get started: “Io sono un compleanno!”
After reading Andrej Karpathy’s blog post, I found a terrific Python library called textgenrnn by Max Woolf. This library is developed on top of TensorFlow and makes it super easy to experiment with recurrent neural networks for text generation.
Before looking at generating keywords for our client I decided to learn text generation and how to tune the hyperparameters in textgenrnn by doing a few experiments.
AI is interdisciplinary by definition: the goal of every project is to bridge the gap between computer science and human intelligence.
I started my tests by throwing into the process a large text file in English that I found on Peter Norvig’s website (https://norvig.com/big.txt) and I ended up, thanks to the help of Priscilla (a clever content writer collaborating with us), “resurrecting” David Foster Wallace with his monumental Infinite Jest (provided in Italian from Priscilla’s ebook library and spiced up with some of her random writings).
At the beginning of the training process – in a character-by-character configuration – you can see exactly what the network sees: a nonsensical sequence of characters that, a few epochs (training iteration cycles) later, will transform into proper words.
As I became more accustomed to the training process I was able to generate the following phrase:
“Io sono un compleanno. Io non voglio temere niente? Come no, ancora per Lenz.”
“I’m a birthday. I don’t want to fear anything? And, of course, still for Lenz.”
David Foster Wallace
Unquestionably a great piece of literature 😅 that gave me the confidence to move ahead in creating a smart keyword suggest tool for our tech magazine.
The dataset used to train the model
As soon as I was confident enough to get things working (this basically means being able to find a configuration that – with the given dataset – could produce a language model with a loss value equal to or below 1.0), I asked Doreid, our SEO expert, to work on WooRank’s API and to prepare a list of 100,000 search queries that could be relevant for the website.
To scale up the numbers we began by querying Wikidata to get a list of software for Windows that our readers might be interested in reading about. As for any ML project, data is the most strategic asset. So while we want to be able to generate never-seen-before queries, we also want to train the machine on something that is unquestionably good from the start.
The best way to connect words to concepts is to define a context for these words. In our specific use case, the context is primarily represented by software applications that run on the Microsoft Windows operating system. We began by slicing the Wikidata graph with a simple query that provided us with a list of 3,780+ software apps that run on Windows and 470+ related software categories. By expanding this list of keywords and categories, Doreid came up with a CSV file containing the training dataset for our generator.
The first rows in the training dataset.
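A Wikidata query along these lines can produce such a list (P306 is the “operating system” property and Q1406 is Microsoft Windows); the exact query behind the article’s dataset may differ.

```python
# Query the public Wikidata SPARQL endpoint for software that runs on
# Microsoft Windows. The LIMIT and labels are illustrative.
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SPARQL = """
SELECT ?app ?appLabel WHERE {
  ?app wdt:P306 wd:Q1406 .  # operating system = Microsoft Windows
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

def query_wikidata(sparql):
    """Run a SPARQL query against the public Wikidata endpoint."""
    url = "https://query.wikidata.org/sparql?" + urlencode(
        {"query": sparql, "format": "json"})
    req = Request(url, headers={"User-Agent": "keyword-dataset-sketch/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

# Example (performs a network request):
#   for row in query_wikidata(SPARQL):
#       print(row["appLabel"]["value"])
```

Each returned label becomes a seed concept that the keyword expansion step can turn into training queries.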
After several iterations, I was able to define the top-performing configuration by applying the values below. I moved from character-level to word-level training, and this greatly increased the speed of the training. As you can see, I have 6 layers with 128 cells on each layer and I am running the training for 100 epochs. Depending on the size of the dataset, training is limited in practice by the fact that Google Colab stops the session after 4 hours (this is also a gentle reminder that it might be the right time to move from Google Colab to Cloud Datalab, the paid version in Google Cloud).
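Expressed as a textgenrnn call, that configuration looks roughly like this; the dataset file name and model name are placeholders, not the ones used in the experiment.

```python
# The hyperparameters described above as a textgenrnn training call.
TRAIN_CONFIG = {
    "new_model": True,
    "word_level": True,   # word-level tokens train much faster than characters
    "rnn_layers": 6,      # 6 stacked recurrent layers...
    "rnn_size": 128,      # ...with 128 cells each
    "num_epochs": 100,    # in practice bounded by Colab's ~4-hour sessions
}

def train_keyword_model(dataset_path="training_queries.csv"):
    """Train the generator; requires `pip install textgenrnn`."""
    from textgenrnn import textgenrnn
    textgen = textgenrnn(name="keyword_generator")
    textgen.train_from_file(dataset_path, **TRAIN_CONFIG)
    return textgen
```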
Here we see the initial keywords being generated while training the model
Rock & Roll, the fun part
After a few hours of training, the model was ready to generate our never-seen-before search intents with a simple Python script.
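With textgenrnn, such a generation script can be as short as the following; the weights file name is a placeholder for the trained model.

```python
# Generate search intents from a trained textgenrnn model.
def generate_queries(weights_path="keyword_generator_weights.hdf5",
                     n=20, temperature=0.5, prefix=None):
    """Return n generated queries; requires `pip install textgenrnn`."""
    from textgenrnn import textgenrnn
    textgen = textgenrnn(weights_path)
    # temperature raises or lowers "creativity"; prefix pins the first words
    return textgen.generate(n=n, temperature=temperature,
                            prefix=prefix, return_as_list=True)

# Example calls (require the trained weights file):
#   generate_queries()                                   # free generation
#   generate_queries(n=5, temperature=1.0, prefix="how to remove")
```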
Here are a few examples of generated queries:
where to find google drive downloads
where to find my bookmarks on google chrome
how to change your turn on google chrome
how to remove invalid server certificate error in google chrome
how to delete a google account from chrome
how to remove google chrome from windows 8 mode
how to completely remove google chrome from windows 7
how do i remove google chrome from my laptop
You can play with temperatures to improve the creativity of the results, or provide a prefix to indicate the first words of the keyword you have in mind and let the generator figure out the rest.
Takeaways and future work
“Smart Reply”-style suggestions can be applied to keyword research work, and it is worth assessing in a systematic way the quality of these suggestions in terms of:
validity – is this meaningful or not? Does it make sense for a human?
relevance – is this query really hitting the target audience of the website? Or is it off-topic? and
impact – is this keyword well-balanced in terms of competitiveness and volume considering the website we are working for?
The initial results are promising: all of the initial 200+ generated queries were different from the ones in the training set and, by increasing the temperature, we could explore new angles on an existing topic (e.g. “where is area 51 on google earth?”) or even evaluate new topics (e.g. “how to watch android photos in Dropbox” or “advertising plugin for google chrome”).
It would be simply terrific to implement – with a Generative Adversarial Network (or using Reinforcement Learning) – a way to help the generator produce only valuable keywords (keywords that – given the website – are valid, relevant and impactful in terms of competitiveness and reach). Once again, it is crucial to define the right mix of keywords to train our model (can we source them from a graph as we did in this case? Shall we only use the top-ranking keywords from our best competitors? Should we mainly focus on long-tail, conversational queries and leave out the rest?).
One thing that emerged very clearly is that experiments like this one (combining LSTMs with data sourced from public knowledge graphs such as Wikidata) are a great way to shed some light on how Google might be working to improve the evaluation of search queries using neural nets. What is now called “Neural Matching” might well be just a sexy PR expression but, behind the recently announced capability of analyzing long documents and evaluating search queries, it is fair to expect that Google is using RNN architectures, contextual word embeddings, and semantic similarity. As deep learning, and AI in general, becomes more accessible (frameworks are open source and there is healthy open knowledge sharing in the ML/DL community), it becomes evident that Google leads the industry with the amount of data it has access to and the computational resources it controls.
This experiment would not have been possible without textgenrnn by Max Woolf and TensorFlow. I am also deeply thankful to all of our VIP clients engaging in our SEO management services, our terrific VIP team: Laura, Doreid, Nevine and everyone else constantly “lifting” our startup, Theodora Petkova for challenging my robotic mind 😅 and my beautiful family for sustaining my work.
Get to know your content and turn insight into action
The more you know about your content, the easier it is to reach your readers by capturing search engine traffic. We’re happy to introduce a new dashboard to help you understand your content and improve your editorial plan.
WordLift’s knowledge graph is the semantic representation of the content on your website. Every article and every page is annotated with one or more entities. These entities are accessible on the front-end as web pages (topical hubs) or are simply used by WordLift to add structured data markup depending on the way the plugin is configured.
The new content Dashboard
Your new Dashboard will help you quickly take action on insights about your content, including:
Most relevant entities. Find out which concepts are most prominent on your site (prominent = highest number of annotated posts), so you know if this is what you’re aiming for.
Most connected entities. Spot entities that are most connected with other entities; these are concepts that help you build context for your readers (they explain the things you talk about). When you have an entity with lots of links to other entities but fewer articles, you might want to create more articles around it.
Target articles to enrich that have not yet been WordLifted (one single click on Enrich), or focus your attention on improving your entities (with just one click on Boost). These links take you to the list of articles and the list of entities that you can improve. Jump right in and check it out yourself.
What are you waiting for? Let WordLift analyze your content for you. Download our plugin and unleash the power of semantic technologies.
Take full control of your search rankings
Find out the content that really ranks on Google
After many meetings, an endless number of sketches, commits on git and tons of love, we are really happy to bring a brand new search rankings tool to our Editorial and Business subscribers, developed in partnership with WooRank.
Connecting with your audience isn’t just about ranking high on Google for a single keyword. When we write, in order to be relevant for our audience, and following the introduction of Hummingbird, we need to focus on topical hubs.
To truly optimize your site for Google’s new semantic understanding of user queries, we have to think in terms of entities and not just keywords.
Moreover, we have to consider:
the connections between entities and
how these relationships help us build the context for the content that we are producing
To help you make this switch from keywords to entities we have created a tool that helps you track your rankings using the entities in the vocabulary of your website.
Here is how it works:
The Keywords configuration panel
You get to choose the keywords that matter the most for your business (or the top 200 keywords the site ranks for) and WordLift will track the rankings on a daily basis across the entire site.
The Search Rankings widget on this blog
Under Search Rankings, you will find the entities that are driving the organic traffic on your website (the larger the tile, the more the concept stands out on Google).
This data helps you immediately see if you are connecting with the right audience. Instead of scanning hundreds (or even thousands) of combinations of keywords and pages, in one single treemap you can see which entities matter the most. Behind each concept, you might have one single page (i.e. the page for that entity) or hundreds of pages that you have annotated with that entity.
Here are just a few of the ways you can turn this data into action:
Click on an entity and you will see the list of pages behind it and, in the table below, the different types of keywords that this topic is intercepting. Go ahead and build new pages for this topic, or improve already existing content to match what users are searching for.
Here is a quick overview of the cluster behind “Named Entity Recognition” on our website.
Is the entity relevant at all? If it is, how many pages do you have on this topic? Would it be better to expand this cluster by writing more about it? Then go ahead and start creating fresh new content for it.
Click on the three dots in the bottom right corner and keep on exploring all the other concepts that are driving organic traffic to your site. The more you dig, the more you will discover what you are relevant for in the eyes of Google and your readers (higher levels in the treemap correspond to higher traffic volumes).
To calculate the size of each tile, WordLift uses an algorithm we created, similar to Google’s PageRank, to assess how relevant an entity is in terms of search rankings on your site.
With its Entityrank, WordLift takes into account how many pages have been annotated with that entity, how many other entities have been used to classify these pages, the search traffic the entity page is getting, and the search traffic each keyword is bringing to the cluster of all the annotated pages.
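WordLift has not published the exact Entityrank formula, so the sketch below only illustrates the idea: a weighted combination of the four signals just listed, with made-up weights.

```python
# Toy stand-in for an Entityrank-style score. The weights are invented
# for illustration and do NOT reflect WordLift's real formula.
def entity_score(annotated_pages, co_entities,
                 entity_page_traffic, cluster_keyword_traffic,
                 weights=(1.0, 0.5, 2.0, 1.5)):
    """Weighted sum of the signals used to size an entity's tile."""
    w1, w2, w3, w4 = weights
    return (w1 * annotated_pages             # pages annotated with the entity
            + w2 * co_entities               # co-occurring entities on those pages
            + w3 * entity_page_traffic       # search traffic of the entity page
            + w4 * cluster_keyword_traffic)  # keyword traffic of the whole cluster

print(entity_score(12, 30, 400, 900))  # → 2177.0
```

Whatever the real weighting, the point is that the tile size blends annotation breadth with actual search traffic rather than counting pages alone.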
This data is a real treasure chest to help you boost the ROI of your content. Is the content you are writing the same content that people find on Google? Which entities have a higher return for your site (e.g. for us, “Artificial Intelligence” as well as “SEO” are responsible for the activation of new trials)?
These and many other questions that you might have about your organic reach have an immediate answer with this widget. It’s time to find out – and to gain new insights in just a few clicks.
Making sense of data using AI is becoming crucial to our daily lives and has significantly shaped my professional career in the last 5 years.
When I began working on the Web, it was the mid-nineties and Amazon was still a bookseller with a primitive website.
At that time it became extremely clear that the world was about to change and every single aspect of our society in both cultural and economic terms was going to be radically transformed by the information society. I was in my twenties, eager to make a revolution and the Internet became my natural playground. I dropped out of school and worked day and night contributing to the Web of today.
Twenty years later, I am again witnessing a similar – if not even more radical – transformation of our society as we race toward the so-called AI transformation. This basically means applying machine learning, ontologies and knowledge graphs to optimize every process of our daily lives.
At the personal level, I am back in my twenties (sort of): I wake up at night to train a new model, to read the latest research paper on recurrent neural networks, or to test how deep learning can be used to perform tasks on knowledge graphs.
The beauty of it is that I have the same feeling of building the plane as we’re flying it that I had in the mid-nineties when I started with TCP/IP, HTML and websites!
Wevolver: an image I took at SXSW
AI transformation for search engine optimization
In practical terms, the AI transformation here at WordLift (our SEO startup) works this way: we look at how we help companies improve traffic coming from search engines, we analyze complex tasks and break them down into small chunks of work, and we try to automate them using narrow AI techniques (in some cases we simply tap the top of the AI pyramid and use ready-made APIs; in other cases we develop and train our own models). We tend to focus (in this phase at least) on trivial, repetitive tasks that can bring a concrete and measurable impact on the SEO of a website (i.e. more visits from Google, more engaged users, …).
We test these approaches on a selected number of terrific clients who literally fuel this process, and we keep iterating and improving the tooling we use until we feel ready to add it back into our product and make it available to hundreds of other users.
All along the journey, I’ve learned the following lessons:
1. The AI stack is constantly evolving
AI introduces a completely new paradigm: from teaching computers what to do, to providing the data required for computers to learn what to do.
In this pivotal change, we still lack the infrastructure required to address fundamental problems (i.e. How do I debug a model? How can I prevent or detect a bias in the system? How can I predict an event in a context in which the future is not a mere projection of the past?). This basically means that new programming languages will emerge and new stacks will be designed to address these issues right from the beginning. In this continually evolving scenario, libraries like TensorFlow Hub represent a concrete and valuable example of how the consumption of reusable parts in AI and machine learning can be achieved. This approach also greatly improves the accessibility of these technologies for a growing number of people outside the AI community.
2. Semantic data is king
AI depends on data, and any business that wants to implement AI inevitably ends up revamping and/or building a data pipeline: the way in which the data is sourced, collected, cleaned, processed, stored, secured and managed. In machine learning, we no longer use if-then-else rules to instruct the computer; instead, we let the computer learn the rules by providing a training set of data. This approach, while extremely effective, poses several issues, as there is no way to explain why a computer has learned a specific behavior from the training data. In Semantic AI, knowledge graphs are used to collect and manage the training data, and this allows us to check the consistency of this data and to understand, more easily, how the network is behaving and where we might have a margin for improvement. Real-world entities and the relationships between them are becoming essential building blocks in the third era of computing. Knowledge graphs are also great at “translating” insights and wisdom from domain experts into a computable form that machines can understand.
3. You need the help of subject-matter experts
Knowledge becomes a business asset when it is properly collected, encoded, enriched and managed. Any AI project you might have in mind starts with a domain expert providing the right keys to address the problem. In a way, AI is the most human-dependent technology of all time. For example, let’s say that you want to improve the SEO for the images on your website. You will start by looking at the best practices and direct experiences of professional SEOs who have been dealing with this issue for years. It is only through the analysis of the methods this expert community uses that you can tackle the problem and implement your AI strategy. Domain experts know, well in advance, what can be automated and what results to expect from that automation. A data analyst or an ML developer might think that we can train an LSTM network to write all the meta descriptions of a website on the fly. A domain expert would tell you that Google uses the meta description as the search snippet only about 33% of the time and that, unless these texts are revised by a qualified human, they will produce little or no result in terms of actual clicks (we can generate a decent summary with NLP and automatic text summarization, but enticing a click is a different challenge).
4. Always link data with other data
External data linked with internal data helps improve how the computer learns about the world you live in. Rarely does an organization control all the data that an ML algorithm needs to become useful and have a concrete business impact. By building on top of the Semantic Web and Linked Data, and by connecting internal with external data, we can help machines get smarter. When we started designing the new WordLift dashboard, whose goal is to help editors improve their editorial plan by looking at how their content ranks on Google, it immediately became clear that our entity-centric data would benefit from the query and ranking data gathered by our partner WooRank. The combination of these two pieces of information helped us create the basis for training an agent that will recommend to editors what to write and tell them whether they are connecting with the right audience over organic search.
To shape your AI strategy and improve both technical and organizational measures, we need to study the business requirements carefully with the support of a domain expert. Remember that narrow AI helps us build agentive systems that do things for end users (like, say, tagging images automatically or building a knowledge graph from your blog posts), as long as we always keep the user at the center of the process.
In this article, we explore how to evaluate the correspondence between title tags and the keywords that people use on Google to reach the content they need. We will share the results of the analysis (and the code behind it) using a TensorFlow model that encodes sentences into embedding vectors. The result is a list of titles that can be improved on your website.
“A title tag is an HTML element that defines the title of the page. Titles are one of the most important on-page factors for SEO. […]
They are used, combined with meta descriptions, by search engines to create the search snippet displayed in search results.”
Every search engine’s most fundamental goal is to match the searcher’s intent by analyzing the query and finding the best content on the web on that specific topic. In the quest for relevancy, a good title influences search engines only partially (it takes a lot more than matching the title with the keyword to rank on Google), but it does have an impact, especially on top ranking positions (1st and 2nd, according to a study conducted a few years ago by Cognitive SEO). This is also because a searcher is more inclined to click when they find a good semantic correspondence between the keyword used on Google and the title (along with the meta description) displayed in the search snippet of the SERP.
What is semantic similarity in text mining?
Semantic similarity defines the distance between terms (or documents) by analyzing their semantic meanings as opposed to looking at their syntactic form.
“Apple” and “apple” are syntactically the same word: if I compute the difference using an edit-distance algorithm like Levenshtein, the two senses look identical. On the other hand, by analyzing the context of the phrase in which the word apple is used, I can “read” its true semantic meaning and find out whether it refers to the world-famous tech company headquartered in Cupertino or to the sweet forbidden fruit of Adam and Eve.
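To see why syntactic comparison cannot disambiguate the two senses, here is a short, self-contained implementation of the classic Levenshtein edit distance (the standard dynamic-programming algorithm, not code from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Syntactically, the two senses of "apple" are indistinguishable:
print(levenshtein("apple", "apple"))  # 0 — identical strings
print(levenshtein("Apple", "apple"))  # 1 — they differ only by case
```

Whether "apple" means the company or the fruit, the edit distance is the same; only the surrounding context can tell them apart.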
A search engine like Google uses NLP and machine learning to find the right semantic match between the intent and the content. This means search engines are no longer looking at keywords as strings of text; they are reading the true meaning that each keyword has for the searcher. As SEOs and marketers, we can now use AI-powered tools to create the most authoritative content for a given query.
There are two main ways to compute the semantic similarity using NLP:
we can compute the distance between two terms using semantic graphs and ontologies by looking at the distance between the nodes (this is how our tool WordLift is capable of discerning whether apple – in a given sentence – is the company founded by Steve Jobs or the sweet fruit). A trivial but interesting example is to build a “semantic tree” (or, better, a directed graph) using the Wikidata P279 property (subclass of).
we can alternatively use a statistical approach and train a deep neural network to build, from a text corpus (a collection of documents), a vector space model that helps us transform terms into numbers and analyze their semantic similarity, as well as run other NLP tasks (e.g. classification).
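The first approach can be sketched in a few lines. Below, a toy "subclass of" graph (the class names are illustrative, not actual Wikidata P279 identifiers) is walked with a breadth-first search to measure how many hops separate two concepts:

```python
from collections import deque

# A toy "subclass of" (Wikidata P279-style) directed graph.
# The class names below are illustrative, not real Wikidata entities.
subclass_of = {
    "granny smith": ["apple (fruit)"],
    "apple (fruit)": ["fruit"],
    "fruit": ["food"],
    "apple inc.": ["technology company"],
    "technology company": ["company"],
}

def distance(start, goal):
    """Number of subclass-of hops from start up to goal (None if unreachable)."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for parent in subclass_of.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, d + 1))
    return None

print(distance("granny smith", "food"))  # 3
print(distance("apple inc.", "food"))    # None — the two senses never meet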
There is a crucial debate behind these two approaches. The essential question: is there a path by which our machines can possess any true understanding? Our best AI efforts, after all, only create an illusion of understanding. Both rule-based ontologies and statistical models are far from producing real thought as it is known in cognitive studies of the human brain. I am not going to expand on this here but, if you are in the mood, read this blog post on the Noam Chomsky / Peter Norvig debate.
Text embeddings in SEO
Word embeddings (or text embeddings) are an algebraic representation of words that allows words with similar meanings to have similar mathematical representations. A vector is an array of numbers of a given dimension; we calculate how close or distant two words are by measuring the distance between their vectors.
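The standard closeness measure is cosine similarity: the cosine of the angle between the two vectors. A self-contained sketch (the three-dimensional "embeddings" below are made-up toy values; real models use hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (illustrative values only):
king = [0.9, 0.7, 0.1]
queen = [0.8, 0.75, 0.15]
banana = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))   # close to 1.0: similar meanings
print(cosine_similarity(king, banana))  # much lower: unrelated meanings
```

A value near 1.0 means the vectors point in almost the same direction, i.e. the words are used in similar contexts.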
In this article, we’re going to extract embeddings using the tf.Hub Universal Sentence Encoder, a pre-trained deep neural network designed to convert text into high-dimensional vectors for natural language tasks. We want to analyze the semantic similarity between hundreds of combinations of titles and keywords from one of the clients of our SEO management services. We will focus our attention on only one keyword per URL, the keyword with the highest ranking (of course, we could also analyze multiple combinations). While a page might attract traffic on hundreds of keywords, we typically expect to see most of the traffic coming from the keyword with the highest position on Google.
We are going to start from the original code developed by the TensorFlow Hub team, and we are going to use Google Colab (a free cloud service with GPU support for working with machine learning). You can copy the code I worked on and run it on your own instance.
Our starting point is a CSV file containing Keyword, Position (the actual ranking on Google) and Title. You can generate this CSV from the GSC or use any keyword tracking tool like Woorank, MOZ or Semrush. You will need to upload the file to the session storage of Colab (there is an option you can click in the left tray) and you will need to update the file name on the line that starts with:
df = pd.read_csv( … )
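For reference, loading looks like the snippet below. In Colab you would pass the name of the uploaded file (e.g. `pd.read_csv("rankings.csv")` — the filename here is a placeholder); to keep this snippet self-contained, a couple of illustrative rows are inlined instead:

```python
import io

import pandas as pd

# Stand-in for the uploaded CSV: same three columns the article describes.
csv_data = io.StringIO(
    "Keyword,Position,Title\n"
    "villas in barbados,2,Luxury Villas | Example Rentals\n"
    "costa rica beach villa,5,Beachfront Villa Rentals | Example Rentals\n"
)
df = pd.read_csv(csv_data)
print(df.columns.tolist())  # ['Keyword', 'Position', 'Title']
print(len(df))              # 2
```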
Here is the output.
Let’s get into action. The pre-trained model comes in two flavors: one trained with a Transformer encoder and another trained with a Deep Averaging Network (DAN). The first is more accurate but has higher computational resource requirements. I used the Transformer, considering that I only worked with a few hundred combinations.
In the code below we instantiate the module, open the session (it takes some time, so the same session is reused for all the extractions), get the embeddings, compute the semantic similarity and store the results. I ran some tests in which I removed the site name from the titles; this helped me see things differently, but in the end I preferred to keep whatever a search engine would see.
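The shape of that pipeline can be sketched as follows. Note that `embed()` below is a toy, deterministic stand-in for the tf.Hub Universal Sentence Encoder (which returns a 512-dimensional vector per sentence and requires a TensorFlow session); everything else mirrors the real steps: embed keyword and title, normalize, take the inner product, store the score.

```python
import math
import random

def embed(text, dim=512):
    """Toy deterministic 'embedding' — NOT the real model, just a placeholder
    so the pipeline below is runnable without downloading the tf.Hub module."""
    rng = random.Random(text.lower())
    return [rng.uniform(-1, 1) for _ in range(dim)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity(text_a, text_b):
    """Inner product of the two normalized embedding vectors."""
    u, v = normalize(embed(text_a)), normalize(embed(text_b))
    return sum(a * b for a, b in zip(u, v))

# Illustrative keyword/title pair (placeholder data):
rows = [("villas in barbados", "Luxury Villas in Barbados | Example Rentals")]
results = [(kw, title, similarity(kw, title)) for kw, title in rows]
for kw, title, corr in results:
    print(f"{kw!r} vs {title!r}: Corr={corr:.3f}")
```

With the real encoder, the `Corr` scores are meaningful semantic similarities; with the toy `embed()` they are only there to show where each step fits.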
The semantic similarity – the degree to which the title and the keyword carry the same meaning – is calculated as the inner product of the two embedding vectors.
An interesting aspect of using word embeddings from this model is that – for English content – I can easily calculate the semantic similarity of both short and long text. This is particularly helpful when looking at a dataset that might contain very short keywords and very long titles.
The result is a table of keyword-title combinations, ranking between positions 1 and 5, that have the least semantic similarity (Corr).
It is interesting to see that, for this specific website, it can help to add the location to the title (e.g. Costa Rica, Anguilla, Barbados, …).
With well-structured data markup we are already helping the search engine disambiguate these terms by specifying the geographical location, but for the user making the search it might be beneficial to see, at a glance, the name of the location they are searching for in the search snippet. We can achieve this by revising the title or by bringing more structure into the search snippets using schema.org breadcrumbs to present the hierarchy of the places (e.g. Italy > Lake Como > …).
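For reference, breadcrumbs are expressed with the schema.org BreadcrumbList type. A minimal JSON-LD example for the geographic hierarchy above (the URLs are placeholders, not the client's actual pages):

```json
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Italy",
      "item": "https://example.com/italy/" },
    { "@type": "ListItem", "position": 2, "name": "Lake Como",
      "item": "https://example.com/italy/lake-como/" }
  ]
}
```

When Google picks up this markup, the search snippet can show the place hierarchy instead of a raw URL, surfacing the location even when it is missing from the title.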
In this scatter plot we can also see that, for this specific website, higher semantic similarity between titles and keywords correlates with higher rankings.
Semantic Similarity between keywords and titles visualized
Start running your semantic content audit
Crawling your website with natural language processing and machine learning to extract and analyze the main entities greatly helps you improve the findability of your content. Adding semantically rich structured data to your web pages helps search engines match your content with the right audience. Thanks to NLP and deep learning, I could see that, to reduce the gap between what people search for and the existing titles, it was important for this website to add the breadcrumbs markup with the geographical location of the villas. Once again AI, while still incapable of true understanding, helps us become more relevant for our audience (and it does so at web scale, on hundreds of web pages).
Solutions like the TF-Hub Universal Sentence Encoder put into the hands of SEO professionals and marketers the same AI machinery that modern search engines like Google use to compute the relevancy of content. Unfortunately, this specific model is limited to English only.
Are you ready to run your first semantic content audit?