SEO automation is the process of using software to optimize a website’s performance programmatically. This article focuses on what you can do with the help of artificial intelligence to improve the SEO of your website. Let’s first remove the elephant in the room: SEO is not a solved problem (yet), and while we, as toolmakers, struggle to alleviate the work of web editors on one side while facilitating the job of search engines on the other, SEO automation is still a continually evolving field, and yes, a consistent amount of tasks can be fully automated, but no, the entire SEO workflow is still way too complicated to be entirely automated. There is more to this: Google is a giant AI, and adding AI to our workflow can help us interact at a deeper level with Google’s giant brain. We see this a lot with structured data; the more we publish structured information about our content, the more Google can improve its results and connect with our audience.
When it comes to search engine optimization, we are typically overwhelmed by the amount of manual work that we need to do to ensure that our website ranks well in search engines. So, let’s have a closer look at the workflow to see where SEO automation can be a good fit..
Technical SEO: Analysis of the website’s technical factors that impact its rankings, focusing on website speed, UX (Web Vitals), mobile response, and structured data.
On-Page SEO: Title Tag, Meta Descriptions, and Headings.
Automation: Here is where AI/deep learning brings value. We can train language models specifically designed for any content optimization task (i.e., creating meta descriptions or, as shown here by @hamletBatista, title tag optimization). We can also use natural language processing (like we do with WordLift) to improve our pages’ structured data markup 🤩.
Off-page SEO: Here, the typical task would be creating and improving backlinks.
SEO strategy: Traffic pattern analysis, A/B testing, and future predictions.
Automation: here also we can use machine learning for time series forecasting. A good starting point is this blog post by @JR Oaks. We can use machine learning models to predict future trends and highlight the topics for which a website is most likely to succeed. Here we would typically see a good fit with Facebook’s library Prophet or Google’s Causal Impact analysis.
Will Artificial Intelligence Solve SEO?
AI effectively can help us across the entire SEO optimization workflow. Some areas are, though, based on my personal experience, more rewarding than others. Still, again – there is no one size fix all and, depending on the characteristics of your website, the success recipe might be different. Here is what I see most rewarding across various verticals.
Structured data is one of these areas in SEO where automation realistically delivers a scalable and measurable impact on your website’s traffic. Google is also focusing more and more on structured data to drive new features on its result pages. Thanks to this, it is getting simpler to drive additional organic traffic and calculate the investment return.
Finding new untapped content ideas with the help of AI
Here at WordLift, we have developed our tool for intent discovery that helps our clients gather ideas using Google’s suggestions. The tool ranks queries by combining search volume, keyword competitiveness and if you are already using WordLift, your knowledge graph. This comes in handy as it helps you understand if you are already covering that specific topic with your existing content or not. Having existing content on a given topic might help you create a more engaging experience for your readers.
We give early access to our upcoming tools and features to a selected number of clients. Do you want to join our VIP Program?
Here is where I expect to see the broadest adoption of AI by marketers and content writers worldwide. With a rapidly growing community of enthusiasts, it is evident that AI will be a vital part of content generation. New tools are coming up to make life easier for content writers, and here are a few examples to help you understand how AI can improve your publishing workflow.
Creating SEO-Driven Article Outlines
We can train autoregressive language models such as GPT-3 that use deep learning to produce human-like text. Creating a full article is possible, but the results might not be what you would expect. Here is an excellent overview by Ben Dickson that demystifies AI in the context of content writing and helps us understand its limitations.
There is still so much that we can do to help writing be more playful and cost-effective. One of the areas where we’re currently experimenting is content outlining. Writing useful outlines helps us structure our thoughts, dictates our articles’ flow, and is crucial in SEO (a good structure will help readers and search engines understand what you are trying to say). Here is an example of what can be done in this area.
I provide a topic such as “SEO automation” and I get the following outline proposals:
What is automation in SEO?
How it is used?
How it is different from other commonly used SEO techniques?
You still have to write the best content piece on the Internet to rank, but using a similar approach can help you structure ideas faster.
Crafting good page titles for SEO
Creating a great title for SEO boils down to:
helping you rank for a query (or search intent);
entice the user to click through to your page from the search results
It’s a magical skill that marketers acquire with time and experience. And yes, this is the right task for SEO automation as we can infuse the machine with learning samples by looking at the best titles on our website. Here is one more example that we’re working on: a trained model that can come up with great title suggestions given a few topics. Let’s try it out. Here I am adding two topics: SEO automation and AI (quite obviously).
The result is valuable, and most importantly, the model is stochastic, so if we try the same combination of topics multiple times each time, the model generates a new title.
Generating meta descriptions that work
Also, we can unleash deep learning and craft the right snippet for our pages or at least provide the editor with a first-draft to start with for meta description. Here is an example of an abstractive summary for this blog post.
Creating FAQ content on scale
The creation of FAQ content can be partially automated by analyzing popular questions from Google and Bing and providing a first draft response using deep learning techniques. Here is the answer that I can generate for “Is SEO important in 2021?”
DISCLAIMER: these tools are not yet part of WordLift but are being tested with a selected number of clients. Do you want to join our VIP Program to automate your SEO? Drop us an email.
How Does SEO Automation Work?
Here is how you can proceed when approaching SEO automation. It is always about finding the right data, identifying the strategy, and running A/B tests to prove your hypothesis before going live on thousands of web pages.
It is also essential to distinguish between:
Deterministic output – where I know what to expect and
Stochastic output – where the machine might generate a different variation every time, and we will need to keep a final human validation step.
I believe that the future of SEO automation and the contribution of machine/deep learning to digital marketing is now. SEOs have been automating their tasks for a while now, but SEO automation tools using AI are just starting to take off and significantly improve traffic and revenues.
Are you Interested in trying out WordLift Content Intelligence solutions to scale up your content production? Book a meetingwith one of our experts or drop us a line.
The image we used in this blog post is a series of fantastical objects generated by OpenAI’s DALL·E new model by combining two unrelated ideas (clock and mango).
In the last two decades, text summarization has played an essential role in search engine optimization (SEO). There are, indeed, a lot of different marketing techniques that require a summary of the content, and that can improve ranking in search engines. Meta descriptions are probably among the most notable examples (here is a video tutorial that Andrea did on generating meta descriptions).
These text snippets provide search engines with a brief description of the page content and are still an important ranking factor and one of the most common use cases for text summarization.
Thanks to the latest NLP technologies, SEO specialists can finally summarize the content of entire webpages using algorithms that craft easy-to-read summaries.
In this article, we will discuss the importance of using text summarization in the context of SEO and digital marketing.
Summaries help create and structure the metadata that describes web pages. Text summarization also comes in handy when we want to add descriptive text to category pages for magazines and eCommerce websites or when we need to prepare the copy for promoting our latest article to Facebook or Twitter. Much like search engines use meta descriptions, social networks rely on their meta descriptors like the Facebook Open Graph meta tag (a.k.a. OG tag) to distribute content to their users. Facebook for instance, uses the summary provided in OG tags to create the card that promotes a blog post on mobile and desktop devices.
Extractive vs Abstractive
There are many different text summarization approaches, and they vary depending on the number of input documents, purpose, and output. But, they all fall into one of two categories: extractive or abstractive.
Extractive Text Summarization
Extractive summarization algorithms identify essential sections of a text and generate verbatim to produce a subset of the sentences from the original input.
Extractive summaries are reliable because they will not change the meaning of any sentence. They are generally easier to program. It’s very logical, and in the most straightforward implementations, the most common words in the source text are the words that represent the main topic. Using today’s pre-trained Transformer models with their ground-breaking performance, we can achieve excellent results with the extractive approach.
In WordLift, for instance, BERT is used to boost the performance of extractive summarization across different languages. Here is the summary that WordLift creates for me for this article that you are reading.
In the last two decades, text summarization has played an essential role in search engine optimization (SEO). While our existing BERT-based summarization API performs well in German, we wanted to create unique content instead of only shrinking the existing text.
It is quite useful in summarizing, using the most important sentences, the content that I am writing here, and it is formally correct (or at least as valid as my writing) but not unique.
Abstractive Text Summarization
Abstractive methodologies summarize texts differently, using deep neural networks to interpret, examine, and generate new content (summary), including essential concepts from the source.
Abstractive approaches are more complicated: you will need to train a neural network that understands the content and rewrites it.
In general, training a language model to build abstract summaries is challenging and tends to be more complicated than using the extractive approach. Abstractive summarization might fail to preserve the meaning of the original text and generalizes less than extractive summarization.
As humans, when we try to summarize a lengthy document, we first read it entirely very carefully to develop a better understanding; secondly, we write highlights for its main points. As computers lack most of our human knowledge, they still struggle to generate summaries’ human-level quality.
Moreover, using abstractive approaches also poses the challenge of supporting multilingualism. The model needs to be trained for each language separately.
Training our new model for German using Google’s T5
As part of the WordLift NG project and, on behalf of one of our German-speaking clients, we ventured into creating a new pre-trained language model for automatic text summarization in German. While our existing BERT-based summarization API performs well in German, we wanted to create unique content instead of only shrinking the existing text.
This new language model is revolutionary as we can re-use the same model for different NLP tasks, including summarization. T5 is also language-agnostic, and we can use it with any language.
We successfully trained the new summarizer on a dataset of 100,000 texts together with reference summaries extracted from the German Wikipedia. Here is a result where we see the input summary that was provided along with the full text of the Wikipedia page on Leopold Wilhelm and the predicted summary generated by the model.
Conclusions and future work
We are very excited about this new line of work and we will continue experimenting new ways to help editors, SEOs and website owners improve their publishing workflows with the help of NLP, knowledge graphs and deep learning!
WordLift provides websites with all the AI you need to grow your traffic — whether you want to increase leads, accelerate sales on your e-commerce or build a powerful website. Let’s talk with our experts to find out more!
Let’s start with the end. In the experiment I am sharing today we measured the impact of a specific improvement on the structured data of a website that references 500+ Local Business (more specifically the site promotes Lodging Business such as hotels and villas for rent). Before diving into the solution; let’s have a look at the results that we obtained using a Causal Impact analysis. If you are a marketing person or an SEO you constantly struggle to measure the impact of your actions in the most precise and irrefutable way; Casual Impact, a methodology originally introduced by Google, helps you exactly with this. It’s a statistical analysis that builds a Bayesian structural time series model that helps you isolate the impact of a single change being made on a digital platform.
In a week, after improving the existing markup, we could see a positive increase of +5.09% of clicks coming from Google Search – this improvement is statistically relevant, unlikely to be due to random fluctuations and the probability of obtaining this effect by chance is very small 🔥🔥
We did two major improvements to the markup of these local businesses:
Improve the quality of NAP (Name, Address and Phone number) by reconciling the entities with entities in Google My Business (viia Google Maps APIs) and by making sure we had the same data Google has or better;
Adding, for all the reconciled entities, the hasMap property with a direct link to the Google CID Number (Customer ID Number), this is an important identifier that business owners and webmasters should know – it helps Google match entities found by crawling structured data with entities in GMB.
Google My Business is indeed the simplest and most effective way for a local business to enter the Google Knowledge Graph. If your site operates in the travel sector or provides users with immediate access to hundreds of local businesses, what should you do to market your pages using schema markup against a fierce competition made of the business themselves or large brands such as booking.com and tripadvisors.com?
How can you be more relevant for both travelers abroad searching for their dream holiday in another country and for locals trying to escape from large urban areas?
The approach, in most of our projects, is the same regardless of the vertical we work for: knowledge completion and entity reconciliation; these reallyare two essential building blocks of our SEO strategy.
By providing more precise information in the form of structured linked data we are helping search engines find the searchers we’re looking for, at the best time of their customer journey.
Another important aspect is that, while we’re keen on automating SEO (and data curation in general), we understand the importance of the continuous feedback loop between humans and machines: domain experts need to be able to validate the output and to correct any inaccurate predictions that the machine might produce.
There is no way out – tools like WordLift needs to facilitate the process and web scale it but they cannot replace human knowledge and human validation (not yet at least).
LocalBusiness markup works for different types of businesses from a retail shop to a luxury hotel or a shopping center and it comes with sub-types (here is the full list of the different variants from the schema.org website).
All the sub-types, when it comes to SEO and Google in particular, shall contain the following set of information:
Name, Address and Phone number (and here consistency plays a big role and we want to ensure that the same entity on Yelp shows the same data on Apple Maps, Google, Bing and all the other directories that clients might use)
Reference to the official website (this becomes particularly relevant if the publisher does not coincide with the business owner)
Reference to the Google My Business entity (the 5% lift – we have seen above is indeed related to this specific piece of information) using the hasMap property
Location data (and here, as you might image, we can do a lot more than just adding the address as a string of text)
In order to improve the markup and to add the hasMap property on hundreds of pages we’ve added a new functionality in WordLift’s WordPress plugin (that also works already for non-WordPress websites) that helps editors:
Trigger the reconciliation using Google Maps APIs
Review/Approve the suggestions
Improve structured data markup for Local Business
From the screen below the editor can either “Accept” or “Discard” the provided suggestions.
WordLift reconciles an entity with a loose match with the name of the business, the address and/or the phone number.
Adding location markup using containedInPlace/containsPlace and linked data
As seen in the json-ld above we have added – in a previous iteration (and independently from the testing that was done this time) two important properties:
the inverse-property containsPlace (on the pages related to villages and regions) to help search engines clearly understand the location of the local businesses.
This data is also very helpful to compose the breadcrumbs as it will help the searcher understand and confirm the location of a business. Most of us, still make searches like “WordLift, Rome” to find a local business and more likely we will click on results where we can confirm that – yes, WordLift office is indeed located in Italy > Lazio > Rome.
To extract this information along with the sameAs links to Wikidata and GeoNames (one of the largest geographical databases with more than 11 million locations) we used our linked data stack and an extension called WordLift Geo to automatically populate the knowledge graph and the JSON-LD with the containedInPlace and containsPlace properties.
We have seen a +5.09% increase in clicks (after only one week) on pages where we added the hasMap property and improved the consistency of NAP (business name, address and phone number) on a travel website listing over 500+ local businesses
We did this by interfacing the Google Maps Places APIs and by providing suggestions for the editor to validate/reject the suggestions
Using containedInPlace/containsPlace is also a good way to improve the structured data of a local business and you should do this by adding also sameAs links to Wikidata and/or GeoNames to facilitate disambiguation
As most of the searches for local businesses (at least in travel) are in the form of “[business name][location where the business is located]”; we have seen in the past an increased in the CTR when schema Breadcrumb use this information from containedInPlace/containsPlace (see below 👇)
One key aspect in SEO, if you are a local business (or deal with local business), is to have the correct location listed in Google Maps and link your website with Google My Business. The best way to do that is to properly markup your Google Map URL using schema markup.
What is the hasMap property and how should we use it? In 2014 (schema v 1.7) the hasMap property was introduced to link a web page of a place with the URL of a map. In order to facilitate the link between a web page and the corresponding entity on Google Maps we can use the following snippet in the JSON-LD “hasMap”: “https://maps.google.com/maps?cid=YOURCIDNUMBER”
What is the Google CID number? Google customer ID (CID) is a unique number used to identify a Google Ads account. This number can be used to link a website with the corresponding entity in Google My Business.
How can I find the Google CID number using Google Maps? Search the business in Google Maps using the business nameView the source code (use view-source: followed by the url in your browser)Click CTRL+F and search the source code for “ludocid”The CID will be the string of numbers after “ludocid\\u003d” and before #lrd. You can alternatively use this Chrome extension.
SERP analysis is an essential step in the process of content optimization to outrank the competition on Google. In this blog post I will share a new way to run SERP analysis using machine learning and a simple python program that you can run on Google Colab.
SERP (Search Engine Result Page) analysis is part of keyword research and helps you understand if the query that you identified is relevant for your business goals. More importantly by analyzing how results are organized we can understand how Google is interpreting a specific query.
What is the intention of the usermaking that search?
What search intent Google is associating with that particular query?
The investigative work required to analyze the top results provide an answer to these questions and guide us to improve (or create) the content that best fit the searcher.
While there is an abundance of keyword research tools that provide SERP analysis functionalities, my particular interest lies in understanding the semantic data layer that Google uses to rank results and what can be inferred using natural language understanding from the corpus of results behind a query. This might also shed some light on how Google does fact extraction and verification for its own knowledge graph starting from the content we write on webpages.
Falling down the rabbit hole
It all started when Jason Barnard and I started to chat about E-A-T and what technique marketers could use to “read and visualize” Brand SERPs. Jason is a brilliant mind and has a profound understanding of Google’s algorithms, he has been studying, tracking and analyzing Brand SERPs since 2013. While Brand SERPs are a category on their own the process of interpreting search results remains the same whether you are comparing the personal brands of “Andrea Volpini” and “Jason Barnard” or analyzing the different shades of meaning between “making homemade pizza” and “make pizza at home”.
Hands-on with SERP analysis
In this pytude (simple python program) as Peter Norvig would call it, the plan goes as follow:
we will crawl Google’s top (10-15-20) results and extract the text behind each webpage,
we will look at the terms and the concepts of the corpus of text resulting from the download, parsing, and scraping of web page data (main body text) of all the results together,
we will then compare two queries “Jason Barnard” and “Andrea Volpini” in our example and we will visualize the most frequent terms for each query within the same semantic space,
After that we will focus on “Jason Barnard” in order to understand the terms that make the top 3 results unique from all the other results,
Finally using a sequence-to-sequence model we will summarize all the top results for Jason in a featured snippet like text (this is indeed impressive),
At last we will build a question-answering modelon top of the corpus of text related to “Jason Barnard” to see what facts we can extract from these pages that can extend or validate information in Google’s knowledge graph.
Text mining Google’s SERP
Our text data (Web corpus) is the result of two queries made on Google.com (you can change this parameter in the Notebook) and of the extraction of all the text behind these webpages. Depending on the website we might or might not be able to collect the text. The two queries I worked with are “Jason Barnard” and “Andrea Volpini” but you can query of course whatever you like.
One of the most crucial work, once the Web corpus has been created, in the text mining field is to present data visually. Using natural language processing (NLP) we can explore these SERPs from different angles and levels of detail. Using Scattertext we’re immediately able to see what terms (from the combination of the two queries) differentiate the corpus from a general English corpus. What are, in other words, the most characteristic keywords of the corpus.
And you can see here besides the names (volpini, jasonbarnard, cyberandy) other relevant terms that characterize both Jason and myself. Boowa a blue dog and Kwala a yellow koala will guide us throughout this investigation so let me first introduce them: they are two cartoon characters that Jason and his wife created back in the nineties. They are still prominent as they appear on Jason’s article on a Wikipedia as part of his career as cartoon maker.
Visualizing term associations in two Brand SERPs
In the scatter plot below we have on the y-axis the category “Jason Barnard” (our first query), and on the x-axis the category for “Andrea Volpini”. On the top right corner of the chart we can see the most frequent terms on both SERPs – the semantic junctions between Jason and myself according to Google.
Not surprisingly there you will find terms like: Google, Knowledge, Twitter and SEO. On the top left side we can spot Boowa and Kwala for Jason and on the bottom right corner AI, WordLift and knowledge graph for myself.
Comparing the terms that make the top 3 results unique
When analyzing the SERP our goal is to understand how Google is interpreting the intent of the user and what terms Google considers relevant for that query. To do so, in the experiment, we split the corpus of the results related to Jason between the content that ranks in position 1, 2 and 3 and everything else.
Summarizing Google’s Search Results
When creating well-optimized content professional SEOs analyze the top results in order to analyze the search intent and to get an overview of the competition. As Gianluca Fiorelli, whom I personally admire a lot, would say; it is vital to look at it directly.
Since we now have the web corpus of all the results I decided to let the AI do the hard work in order to “read” all the content related to Jason and to create an easy to read summary. I’ve experimented quite a lot lately with both extractive and abstractive summarization techniques and I found that, when dealing with an heterogeneous multi-genre corpus like the one we get from scraping web results, BART (a sequence-to-sequence text model) does an excellent job in understanding the text and generating abstractive summaries (for English).
Let’s it in action on Jason’s results. Here is where the fun begins. Since I was working with Jason Barnard a.k.a the Brand SERP Guy, Jason was able to update his own Brand SERP as if Google was his own CMS 😜and we could immediately see from the summary how these changes where impacting what Google was indexing.
Here below the transition from Jason marketer, musicians and cartoon maker to Jason full-time digital marketer.
Can we reverse-engineer Google’s answer box?
As Jason and I were progressing with the experiment I also decided to see how close a Question Answering System running Google , pre-trained models of BERT, could get to Google’s answer box for the Jason-related question below.
Quite impressively, as the web corpus was indeed, the same that Google uses, I could get exactly the same result.
This is interesting as it tells us that we can use question-answering systems to validate if the content that we’re producing responds to the question that we’re targeting.
Ready to transform your marketing strategy with AI?Let's talk!
Lesson we learned
We can produce semantically organized knowledge from raw unstructured content much like a modern search engine would do. By reverse engineering the semantic extraction layer using NER from Google’s top results we can “see” the unique terms that make web documents stand out on a given query.
We can also analyze the evolution over time and space (the same query in a different region can have a different set of results) ofthese terms.
While with keyword research tools we always see a ‘static’ representation of the SERP by running our own analysis pipeline we realize that these results are constantly changing as new content surfaces the index and as Google’s neural mind improves its understanding of the world and of the person making the query.
By comparing different queries we can find aspects in common and uniqueness that can help us inform the content strategy (and the content model behind the strategy).
Are you ready to run your first SERP Analysis using Natural Language Processing?
All of this wouldn’t happen without Jason’s challenge of “visualizing” E-A-T and brand serps and this work is dedicated to him and to the wonderful community of marketers, SEOs, clients and partners that are supporting WordLift. A big thank you also goes to the open-source technologies used in this experiment:
The Bidirectional Encoder Representations from Transformers (BERT) is an AI developed by Google as a means to help machines understand language in a manner more similar to how humans understand language. Specifically, it’s pre-trained, unsupervised natural language processing (NLP) model that seeks to understand the nuances and context of human language.
It was released as an open-source program by Google in 2018 but had an official launch in November 2019. It is now being used in Google searches in all languages, globally and impacts featured snippets.
What is BERT used for?
BERT is primarily used to provide better query results by using its understanding of language nuance to deliver more useful results. This goes not only for standard snippets, but for featured snippets as well. It’s said that it will impact at least 1 out of every 10 search results going forward.
When BERT uses its understanding of nuance in language, it can understand a user’s intentions through connecting words, such as: and, but, to, from, with, etc. So rather than utilizing only keywords, BERT can understand a user’s query request by examining words like “and” or “verses” in delivering SERP results.
An example of how BERT uses NLP to distinguish a user’s search intent.
In an example provided by Google, if you search for “parking on a hill with no curb,” you would get SERP results and a featured snippet detailing what you need to do if you’re parking a vehicle on a hill where there is no curb. Thanks to BERT’s NLP, Google knows that the word “no” means that there is no curb, whereas previously, if you searched for the same query, you would’ve received results on parking on a hill WITH a curb because your query included the keyword “curb” but Google didn’t understand the significance of the word no.
What is BERTSUM?
BERTSUM is a variant of BERT that is used for extractive summarization of content. Essentially, BERTSUM can be used to extract summaries of web pages and content for several different web pages and sites. This has been known to be particularly useful when writing meta descriptions for hundreds or even thousands of webpages on a site, rather than having to write each one individually.
BERT’s effect on RankBrain
RankBrain, being Google’s first AI used to understand queries, has been used to understand queries and content since 2015. While it shares some things in common with BERT, they do not perform the same functions and BERT has not replaced RankBrain. RankBrain can do things like, understand what a user is looking for even if they misspelled it or used incorrect grammar whereas BERT seeks to understand the nuances of the language used in a search query.
Therefore, while they both share a lot in common and both perform NLP functions for the Google SERP, they are not the same.
One of the most fascinating features of deep neural networks applied to NLP is that, provided with enough examples of human language, they can generate text and help us discover many of the subtle variations in meanings. In a recent blog post by Google research scientist Brian Strope and engineering director Ray Kurzweil we read:
“The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc.”
Following this hierarchical structure, new computational language models, aim at simplifying the way we communicate and have silently entered our daily lives; from Gmail “Smart Reply” feature to the keyboard in our smartphones, recurrent neural network, and character-word level prediction using LSTM (Long Short Term Memory) have paved the way for a new generation of agentive applications.
From keyword research to keyword generation
As usual with my AI-powered SEO experiments, I started with a concrete use-case. One of our strongest publishers in the tech sector was asking us new unexplored search intents to invest on with articles and how to guides. Search marketers, copywriters and SEOs, in the last 20 years have been scouting for the right keyword to connect with their audience. While there is a large number of available tools for doing keyword research I thought, wouldn’t it be better if our client could have a smart auto-complete to generate any number of keywords in their semantic domain, instead than keyword data generated by us?The way a search intent (or query) can be generated, I also thought, is also quite similar to the way a title could be suggested during the editing phase of an article. And titles (or SEO titles), with a trained language model that takes into account what people search, could help us find the audience we’re looking for in a simpler way.
What makes an RNNs “more intelligent” when compared to feed-forward networks, is that rather than working on a fixed number of steps they compute sequences of vectors. They are not limited to process only the current input, but also everything that they have perceived previously in time.
This characteristic makes them particularly efficient in processing human language (a sequence of letters, words, sentences, and paragraphs) as well as music (a sequence of notes, measures, and phrases) or videos (a sequence of images).
Here above you can see the difference between a recurrent neural network and a feed-forward neural network. Basically, RNNs have a short-memory that allow them to store the information processed by the previous layers. The hidden state is looped back as part of the input. LSTMs are an extension of RNNs whose goal is to “prolong” or “extend” this internal memory – hence allowing them to remember previous words, previous sentences or any other value from the beginning of a long sequence.
The LSTM cell where each gate works like a perceptron.
Imagine a long article where I explained that I am Italian at the beginning of it and then this information is followed by other let’s say 2.000 words. An LSTM is designed in such a way that it can “recall” that piece of information while processing the last sentence of the article and use it to infer, for example, that I speak Italian. A common LSTM cell is made of an input gate, an output gate and a forget gate. The cell remembers values over a time interval and the three gates regulate the flow of information into and out of the cell much like a mini neural network. In this way, LSTMs can overcome the vanishing gradient problem of traditional RNNs.
If you want to learn more in-depth on the mathematics behind recurrent neural networks and LSTMs, go ahead and read this article by Christopher Olah.
Let’s get started: “Io sono un compleanno!”
After reading Andrej Karpathy’s blog post I found a terrific Python library called textgenrnn by Max Woolf. This library is developed on top of TensorFlow and makes it super easy to experiment with Recurrent Neural Network for text generation.
Before looking at generating keywords for our client I decided to learn text generation and how to tune the hyperparameters in textgenrnn by doing a few experiments.
AI is interdisciplinary by definition, the goal of every project is to bridge the gap between computer science and human intelligence.
I started my tests by throwing in the process a large text file in English that I found on Peter Norvig’s website (https://norvig.com/big.txt) and I end up, thanks to the help of Priscilla (a clever content writer collaborating with us), “resurrecting” David Foster Wallace with its monumental Infinite Jest (provided in Italian from Priscilla’s ebook library and spiced up with some of her random writings).
At the beginning of the training process – in a character by character configuration – you can see exactly what the network sees: a nonsensical sequence of characters that few epochs (training iteration cycles) after will transform into proper words.
As I became more accustomed to the training process I was able to generate the following phrase:
“Io sono un compleanno. Io non voglio temere niente? Come no, ancora per Lenz.”
“I’m a birthday. I don’t want to fear anything? And, of course, still for Lenz.”
David Foster Wallace
Unquestionably a great piece of literature ?that gave me the confidence to move ahead in creating a smart keyword suggest tool for our tech magazine.
The dataset used to train the model
As soon as I was confident enough to get things working (this means basically being able to find a configuration that – with the given dataset – could produce a language model with a loss value equal or below 1.0), I asked Doreid, our SEO expert to work on WooRank’s API and to prepare a list of 100.000 search queries that could be relevant for the website.
To scale up the number we began by querying Wikidata to get a list of software for Windows that our readers might be interested to read about. As for any ML, project data is the most strategic asset. So while we want to be able to generate never-seen-before queries we also want to train the machine on something that is unquestionably good from the start.
The best way to connect words to concepts is to define a context for these words. In our specific use case, the context is primarily represented by software applications that run on the Microsoft Windows operating system. We began by slicing the Wikidata graph with a simple query that provided us with the list of 3.780+ software apps that runs on Windows and 470+ related software categories. By expanding this list of keywords and categories, Doreid came up with a CSV file containing the training dataset for our generator.
The first rows in the training dataset.
After several iterations, I was able to define the top performing configuration by applying the values below. I moved from character-level to word-level and this greatly increased the speed of the training. As you can see I have 6 layers with 128 cells on each layer and I am running the training for 100 epochs. This is indeed limited, depending on the size of the dataset, by the fact that Google Colab after 4 hours of training stops the session (this is also a gentle reminder that it might be the right time to move from Google Colab to Cloud Datalab – the paid version in Google Cloud).
Here we see the initial keywords being generated while training the model
Rock & Roll, the fun part
After a few hours of training, the model was ready to generate our never-seen-before search intents with a simple python script containing the following lines.
Here a few examples of generated queries:
where to find google drive downloads
where to find my bookmarks on google chrome
how to change your turn on google chrome
how to remove invalid server certificate error in google chrome
how to delete a google account from chrome
how to remove google chrome from windows 8 mode
how to completely remove google chrome from windows 7
how do i remove google chrome from my laptop
You can play with temperatures to improve the creativity of the results or provide a prefix to indicate the first words of the keyword that you might have in mind and let the generator figure out the rest.
Takeaways and future work
“Smart Reply” suggestions can be applied to keyword research workand is worth assessing in a systematic way the quality of these suggestions in terms of:
validity – is this meaningful or not? Does it make sense for a human?
relevance – is this query really hitting on the target audience the website has? Or is it off-topic? and
impact – is this keyword well-balanced in terms of competitiveness and volume considering the website we are working for?
The initial results are promising, all of the initial 200+ generated queries were different from the ones in the training set and, by increasing the temperature, we could explore new angles on an existing topic (i.e. “where is area 51 on google earth?”) or even evaluate new topics (ie. “how to watch android photos in Dropbox” or “advertising plugin for google chrome”).
It would be simply terrific to implement – with a Generative Adversarial Network (or using Reinforcement Learning) – a way to help the generator produce only valuable keywords (keywords that – given the website – are valid, relevant and impactful in terms of competitiveness and reach). Once again, it is crucial to define the right mix of keywords we need to train our model (can we source them from a graph as we did in this case? shall we only use the top ranking keywords from our best competitors? Should we mainly focus on long tail, conversational queries and leave out the rest?).
One thing that emerged very clearly is that: experiments like this one (combining LSTMs and data sourcing using public knowledge graphs such as Wikidata) are a great way to shed some light on how Google might be working in improving the evaluation of search queries using neural nets. What is now called “Neural Matching” might most probably be just a sexy PR expression but, behind the recently announced capability of analyzing long documents and evaluating search queries, it is fair to expect that Google is using RNNs architectures, contextual word embeddings, and semantic similarity. As deep learning and AI, in general, becomes more accessible (frameworks are open source and there is a healthy open knowledge sharing in the ML/DL community) it becomes evident that Google leads the industry with the amount of data they have access to and the computational resources they control.
This experiment would not have been possible without textgenrnn by Max Woolf and TensorFlow. I am also deeply thankful to all of our VIP clients engaging in our SEO management services, our terrific VIP team: Laura, Doreid, Nevine and everyone else constantly “lifting” our startup, Theodora Petkova for challenging my robotic mind ?and my beautiful family for sustaining my work.