SERP analysis is an essential step in the process of content optimization to outrank the competition on Google. In this blog post, I will share a new way to run SERP analysis using machine learning and a simple Python program that you can run on Google Colab.
Jump directly to the code: Google SERP Analysis using Natural Language Processing
SERP (Search Engine Result Page) analysis is part of keyword research and helps you understand whether the query that you identified is relevant for your business goals. More importantly, by analyzing how results are organized, we can understand how Google is interpreting a specific query.
What is the intention of the user making that search?
What search intent is Google associating with that particular query?
The investigative work required to analyze the top results provides an answer to these questions and guides us to improve (or create) the content that best fits the searcher.
While there is an abundance of keyword research tools that provide SERP analysis functionalities, my particular interest lies in understanding the semantic data layer that Google uses to rank results and what can be inferred using natural language understanding from the corpus of results behind a query. This might also shed some light on how Google does fact extraction and verification for its own knowledge graph starting from the content we write on webpages.
Falling down the rabbit hole
It all started when Jason Barnard and I started to chat about E-A-T and what techniques marketers could use to “read and visualize” Brand SERPs. Jason is a brilliant mind with a profound understanding of Google’s algorithms; he has been studying, tracking and analyzing Brand SERPs since 2013. While Brand SERPs are a category of their own, the process of interpreting search results remains the same whether you are comparing the personal brands of “Andrea Volpini” and “Jason Barnard” or analyzing the different shades of meaning between “making homemade pizza” and “make pizza at home”.
Hands-on with SERP analysis
In this pytude (a simple Python program, as Peter Norvig would call it), the plan goes as follows:
- we will crawl Google’s top (10-15-20) results and extract the text behind each webpage,
- we will look at the terms and the concepts of the corpus of text resulting from the download, parsing, and scraping of web page data (main body text) of all the results together,
- we will then compare two queries, “Jason Barnard” and “Andrea Volpini” in our example, and visualize the most frequent terms for each query within the same semantic space,
- after that we will focus on “Jason Barnard” in order to understand the terms that make the top 3 results unique from all the other results,
- then, using a sequence-to-sequence model, we will summarize all the top results for Jason in a featured-snippet-like text (this is indeed impressive),
- finally, we will build a question-answering model on top of the corpus of text related to “Jason Barnard” to see what facts we can extract from these pages that can extend or validate information in Google’s knowledge graph.
Text mining Google’s SERP
Our text data (Web corpus) is the result of two queries made on Google.com (you can change this parameter in the Notebook) and of the extraction of all the text behind the resulting webpages. Depending on the website, we might or might not be able to collect the text. The two queries I worked with are “Jason Barnard” and “Andrea Volpini”, but you can of course query whatever you like.
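Here is a minimal sketch of what one record of such a Web corpus could look like. The notebook relies on the Google Search library and Trafilatura for fetching and main-text extraction; the snippet below deliberately swaps in Python's standard `html.parser` as a crude stand-in, so it runs without network access or extra packages. The names `VisibleText`, `extract_text` and `build_corpus`, as well as the record fields, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping elements that rarely hold body copy."""

    SKIP = {"script", "style", "nav", "header", "footer", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string (rough stand-in for Trafilatura)."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_corpus(query, pages):
    """pages: list of (url, html) in ranking order; html is None when the download failed."""
    corpus = []
    for rank, (url, html) in enumerate(pages, start=1):
        text = extract_text(html) if html else ""
        if not text:
            continue  # some sites cannot be scraped; keep the original rank of the others
        corpus.append({"query": query, "rank": rank, "url": url, "text": text})
    return corpus
```

Keeping the rank on every record matters later on, because it lets you split the corpus by position without having to query Google again.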
One of the most crucial tasks in text mining, once the Web corpus has been created, is to present the data visually. Using natural language processing (NLP) we can explore these SERPs from different angles and levels of detail. Using Scattertext we’re immediately able to see which terms (from the combination of the two queries) differentiate the corpus from a general English corpus: in other words, the most characteristic keywords of the corpus.
Besides the names (volpini, jasonbarnard, cyberandy), you can see other relevant terms that characterize both Jason and myself. Boowa, a blue dog, and Kwala, a yellow koala, will guide us throughout this investigation, so let me first introduce them: they are two cartoon characters that Jason and his wife created back in the nineties. They are still prominent, as they appear in Jason’s Wikipedia article as part of his career as a cartoon maker.
Visualizing term associations in two Brand SERPs
In the scatter plot below we have on the y-axis the category “Jason Barnard” (our first query), and on the x-axis the category for “Andrea Volpini”. On the top right corner of the chart we can see the most frequent terms on both SERPs – the semantic junctions between Jason and myself according to Google.
Not surprisingly, there you will find terms like Google, Knowledge, Twitter and SEO. On the top left side we can spot Boowa and Kwala for Jason, and on the bottom right corner AI, WordLift and knowledge graph for myself.
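To make the geometry of the chart concrete, here is a simplified sketch that places each term by its relative frequency in the two corpora. Scattertext applies more refined scoring and a proper NLP tokenizer; the regular-expression tokenizer and the function names below are illustrative assumptions, not what the notebook runs.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens of two or more characters (a crude stand-in for spaCy)."""
    return re.findall(r"[a-z][a-z'-]+", text.lower())


def term_coordinates(docs_x, docs_y):
    """Map each term to (x, y): its share of all tokens in each corpus."""
    cx = Counter(t for d in docs_x for t in tokenize(d))
    cy = Counter(t for d in docs_y for t in tokenize(d))
    nx = sum(cx.values()) or 1
    ny = sum(cy.values()) or 1
    return {t: (cx[t] / nx, cy[t] / ny) for t in set(cx) | set(cy)}


def shared_terms(coords):
    """Terms present in both corpora, strongest overlap first (the top right region)."""
    both = [t for t, (x, y) in coords.items() if x > 0 and y > 0]
    return sorted(both, key=lambda t: -min(coords[t]))
```

Calling `term_coordinates(andrea_docs, jason_docs)` with the x-axis corpus first mirrors the chart: a term with `x == 0` sits on Jason's side only, a term with `y == 0` on Andrea's side only.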
To extract the entities we use spaCy and an extraordinary library by Jason Kessler called Scattertext.
Comparing the terms that make the top 3 results unique
When analyzing the SERP, our goal is to understand how Google is interpreting the intent of the user and what terms Google considers relevant for that query. To do so, in the experiment, we split the corpus of the results related to Jason between the content that ranks in positions 1, 2 and 3 and everything else.
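One way to sketch this split is to score each term by the smoothed log-ratio of its probability in the top 3 results versus the rest. This is a swapped-in, simplified measure (the notebook relies on Scattertext for the comparison), and the record fields `rank` and `text`, like the function names, are assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens of two or more characters."""
    return re.findall(r"[a-z][a-z'-]+", text.lower())


def top3_vs_rest(corpus, k=3, limit=10):
    """Return the terms that most strongly favour results ranked <= k."""
    top = Counter(t for r in corpus if r["rank"] <= k for t in tokens(r["text"]))
    rest = Counter(t for r in corpus if r["rank"] > k for t in tokens(r["text"]))
    vocab = set(top) | set(rest)
    # Add-one smoothing so terms missing from one group do not produce log(0).
    n_top = sum(top.values()) + len(vocab)
    n_rest = sum(rest.values()) + len(vocab)
    score = {
        t: math.log((top[t] + 1) / n_top) - math.log((rest[t] + 1) / n_rest)
        for t in vocab
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(vocab, key=lambda t: (-score[t], t))[:limit]
```

Terms with a high score are frequent in positions 1 to 3 and rare elsewhere, which is a reasonable first approximation of "what makes the top results unique".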
Summarizing Google’s Search Results
When creating well-optimized content, professional SEOs analyze the top results in order to understand the search intent and to get an overview of the competition. As Gianluca Fiorelli, whom I personally admire a lot, would say, it is vital to look at it directly.
Since we now have the web corpus of all the results, I decided to let the AI do the hard work of “reading” all the content related to Jason and creating an easy-to-read summary. I’ve experimented quite a lot lately with both extractive and abstractive summarization techniques, and I found that, when dealing with a heterogeneous multi-genre corpus like the one we get from scraping web results, BART (a sequence-to-sequence text model) does an excellent job of understanding the text and generating abstractive summaries (for English).
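A practical detail when summarizing a whole SERP is that the combined text is far longer than a model like BART accepts in one pass. The sketch below shows one possible way around it: summarize chunk by chunk, then summarize the partial summaries. This map-and-reduce approach, the 400-word chunk size and the function names are assumptions, not necessarily what the notebook does. The `summarize` argument is any callable that turns text into a shorter text.

```python
def chunk_words(text, max_words=400):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_corpus(texts, summarize, max_words=400):
    """Summarize every chunk, then summarize the joined partial summaries."""
    chunks = [c for t in texts for c in chunk_words(t, max_words)]
    if not chunks:
        return ""
    partial = [summarize(c) for c in chunks]
    if len(partial) == 1:
        return partial[0]
    return summarize(" ".join(partial))


# With Hugging Face transformers installed, `summarize` could wrap a BART pipeline:
#   from transformers import pipeline
#   bart = pipeline("summarization", model="facebook/bart-large-cnn")
#   summarize = lambda text: bart(text, truncation=True)[0]["summary_text"]
```

Passing the model in as a callable keeps the chunking logic easy to test with a dummy function before spending GPU time on the real model.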
Let’s see it in action on Jason’s results. Here is where the fun begins. Since I was working with Jason Barnard, a.k.a. the Brand SERP Guy, Jason was able to update his own Brand SERP as if Google was his own CMS 😜 and we could immediately see from the summary how these changes were impacting what Google was indexing.
Below is the transition from Jason the marketer, musician and cartoon maker to Jason the full-time digital marketer.
Can we reverse-engineer Google’s answer box?
As Jason and I were progressing with the experiment, I also decided to see how close a question-answering system running Google’s pre-trained BERT models could get to Google’s answer box for the Jason-related question below.
Quite impressively, since the web corpus was indeed the same one that Google uses, I could get exactly the same result.
This is interesting as it tells us that we can use question-answering systems to validate if the content that we’re producing responds to the question that we’re targeting.
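A hedged sketch of that validation step: ask the same question of every page in the corpus, keep the most confident answer, and check whether it contains the answer you are targeting. The helper names and the `min_score` threshold are assumptions; `qa` is any callable that accepts `question` and `context` and returns a dictionary with `answer` and `score`, which is the shape returned by the Hugging Face question-answering pipeline.

```python
def best_answer(question, passages, qa):
    """Run the QA callable on every non-empty passage and keep the top-scoring answer."""
    results = [qa(question=question, context=p) for p in passages if p.strip()]
    if not results:
        return None
    return max(results, key=lambda r: r["score"])


def answers_target(question, expected, passages, qa, min_score=0.5):
    """True if the best extracted answer is confident enough and contains `expected`."""
    best = best_answer(question, passages, qa)
    if best is None or best["score"] < min_score:
        return False
    return expected.lower() in best["answer"].lower()


# With Hugging Face transformers installed, `qa` could be:
#   from transformers import pipeline
#   qa = pipeline("question-answering")
```

If `answers_target` returns `False` for a question you care about, that is a hint that none of the pages states the answer clearly enough for an extractive model to pick it up.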
Lessons we learned
We can produce semantically organized knowledge from raw unstructured content much like a modern search engine would do. By reverse engineering the semantic extraction layer using NER from Google’s top results we can “see” the unique terms that make web documents stand out on a given query.
We can also analyze the evolution over time and space (the same query in a different region can have a different set of results) of these terms.
While with keyword research tools we always see a ‘static’ representation of the SERP, by running our own analysis pipeline we realize that these results are constantly changing as new content surfaces in the index and as Google’s neural mind improves its understanding of the world and of the person making the query.
By comparing different queries we can find common aspects and unique traits that can help inform the content strategy (and the content model behind the strategy).
Are you ready to run your first SERP Analysis using Natural Language Processing?
Get in contact with our SEO management service team now!
None of this would have happened without Jason’s challenge of “visualizing” E-A-T and Brand SERPs, and this work is dedicated to him and to the wonderful community of marketers, SEOs, clients and partners that are supporting WordLift. A big thank you also goes to the open-source technologies used in this experiment:
- Google Search library by Mario Vilas
- Trafilatura by Adrien Barbaresi
- Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to preprint: arxiv.org/abs/1703.00565
- BERT with pre-trained models for question answering by the Google Research Team, made easy to use by @HuggingFace 🤗
- BART developed by Lewis et al. 2019 and made easy to use by @HuggingFace 🤗