What is Semantic Search, and how does it work?

If you work in SEO, you have been reading about Google and Bing becoming semantic search engines. But what does Semantic Search really mean for users, and how does it work under the hood?

Semantic Search helps you surface the most relevant results for your users based on search intent and not just keywords.

Semantic (or Neural) Search uses state-of-the-art deep learning models to provide contextual and relevant results to user queries. When we use semantic search, we can better understand the intent behind our customers' queries and provide significantly improved search results that can drive deeper customer engagement. This can be essential in many different sectors, but here at WordLift we are particularly interested in applying these technologies to travel brands, e-commerce and online publishers.

Information is often unstructured and scattered across different silos. With semantic search, our goal is to use machine learning techniques to make sense of content and to create a context. When moving from syntax (for example how often a term appears on a webpage) to semantics, we have to create a layer of metadata that can help machines grasp the concepts behind each word. Google calls this ability to connect words to concepts “Neural Matching”, or *super synonyms*, which help better match user queries with web content. Technically speaking, this is achieved by using neural embeddings that transform words (or other types of content like images, video or audio clips) into fuzzier representations of the underlying concepts.
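To make the idea of matching by embeddings more concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the captions are made up for illustration (a real model would produce vectors with hundreds of dimensions), and the query vector is simply assumed to come from the same hypothetical encoder.

```python
import math

# Hypothetical, hand-written embeddings: a real encoder would compute these.
EMBEDDINGS = {
    "lake at sunset": [0.9, 0.1, 0.2],
    "alpine hut in winter": [0.1, 0.9, 0.3],
    "city market": [0.2, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Return the top_k document keys ranked by similarity to the query."""
    scored = [(cosine_similarity(query_vector, vec), doc) for doc, vec in index.items()]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Assume an encoder mapped the query "evening by the water" to this vector.
query = [0.8, 0.2, 0.1]
results = search(query, EMBEDDINGS)
print(results)
```

The query shares no keyword with the caption it matches best; the ranking depends only on how close the vectors are.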

As part of the R&D work that we’re doing in the context of the EU co-funded project called WordLift Next Generation, I have built a prototype using an emerging open-source framework called Jina AI and the beautiful photographic material published by SalzburgerLand Tourismus (also a partner in the Eurostars research project) and Österreich Werbung 🇦🇹 (Austrian National Tourist Office).

I have created this first prototype:

  • ☝️ to understand how modern search engines work;
  • ✌️ to re-use the same #SEO data that @wordliftit publishes as structured *linked* data for internal search.

How does Semantic Search work?

Bringing structure to information is what WordLift does: it analyzes textual information using NLP and named entity recognition and, now, also images using deep learning models.

With semantic search, these capabilities are combined to let users find exactly what they need naturally.

In Jina, Flows are high-level concepts that define a sequence of steps to accomplish a task. Indexing and querying are two separate Flows; inside each flow, we run parallel Pods to analyze the content. A Pod is a basic processing unit in a Flow that can run as a dockerized application.

This is strategic as it allows us to distribute the load efficiently. In this demo, Pods are programmed to create neural embeddings: one Pod processes text and one processes images. Pods can also run in parallel, and the results (embeddings from the caption and embeddings from the image) are combined into one single document.

This ability to work with different content types is called multi-modality.

The user can use text in the query to retrieve an image or, vice versa, use an image in the query to retrieve its description.
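The indexing step described above can be sketched in plain Python. This is not the Jina API: the two encoder functions are toy stand-ins for the text and image Pods (real Pods would wrap deep learning models), and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_text(caption):
    # Toy stand-in for a text-embedding Pod: counts of each vowel.
    lowered = caption.lower()
    return [float(lowered.count(ch)) for ch in "aeiou"]


def encode_image(pixels):
    # Toy stand-in for an image-embedding Pod: mean of each RGB channel.
    n = len(pixels)
    return [sum(p[c] for p in pixels) / n for c in range(3)]


def index_document(doc_id, caption, pixels):
    # Run both encoders in parallel, then merge the results into one document.
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(encode_text, caption)
        image_future = pool.submit(encode_image, pixels)
        return {
            "id": doc_id,
            "caption": caption,
            "text_embedding": text_future.result(),
            "image_embedding": image_future.result(),
        }


doc = index_document("img-1", "Lake at sunset", [(255, 120, 0), (245, 100, 10)])
print(doc)
```

Because the document carries both embeddings, a text query can be compared against `text_embedding` and an image query against `image_embedding`, and either one leads back to the same document.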

See the example below: I run a search using natural language at the beginning and, right after, I send an image (from the results of the first search) as a query to find its description 👇

Are you ready to innovate your content marketing strategy with AI? Let’s talk!

What is Jina AI?

Han Xiao, Jina AI’s CEO, calls Jina the “TensorFlow” for search 🤩. Besides the fact that I love this definition, Jina is completely open source and designed to help you build neural (or semantic) search on the cloud. Believe me, it is truly impressive. To learn more about Jina, watch Han’s latest video on YouTube: “Jina 101: Basic concepts in Jina”.

How can we optimize content for Semantic Search?

Here is what I learned from this experiment:

  1. When creating content, we shall focus on concepts (also referred to as entities) and search intents rather than keywords. An entity is a broader concept that groups different queries. The search intent (or user intent) is the user’s goal when making the query to the search engine. This intent can be expressed using different queries. The search engines interpret and disambiguate the meaning behind these queries by using the metadata that we provide.
  2. Information Architecture shall be designed once we understand the search intent. We are used to thinking in terms of 1 page = 1 keyword, but in reality, as we transition from keywords to entities (or concepts), we can cover the same topic across multiple documents. After crawling the pages, the search engine will work with a holistic representation of our content even when it has been written across various pages (or even different media types).
  3. Adding structured data for text, images, and videos adds precious data points that will be taken into account by the search engine. The more we provide high-quality metadata, the more we help the semantic search engine improve the matching between content and user intent.
  4. Becoming an entity in Google’s Knowledge Graph also greatly helps Google understand who we are and what we write about. It can have an immediate impact across multiple queries that refer to the entity. Read this post to learn more about how to create an entity in Google’s graph.
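As an illustration of points 1 and 3, the snippet below builds a small schema.org JSON-LD object that links an article to the entity it is about. It is only a sketch: the URLs and the entity are placeholders, and the helper name is hypothetical.

```python
import json


def article_jsonld(headline, url, image_url, entity_name, entity_same_as):
    """Describe an article and the concept (entity) it covers."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "image": {"@type": "ImageObject", "url": image_url},
        # The "about" property connects the page to a concept, not a keyword.
        "about": {"@type": "Thing", "name": entity_name, "sameAs": entity_same_as},
    }


markup = article_jsonld(
    headline="What is Semantic Search, and how does it work?",
    url="https://example.com/blog/semantic-search/",  # placeholder URL
    image_url="https://example.com/images/cover.jpg",  # placeholder URL
    entity_name="Semantic search",
    entity_same_as="https://example.org/entity/semantic-search",  # placeholder
)
print(json.dumps(markup, indent=2))
```

The resulting JSON is what would typically be embedded in the page inside a `<script type="application/ld+json">` element.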

Working with semantic search engines like Google and Bing requires an update of your content strategy and a deep understanding of the principles of Semantic SEO and machine learning.

Text Summarization in SEO with the Help of AI

In the last two decades, text summarization has played an essential role in search engine optimization (SEO). There are, indeed, a lot of different marketing techniques that require a summary of the content, and that can improve ranking in search engines. Meta descriptions are probably among the most notable examples (here is a video tutorial that Andrea did on generating meta descriptions).

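These text snippets provide search engines with a brief description of the page content. Google has said that it does not use them as a direct ranking signal, but they still shape how a result is presented in the SERP, and they remain one of the most common use cases for text summarization.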

Thanks to the latest NLP technologies, SEO specialists can finally summarize the content of entire webpages using algorithms that craft easy-to-read summaries.

In this article, we will discuss the importance of using text summarization in the context of SEO and digital marketing.

Summaries help create and structure the metadata that describes web pages. Text summarization also comes in handy when we want to add descriptive text to category pages for magazines and eCommerce websites or when we need to prepare the copy for promoting our latest article to Facebook or Twitter. Much like search engines use meta descriptions, social networks rely on their meta descriptors like the Facebook Open Graph meta tag (a.k.a. OG tag) to distribute content to their users. Facebook for instance, uses the summary provided in OG tags to create the card that promotes a blog post on mobile and desktop devices.
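As a small sketch of how a generated summary could end up in these tags, the helper below truncates the summary at a word boundary and escapes it for HTML. The 155-character limit is a common rule of thumb rather than an official figure, and the function names are hypothetical.

```python
from html import escape


def truncate(summary, limit=155):
    """Cut the summary at a word boundary and add an ellipsis if needed."""
    if len(summary) <= limit:
        return summary
    cut = summary[:limit].rsplit(" ", 1)[0]
    return cut.rstrip(",.;: ") + "…"


def meta_tags(title, summary):
    """Render the meta description and the matching Open Graph tags."""
    description = escape(truncate(summary), quote=True)
    safe_title = escape(title, quote=True)
    return "\n".join([
        f'<meta name="description" content="{description}">',
        f'<meta property="og:title" content="{safe_title}">',
        f'<meta property="og:description" content="{description}">',
    ])


tags = meta_tags(
    "Text Summarization in SEO with the Help of AI",
    "Summaries help create and structure the metadata that describes web pages.",
)
print(tags)
```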

Extractive vs Abstractive

There are many different text summarization approaches, and they vary depending on the number of input documents, purpose, and output. But, they all fall into one of two categories: extractive or abstractive.

Extractive summarization algorithms identify the essential sections of a text and reproduce them verbatim, so the summary is a subset of the sentences from the original input.

Extractive summaries are reliable because they will not change the meaning of any sentence, and they are generally easier to program. The logic is simple: in the most straightforward implementations, the most common words in the source text are taken to represent the main topic, and the sentences that contain them are kept. Using today’s pre-trained Transformer models with their ground-breaking performance, we can achieve excellent results with the extractive approach.
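The most straightforward implementation mentioned above can be sketched in a few lines of Python. This is the naive word-frequency approach, not the BERT-based summarizer used in WordLift; the stop-word list is deliberately tiny and only there for illustration.

```python
import re
from collections import Counter

# A very small, illustrative stop-word list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was"}


def tokenize(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore the original order
    return " ".join(sentences[i] for i in keep)


sample = ("Summaries help search engines. "
          "Summaries help social networks too. "
          "My cat sleeps.")
print(summarize(sample, max_sentences=1))
```

Every sentence in the output is copied verbatim from the input, which is exactly why extractive summaries are reliable but not unique.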

In WordLift, for instance, BERT is used to boost the performance of extractive summarization across different languages. Here is the summary that WordLift creates for me for this article that you are reading.

In the last two decades, text summarization has played an essential role in search engine optimization (SEO). While our existing BERT-based summarization API performs well in German, we wanted to create unique content instead of only shrinking the existing text.

It is quite useful in summarizing, using the most important sentences, the content that I am writing here, and it is formally correct (or at least as valid as my writing) but not unique.

Using Transformers in extractive text summarization.

Abstractive Text Summarization

Abstractive methodologies summarize texts differently, using deep neural networks to interpret, examine, and generate new content (summary), including essential concepts from the source.

Abstractive approaches are more complicated: you will need to train a neural network that understands the content and rewrites it.

In general, training a language model to build abstract summaries is challenging and tends to be more complicated than using the extractive approach. Abstractive summarization might fail to preserve the meaning of the original text and generalizes less than extractive summarization.     

As humans, when we try to summarize a lengthy document, we first read it entirely and very carefully to develop a better understanding; then we write highlights for its main points. As computers lack most of our human knowledge, they still struggle to generate summaries of human-level quality.

Moreover, using abstractive approaches also poses the challenge of supporting multilingualism. The model needs to be trained for each language separately.

Training our new model for German using Google’s T5

The T5 text-to-text framework, pre-trained on a multi-task mixture of unsupervised and supervised NLP tasks.

As part of the WordLift NG project and, on behalf of one of our German-speaking clients, we ventured into creating a new pre-trained language model for automatic text summarization in German. While our existing BERT-based summarization API performs well in German, we wanted to create unique content instead of only shrinking the existing text. 

T5, “the Text-To-Text Transfer Transformer Model”, is Google’s state-of-the-art language model, which was proposed earlier this year in the paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”.

This new language model is revolutionary as we can re-use the same model for different NLP tasks, including summarization. T5 is also language-agnostic, and we can use it with any language.
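What makes this re-use possible is that every task is cast as "text in, text out", with a short prefix telling the model which task to perform. The sketch below shows how training pairs could be prepared in that format; the prefixes follow the ones used in the T5 paper, while the dictionary keys and the helper name are assumptions made for this example.

```python
def to_text_to_text(task_prefix, source, target):
    """Cast a supervised example into T5's text-to-text format."""
    return {
        "input_text": f"{task_prefix}: {source}",
        "target_text": target,
    }


examples = [
    # Summarization: the full text goes in, the reference summary comes out.
    to_text_to_text("summarize", "<full article text>", "<reference summary>"),
    # Translation uses the same model and the same format, only the prefix changes.
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
]

for example in examples:
    print(example)
```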

We successfully trained the new summarizer on a dataset of 100,000 texts together with reference summaries extracted from the German Wikipedia. Here is a result where we see the input summary that was provided along with the full text of the Wikipedia page on Leopold Wilhelm and the predicted summary generated by the model.

Conclusions and future work

We are very excited about this new line of work and we will continue experimenting with new ways to help editors, SEOs and website owners improve their publishing workflows with the help of NLP, knowledge graphs and deep learning!

WordLift provides websites with all the AI you need to grow your traffic — whether you want to increase leads, accelerate sales on your e-commerce or build a powerful website. Let’s talk with our expert to find out more!

How to turn your content model into a powerful marketing and SEO weapon

Providing your website with a structured content model is not only the best solution to better organize your content, but also a powerful strategy to improve the SEO of your website and increase organic traffic.

In this article, we’ll explain how the WordLift entity-based model, coupled with the new feature WL Mappings, will allow you to add a more specific markup to your content and to obtain a Knowledge Graph capable of communicating to search engines in a more effective way.

Why is the content model becoming a key tool for SEO?

During his webinar on content modeling for SEO in the WordLift Academy, Cruce Saunders highlights some of the main features that make the content model an indispensable tool for managing and enhancing online content.

In fact, the content model:

  • Specifies how information is organized on your website
  • Makes content more visible to search engines
  • Allows you to reuse content through different channels

A structured content model, in short, not only allows you to better organize data and information but to do it within a malleable structure capable of communicating:

  • to search engines through the use of structured data
  • to users through the enhancement of the user experience and the possibility of reusing the content by presenting it in different layouts, both on the site and in the SERP, to respond to specific search intents (the same article, for example, may appear in the form of a snippet, a blue link, a promotion, etc.)

Structuring your content model means creating a three-dimensional identity capable of highlighting your content and the relationships underlying it. This allows search engines to recognize you among hundreds of pieces of information, making you more visible to users who correspond to the search intent related to your business.

The more the content model is rich in structured data, the more chances you’ll have to meet exactly the users interested in you. That’s why we created WordLift Mappings, a new feature that allows you to select the information and connections that are truly relevant for your business and to create an increasingly specific and refined Knowledge Graph to highlight only the most relevant facets that make your identity more authoritative. 

Through our entity-based model and WordLift Mappings, your content model becomes a powerful SEO weapon and a valuable resource to increase the value of your online data.

WordLift Mappings helps you create a custom Knowledge Graph and increase your online authority

WordLift creates a personalized and highly performing Knowledge Graph through the schema.org markup and the creation of a customized entity-based vocabulary containing the most relevant data to help Google better understand your online content.

Remember that Google uses entities to satisfy users’ search intent and allow them to find the best results. For this reason, an increasingly refined entity-based model such as WordLift is key to increase the visibility of your content for search engines.

WordLift Mappings increases the accuracy of this process and allows you to take greater care of your content and your Knowledge Graph.

Advanced Custom Fields for Schema.org: how the new WordLift extension works

By connecting to ACF (Advanced Custom Fields), a WordPress plugin that allows you to create advanced fields to specify the attributes that characterize your content, WordLift Mappings allows you to structure your data starting from the fields that you have already configured with ACF or from new fields based on the schema.org taxonomy.

This means having more and more structured content, which can be used to add relevant details to your Knowledge Graph and shaped in different configurations to improve the user experience.
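Conceptually, a mapping of this kind translates the custom fields of a post into schema.org properties. The sketch below shows the idea in Python; it is not how WordLift Mappings is implemented, and the field names in the mapping table are hypothetical.

```python
# Hypothetical mapping from custom field names to schema.org properties.
FIELD_TO_PROPERTY = {
    "event_name": "name",
    "event_start": "startDate",
    "event_venue": "location",
}


def map_fields(schema_type, fields, mapping=FIELD_TO_PROPERTY):
    """Build a schema.org entity from the custom fields that have a mapping."""
    entity = {"@context": "https://schema.org", "@type": schema_type}
    for field, value in fields.items():
        prop = mapping.get(field)
        # Skip fields without a mapping and fields left empty by the editor.
        if prop and value not in (None, ""):
            entity[prop] = value
    return entity


event = map_fields("Event", {
    "event_name": "Podcast recording",
    "event_start": "2020-05-01",
    "event_venue": "",                   # empty, so it is left out
    "internal_note": "not for markup",   # unmapped, so it is ignored
})
print(event)
```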

In this webinar, Andrea Volpini and Jason Barnard explain how they used WordLift Mappings to improve Jason’s content model and Knowledge Graph. Jason shows how he obtains a Knowledge Graph that structures only the most relevant data, creating an authoritative and coherent Brand SERP that stands out in the search results.

With over 100 podcasts made in collaboration with some of the greatest SEO experts on the planet, Kalicube – Jason Barnard‘s website – has an enviable wealth of content, relevant to the entire digital marketing sector. Thanks to WordLift Mappings, we helped Jason structure this content by following a model that focuses on content, events and people. Thus, each guest of Jason’s podcast has his own page connected with the podcasts in which he participated and with the events in which the podcasts were recorded.

In this way, the architecture of the Knowledge Graph is customized on the basis of the content model and the “network” of the links between the contents becomes the bearer of meanings and allows you to predict further connections. Below you can see the entity-based content model applied to Kalicube through WordLift Mappings.

What can you realistically expect in terms of traffic? How long does it take?

In recent weeks, our SEO team has implemented a new content model on the site of an American customer who deals with the dismantling and re-evaluation of used hardware on a large scale.

The results? After the first week, traffic increased by 14.6% and the growth curve does not seem to stop. To analyze the impact while isolating other factors that may influence the SEO of the site, we developed a predictive model based on Bayesian networks. By analyzing the traffic in the month preceding the introduction of WordLift Mappings, the model allowed us to estimate the benefit net of other ranking factors (this is called causal inference analysis).

Here we see the real clicks in black and, in blue, the traffic we would have had (that is, the traffic predicted by the mathematical model); the following chart shows the difference between the real traffic and the estimated traffic, and the last one the delta of the increase. In this way, we can be confident that the increase is statistically significant and related to the introduction of the new content model.
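The logic behind these charts can be sketched with a much simpler model than the Bayesian one we used: fit a trend on the period before the change, project it forward as the "what would have happened" baseline, and compare it with the real clicks. The linear trend below is a deliberately naive stand-in and the numbers are invented for the example; they are not the customer's data.

```python
def fit_trend(values):
    """Ordinary least squares for y = intercept + slope * t, with t = 0..n-1."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def estimate_uplift(pre_clicks, post_clicks):
    """Compare real post-change clicks with the projected baseline."""
    intercept, slope = fit_trend(pre_clicks)
    start = len(pre_clicks)
    predicted = [intercept + slope * (start + i) for i in range(len(post_clicks))]
    pointwise = [actual - expected for actual, expected in zip(post_clicks, predicted)]
    relative = sum(pointwise) / sum(predicted)
    return predicted, pointwise, relative


# Invented daily clicks before and after the change.
pre = [100, 102, 104, 106]
post = [120, 123]
predicted, pointwise, relative = estimate_uplift(pre, post)
print(predicted, pointwise, round(relative, 3))
```

A proper causal inference analysis also reports the uncertainty around the baseline, which is what allows us to say whether the difference is statistically significant; this sketch only computes the point estimate.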

Data source: Google Search Console

In summary, WordLift Mappings allows you to:

  • Build a Custom Knowledge Graph based on your content model
  • Improve the SEO of your website through structured data
  • Shape the structure of your content to improve the user experience
  • Reuse chunks of content through different configurations to respond to different search intents
  • Enhance any type of content composed of reusable elements (articles, courses, events, How-Tos etc.)

The implementation of a custom Knowledge Graph through WordLift Mappings has a positive and measurable impact on traffic.

SERP Analysis with the help of AI

SERP analysis is an essential step in the process of content optimization to outrank the competition on Google. In this blog post I will share a new way to run SERP analysis using machine learning and a simple Python program that you can run on Google Colab.

Jump directly to the code: Google SERP Analysis using Natural Language Processing

SERP (Search Engine Result Page) analysis is part of keyword research and helps you understand if the query that you identified is relevant for your business goals. More importantly, by analyzing how results are organized, we can understand how Google is interpreting a specific query.

What is the intention of the user making that search?

What search intent is Google associating with that particular query?

The investigative work required to analyze the top results provides an answer to these questions and guides us to improve (or create) the content that best fits the searcher.

While there is an abundance of keyword research tools that provide SERP analysis functionalities, my particular interest lies in understanding the semantic data layer that Google uses to rank results and what can be inferred using natural language understanding from the corpus of results behind a query. This might also shed some light on how Google does fact extraction and verification for its own knowledge graph starting from the content we write on webpages. 

Falling down the rabbit hole

It all started when Jason Barnard and I started to chat about E-A-T and what techniques marketers could use to “read and visualize” Brand SERPs. Jason is a brilliant mind and has a profound understanding of Google’s algorithms; he has been studying, tracking and analyzing Brand SERPs since 2013. While Brand SERPs are a category of their own, the process of interpreting search results remains the same whether you are comparing the personal brands of “Andrea Volpini” and “Jason Barnard” or analyzing the different shades of meaning between “making homemade pizza” and “make pizza at home”.

Hands-on with SERP analysis

In this pytude (simple Python program), as Peter Norvig would call it, the plan goes as follows:

  • we will crawl Google’s top (10-15-20) results and extract the text behind each webpage;
  • we will look at the terms and the concepts of the corpus of text resulting from the download, parsing, and scraping of web page data (main body text) of all the results together;
  • we will then compare two queries, “Jason Barnard” and “Andrea Volpini” in our example, and visualize the most frequent terms for each query within the same semantic space;
  • after that, we will focus on “Jason Barnard” in order to understand the terms that make the top 3 results unique from all the other results;
  • then, using a sequence-to-sequence model, we will summarize all the top results for Jason in a featured-snippet-like text (this is indeed impressive);
  • finally, we will build a question-answering model on top of the corpus of text related to “Jason Barnard” to see what facts we can extract from these pages that can extend or validate information in Google’s knowledge graph.

Text mining Google’s SERP

Our text data (Web corpus) is the result of two queries made on Google.com (you can change this parameter in the Notebook) and of the extraction of all the text behind these webpages. Depending on the website, we might or might not be able to collect the text. The two queries I worked with are “Jason Barnard” and “Andrea Volpini”, but you can of course query whatever you like.
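To give an idea of what extracting the text behind a webpage involves, here is a minimal sketch that uses only Python's standard library. It is not the code from the notebook: it works on an HTML string you have already downloaded, and the list of tags to skip is an assumption that will not suit every site.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text, ignoring scripts, styles and page chrome."""

    SKIP = {"script", "style", "noscript", "nav", "footer", "header"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p{}</style></head><body><nav>Menu</nav>"
        "<p>Hello <b>world</b></p><script>var x=1;</script></body></html>")
print(extract_text(page))
```

Running this on every result of a query and concatenating the outputs gives the Web corpus used in the rest of the analysis.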

Once the Web corpus has been created, one of the most crucial tasks in text mining is to present data visually. Using natural language processing (NLP) we can explore these SERPs from different angles and levels of detail. Using Scattertext we’re immediately able to see what terms (from the combination of the two queries) differentiate the corpus from a general English corpus. What are, in other words, the most characteristic keywords of the corpus?

The most characteristic terms in the corpus.

And you can see here, besides the names (volpini, jasonbarnard, cyberandy), other relevant terms that characterize both Jason and myself. Boowa, a blue dog, and Kwala, a yellow koala, will guide us throughout this investigation, so let me first introduce them: they are two cartoon characters that Jason and his wife created back in the nineties. They are still prominent, as they appear in Jason’s Wikipedia article as part of his career as a cartoon maker.

Boowa and Kwala

Visualizing term associations in two Brand SERPs

In the scatter plot below we have on the y-axis the category “Jason Barnard” (our first query), and on the x-axis the category for “Andrea Volpini”. On the top right corner of the chart we can see the most frequent terms on both SERPs: the semantic junctions between Jason and myself according to Google.

Not surprisingly there you will find terms like: Google, Knowledge, Twitter and SEO. On the top left side we can spot Boowa and Kwala for Jason and on the bottom right corner AI, WordLift and knowledge graph for myself.  
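The reasoning behind the chart can be reproduced with a few lines of Python: count how often each term appears in each of the two corpora and use the two relative frequencies as coordinates. This is a simplified sketch, not what Scattertext does internally, and the example texts are short placeholders rather than the real SERP content.

```python
import re
from collections import Counter


def term_counts(documents):
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


def scatter_points(docs_y, docs_x):
    """Map each term to (x, y): its relative frequency in each corpus."""
    counts_y, counts_x = term_counts(docs_y), term_counts(docs_x)
    total_y = sum(counts_y.values()) or 1
    total_x = sum(counts_x.values()) or 1
    return {
        term: (counts_x[term] / total_x, counts_y[term] / total_y)
        for term in set(counts_y) | set(counts_x)
    }


def shared_terms(points):
    """Terms present in both corpora: the 'semantic junctions'."""
    return sorted(t for t, (x, y) in points.items() if x > 0 and y > 0)


# Placeholder texts standing in for the two scraped SERPs.
jason_docs = ["Boowa and Kwala SEO", "Google SEO podcast"]
andrea_docs = ["WordLift AI SEO", "Google knowledge graph"]

points = scatter_points(jason_docs, andrea_docs)
print(shared_terms(points))
```

Terms with a zero x value sit along the y-axis (distinctive for the first query), terms with a zero y value sit along the x-axis, and the shared terms fall in between.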

To extract the entities we use spaCy and an extraordinary library by Jason Kessler called Scattertext.

Visualizing the terms related to “Jason Barnard” (y-axis) and “Andrea Volpini” (x-axis). The visualization is interactive and allows us to zoom on a specific term like “seo”. Try it.

Comparing the terms that make the top 3 results unique

When analyzing the SERP our goal is to understand how Google is interpreting the intent of the user and what terms Google considers relevant for that query. To do so, in the experiment, we split the corpus of the results related to Jason between the content that ranks in position 1, 2 and 3 and everything else.

On the top the terms extracted from the top 3 results and below everything else. Open the chart on a separate tab from here.
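One simple way to find the terms that set the top 3 results apart is to compare smoothed term frequencies between the two groups. The sketch below uses a log-ratio with add-one smoothing; it is an illustrative scoring choice, not necessarily the one behind the chart, and the texts are placeholders.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def distinctive_terms(ranked_texts, top_n=3, limit=5):
    """Rank terms by how over-represented they are in the top_n results."""
    top = Counter(t for text in ranked_texts[:top_n] for t in tokenize(text))
    rest = Counter(t for text in ranked_texts[top_n:] for t in tokenize(text))
    vocab = set(top) | set(rest)
    # Add-one smoothing so terms missing from one group do not divide by zero.
    top_total = sum(top.values()) + len(vocab)
    rest_total = sum(rest.values()) + len(vocab)

    def log_ratio(term):
        return (math.log((top[term] + 1) / top_total)
                - math.log((rest[term] + 1) / rest_total))

    ranked = sorted(top, key=lambda t: (-log_ratio(t), t))
    return ranked[:limit]


# Placeholder texts, ordered by ranking position.
ranked_texts = [
    "brand serp guy",
    "brand serp podcast",
    "brand serp expert",
    "cartoon music",
    "music band",
]
print(distinctive_terms(ranked_texts, limit=2))
```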

Summarizing Google’s Search Results

When creating well-optimized content, professional SEOs study the top results in order to analyze the search intent and to get an overview of the competition. As Gianluca Fiorelli, whom I personally admire a lot, would say: it is vital to look at it directly.

Since we now have the web corpus of all the results, I decided to let the AI do the hard work in order to “read” all the content related to Jason and to create an easy-to-read summary. I’ve experimented quite a lot lately with both extractive and abstractive summarization techniques and I found that, when dealing with a heterogeneous multi-genre corpus like the one we get from scraping web results, BART (a sequence-to-sequence text model) does an excellent job in understanding the text and generating abstractive summaries (for English).

Let’s see it in action on Jason’s results. Here is where the fun begins. Since I was working with Jason Barnard, a.k.a. the Brand SERP Guy, Jason was able to update his own Brand SERP as if Google were his own CMS, and we could immediately see from the summary how these changes were impacting what Google was indexing.

Here below is the transition from Jason the marketer, musician and cartoon maker to Jason the full-time digital marketer.

Can we reverse-engineer Google’s answer box?

As Jason and I were progressing with the experiment, I also decided to see how close a question-answering system running Google’s pre-trained BERT models could get to Google’s answer box for the Jason-related question below.

Quite impressively, as the web corpus was indeed the same that Google uses, I could get exactly the same result.

A fine-tuning task on SQuAD for the corpus of results for “Jason Barnard”

This is interesting as it tells us that we can use question-answering systems to validate if the content that we’re producing responds to the question that we’re targeting.

Lessons we learned

We can produce semantically organized knowledge from raw unstructured content much like a modern search engine would do. By reverse engineering the semantic extraction layer using NER from Google’s top results we can “see” the unique terms that make web documents stand out on a given query.

We can also analyze the evolution over time and space (the same query in a different region can have a different set of results) of these terms.

While with keyword research tools we always see a ‘static’ representation of the SERP, by running our own analysis pipeline we realize that these results are constantly changing as new content surfaces in the index and as Google’s neural mind improves its understanding of the world and of the person making the query.

By comparing different queries we can find aspects in common and uniqueness that can help us inform the content strategy (and the content model behind the strategy). 

Are you ready to run your first SERP Analysis using Natural Language Processing?

Get in contact with our SEO management service team now!

Credits

None of this would have happened without Jason’s challenge of “visualizing” E-A-T and Brand SERPs, and this work is dedicated to him and to the wonderful community of marketers, SEOs, clients and partners that are supporting WordLift. A big thank you also goes to the open-source technologies used in this experiment:

WordLift Next Generation receives grant from EU to Bring AI-Powered SEO to Any Website

WordLift, Redlink GmbH, SalzburgerLand Tourismus GmbH and the Department of Computer Science of the University of Innsbruck teamed up under the WordLift Next Generation project to develop a new platform to deliver Agentive SEO technology to any website. The work started in February and will last for 36 months.

We are pleased to announce that we have received funding support from the EU to develop a new technology that will be available for any Content Management System.

The project, called WordLift Next Generation, will be developed with the financial support received from Eurostars H2020, a program promoted by the European Union that supports research activities and innovative SMEs. WordLift NG is part of a financing plan allocated by the EU to make European companies more competitive through AI tools at the service of businesses and people.

As WordLift CEO Andrea Volpini stated recently, “Artificial Intelligence is shaping the online world with huge investments from GAFAM (Google, Apple, Facebook, Amazon and Microsoft). Our company successfully brought these technologies to mid/small size content owners, SMEs and news publishers worldwide using WordPress. It’s time to expand outside of the WordPress ecosystem while adding new services such as semantic search, improved content recommendations and conversational UIs for the Google Assistant and Alexa to help this market segment remain competitive.”

WordLift automates and streamlines the technical processes required to make a website discoverable through search engines and personal digital assistants; we were the first to bring to market a Knowledge Graph optimized for SEO, combining semantic annotations with information publicly available as linked open data.

Goals of the Project

With WordLift NG, the consortium plans to improve the way in which our software understands web articles and builds knowledge bases, employing semantic technology. With a more powerful knowledge graph, it will be possible to fully decouple WordLift from WordPress to make this technology available to any website worldwide. The consortium also aims to improve the quality of the content recommendations and to bring an engaging semantic search experience. Last but not least, as the knowledge base behind each website will improve, it will be possible to enable conversational experiences over Google Assistant and Alexa (the focus will be on the news and media and hospitality sectors).

The first partner meeting in Rome — February 2020

Meet the Consortium

To achieve these ambitious goals, WordLift has teamed up with leading organizations in Europe in the fields of AI, NLP, Semantic Technologies and tourism: Redlink GmbH (led by Andreas Gruber, CEO of the company), the Semantic Technology Institute at the Department of Computer Science of the University of Innsbruck (under the supervision of Ass.-Prof. Dr. Anna Fensel) and SalzburgerLand Tourismus GmbH (with Martin Reichhart, Innovation Manager, as coordinator).

Thanks to the EU-funded project and the collaboration between Italy and Austria, WordLift NG will democratize the usage of agentive SEO, developing a completely new technology stack to help businesses around the world remain competitive in the ever-changing search and digital marketing landscape. The project officially started on February 1, 2020, and will be completed in 36 months.

Drone selfie after the partner meeting