What are embeddings?

In natural language processing, the goal is to have machines understand human language. Unfortunately, machine learning and deep learning algorithms only work with numbers, so how can we convert the meaning of a word into a number?

This is what embeddings are for: they teach language to computers by translating meanings into mathematical vectors (series of numbers).

Word embeddings

In word embeddings, the vectors of semantically similar terms are close to each other. In other words, words with similar meanings lie a small distance apart in a multi-dimensional vector space.

Here is a classic example: “king is to queen as man is to woman”, encoded in the vector space. Relationships such as verb tense, or countries and their capitals, are likewise encoded in a low-dimensional space that preserves the semantic relationships.
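The classic analogy can be sketched with vector arithmetic. The vectors below are hand-picked toy values (two dimensions roughly encoding “royalty” and “gender”), not learned embeddings, but they show how king − man + woman lands closest to queen:

```python
import numpy as np

# Toy 2-D embeddings, hand-picked for illustration (not learned).
# Dimension 0 ~ "royalty", dimension 1 ~ "gender".
vec = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

# "king is to queen as man is to woman" as vector arithmetic.
analogy = vec["king"] - vec["man"] + vec["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The word whose vector points in the most similar direction wins.
best = max(vec, key=lambda w: cosine(vec[w], analogy))
print(best)  # → queen
```

With real embeddings (hundreds of dimensions, learned from text) the same arithmetic recovers analogies like country → capital.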

Knowledge Graph embeddings

We can use the same technique we use for words to analyze the nodes (entities) and edges (relationships) of a knowledge graph. By doing so, we can encode the meanings in a graph in a format (numerical vectors) that we can use for machine learning applications.
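One common family of knowledge graph embedding methods (the article doesn’t name a specific one, so this is an illustrative assumption) is translation-based scoring in the style of TransE: a triple (head, relation, tail) is considered plausible when head + relation ≈ tail. A minimal sketch with hand-picked toy vectors:

```python
import numpy as np

# Toy entity embeddings, hand-picked for illustration (not learned).
emb = {
    "Italy":  np.array([1.0, 0.0]),
    "Rome":   np.array([1.0, 1.0]),
    "France": np.array([0.0, 0.0]),
    "Paris":  np.array([0.0, 1.0]),
}
has_capital = np.array([0.0, 1.0])  # relation vector

def score(head, relation, tail):
    # TransE-style distance: lower means the triple is more plausible.
    return np.linalg.norm(emb[head] + relation - emb[tail])

print(score("Italy", has_capital, "Rome"))   # 0.0 → plausible triple
print(score("Italy", has_capital, "Paris"))  # 1.0 → less plausible
```

In a real system these vectors are learned from the triples in the graph, so that true triples score lower than corrupted ones.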

You can create graph embeddings from the Knowledge Graph that WordLift creates by reading the article above.

In the following presentation, I introduce the concept of multidimensional meanings using a song from “The Notorious B.I.G.”, undoubtedly one of the biggest rappers of all time. The song is called What’s Beef?.

In the text of the song, there is a play on the homophones “I see you” and “ICU”, the acronym for intensive care unit. Most interestingly, the word “beef” assumes a different meaning in every sentence. As we can see, meanings change based on the surrounding words in each sentence. The idea that meaning can be derived from the analysis of the closest words was introduced by J. R. Firth, an English linguist and a leading figure in British linguistics during the 1950s.

Firth is known as the father of distributional semantics, a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data.

It is by using this exact framework (studying semantic similarities between terms inside a given context window) that we can train a machine to understand meaning.

Cosine similarity

When we want to analyze the semantic similarity of two documents (or two queries), and we have turned these documents into mathematical vectors, we can use the cosine of the angle between their respective vectors.

The real advantage is that two similar documents might still be far apart in Euclidean distance, for example when the same terms appear with very different frequencies. We might have the term ‘soccer’ appearing fifty times in one document and only ten times in another. Still, the documents will be considered similar when we compare the orientation of their vectors within the same multidimensional space.

The reason is that even if the terms used are different, as long as their meaning is similar, the orientation of their vectors will also be similar. In other words, a smaller angle between two vectors represents a higher degree of similarity.
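The soccer example above can be sketched in a few lines. The term counts are hypothetical, but they show why cosine similarity ignores the difference in magnitude that dominates the Euclidean distance:

```python
import numpy as np

# Two documents as term-count vectors over the vocabulary
# ["soccer", "match", "goal"] — counts are hypothetical.
doc_a = np.array([50.0, 10.0, 5.0])  # 'soccer' appears fifty times
doc_b = np.array([10.0,  2.0, 1.0])  # same proportions, fewer words

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance is large (~41), but the angle between the
# vectors is ~0, so cosine similarity is ~1: same topic.
print(np.linalg.norm(doc_a - doc_b))
print(cosine_similarity(doc_a, doc_b))
```

Because doc_b is simply doc_a scaled down, the two vectors point in exactly the same direction and their cosine similarity is 1.0 despite the large Euclidean gap.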

Embeddings are one of the different techniques we can use to analyze and cluster queries. See our web story on keyword research using AI to find out more.

The Power of Embeddings in SEO 🚀

Embeddings have revolutionized Search and the SEO landscape, and to help you navigate this shift, I’ve created a straightforward tutorial on fine-tuning embeddings tailored to your content (here is the 🐍 code on a notebook that you can play with).

But before we delve into that, let’s understand the significance of embeddings and how they can help you with your SEO strategies. Here are some tasks where embeddings prove to be extremely useful:

  • Content Optimization: Analyze embeddings of top-ranking pages to identify themes.
  • Link Building: Find semantically related pages by comparing embeddings.
  • Keyword Research: Discover semantically related queries to target keywords.
  • Ranking Models: Predict page rankings for specific keywords using embeddings.
  • Site Migration: Check out the latest article by Michael King on this.
  • Dynamic 404 Pages: Learn more about it here (specifically for e-Commerce).

Train your embeddings for your next SEO task

But can we train embeddings using our own content? Absolutely! Especially when dealing with domain-specific language, it’s beneficial to train your own embeddings.

Here’s how:

👉 You can jump directly to the Colab here and follow the instructions in the notebook.

  • We will use the Finetuner by Jina AI.
  • We’ll train on a simple dataset from datadotworld.
  • Last, we’ll visualize the embeddings using Atlas by Nomic AI.

In this tutorial, I used a craft beer dataset. Our objective is straightforward: to train our embeddings using the names and styles of the beers. To achieve this, we’ll employ Finetuner by Jina AI, which allows us to tailor an existing model (in our case, bert-base-en) to craft our custom embeddings.

Since the dataset is small, the process is blazing fast, and you can immediately see the performance improvement compared with the standard model.

We “learn” embeddings by running a classification task; in our tutorial, we classify beer names by beer style.
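To make the idea concrete, here is a minimal sketch of learning embeddings through a classification task, using plain numpy instead of Finetuner and a tiny hand-made dataset (the beer names and the two styles are hypothetical). Token embeddings and a linear classifier are trained together; the classification gradient is what pulls same-style names closer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny hypothetical dataset: beer name → style (0 = IPA, 1 = Stout).
data = [
    ("hoppy haze ipa", 0),
    ("double dry hopped ipa", 0),
    ("midnight oatmeal stout", 1),
    ("imperial coffee stout", 1),
]

vocab = sorted({tok for name, _ in data for tok in name.split()})
idx = {tok: i for i, tok in enumerate(vocab)}

dim, n_styles = 8, 2
E = rng.normal(0, 0.1, (len(vocab), dim))  # token embeddings (learned)
W = rng.normal(0, 0.1, (dim, n_styles))    # classifier weights

def embed(name):
    # A name's embedding is the mean of its token embeddings.
    return E[[idx[t] for t in name.split()]].mean(axis=0)

lr = 0.5
for _ in range(200):
    for name, y in data:
        toks = [idx[t] for t in name.split()]
        v = E[toks].mean(axis=0)
        logits = v @ W
        p = np.exp(logits - logits.max()); p /= p.sum()
        g = p.copy(); g[y] -= 1.0          # softmax cross-entropy gradient
        grad_v = W @ g                      # backprop into the embedding
        W -= lr * np.outer(v, g)
        E[toks] -= lr * grad_v / len(toks)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After training, same-style names are more similar than cross-style ones.
same = cos(embed("hoppy haze ipa"), embed("double dry hopped ipa"))
diff = cos(embed("hoppy haze ipa"), embed("midnight oatmeal stout"))
print(same > diff)
```

Finetuner does the same thing at scale: a pretrained model (bert-base-en) replaces the random token embeddings, and the task loss nudges its representations toward your domain.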

How can we use embeddings?

Now that the fine-tuning process is completed, we can do something as simple as providing a list of beers and looking for the ones that match the “India Pale Ale” style.

We bring all the embeddings into a DocArray and use query.match() to find the beers that best match our query for “India Pale Ale”.
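Under the hood, matching boils down to ranking by vector similarity. Here is a hedged sketch of the same idea with plain numpy in place of DocArray’s query.match(); the beer names and their 3-D embeddings are toy values standing in for the fine-tuned model’s output:

```python
import numpy as np

# Hypothetical beer-name embeddings; in practice these come from
# the fine-tuned model, here they are toy 3-D vectors.
beers = {
    "Hazy Days IPA":  np.array([0.9, 0.1, 0.0]),
    "West Coast IPA": np.array([0.8, 0.2, 0.1]),
    "Midnight Stout": np.array([0.0, 0.9, 0.3]),
}
query = np.array([0.85, 0.15, 0.05])  # toy embedding of "India Pale Ale"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank beers by similarity to the query, best matches first.
ranked = sorted(beers, key=lambda name: cosine(beers[name], query),
                reverse=True)
print(ranked)  # the two IPAs rank above the stout
```

query.match() in DocArray performs this ranking for you (with a choice of metrics) across every document in the array.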

Visualizing Embeddings using Atlas

We can also visualize our embeddings using Atlas by Nomic AI. Atlas is a super simple library that helps us visualize the latent space behind the embeddings.

Let’s have a look at the Map that we have created 👉 https://wor.ai/tBb7mM and see how, for example, we can easily group all the “Ale” beers.

The Atlas map that represents our embeddings.

This method offers an effective way to review and share embeddings with your team. It’s invaluable in helping everyone verify that the embeddings are being fine-tuned correctly and that semantically similar items are indeed moving closer together.

Why create your own embeddings?

  1. Customization: Pre-trained embeddings are trained on general text corpora (like Wikipedia or news articles), so they might not adequately capture the semantics of domain-specific language. If you’re dealing with a specialized field, like beer brewing in my example, pre-trained embeddings might not know that “stout” and “porter” are similar because they are both dark beers. Training your own embeddings on your specific text data can capture these domain-specific relationships.
  2. Up-to-date language: Language evolves over time, and pre-trained embeddings might not include the most recent language trends or slang. By training your own embeddings, you can ensure that they are up-to-date with the latest language used in your specific domain.
  3. Reduced size: Pre-trained embeddings often include vectors for millions of words, many of which you might not need. By training your own embeddings, you can limit the vocabulary to just the words you care about, reducing the size of your embeddings.