
By Andrea Volpini


Learn how to automate internal link building for your e-commerce category pages by creating related search widgets.

Why are internal links important for product listing pages (PLPs) on e-commerce websites? How can we help users and Google find category pages more effectively? Can we automate the creation of internal links? What’s the value for SEO?

In this blog post we will focus on automating the creation of internal links for e-commerce category pages. We will create the so-called related search widget for an e-commerce website, a navigational element designed to recommend similar categories, to improve internal links and to boost rankings.

We structure content on websites to let people find what they want. There is always beauty in understanding how things are organized on an e-commerce website. In SEO, when we work on user experience, our ultimate goal is to find the truth (the essence of any webpage) and to render the intent. As Peter Morville would say, when organizing content we create environments for understanding.

Here is the outline for this article. If you prefer to jump right into the code, here is the Colab.

We will create recommended links using a small set of commands in Python and a minimal amount of deep learning. Before anything else, let’s review two essential aspects:

  • Categorization is a selective process. We emphasize one aspect and silence many others. When it works, it conveys meaning and helps others find what they need.
  • In a connected graph of web pages, a page’s closeness centrality measures how close it is to every other page, i.e., how few clicks it takes to reach the rest of the site. The more central a page is, the fewer clicks a user needs to find what they need.

In layman’s terms, we need a function: a simple system that, when we input an X (say, the title of a category), gives us a Y (the set of the top 4 or 5 related categories), as sketched below.
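To make this concrete, here is a purely illustrative stub of that function (the name, signature, and default are hypothetical); the rest of the post builds its actual implementation:

```python
# Hypothetical sketch: the name and signature are illustrative.
def related_categories(category_title: str, top_n: int = 5) -> list:
    """Given a category title (X), return the top_n related categories (Y)."""
    # The implementation, based on semantic similarity, is built step by
    # step in the rest of this post.
    raise NotImplementedError
```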

E-commerce category pages tend to have breadcrumbs and hierarchical links (with the entire list of categories and subcategories). What we want to add, instead, is a navigational element that traverses the hierarchical tree in a more meaningful way.

Links on category pages are usually limited to the breadcrumb trail and taxonomy-based filters (the characteristics of the set of products). Recommending links brings the following SEO benefits:

  • Skipping ahead. Related links help users traverse the navigational tree of categories and jump where they need to be. They are horizontal and meant to reduce the click-depth.
  • Improving rankings. Internal links have a tremendous value in helping search engines understand how categories are organized.
  • Distributing PageRank. We want to distribute link equity and ensure that the crawler reaches our most relevant pages with the least effort.
  • Optimizing the anchor text. We can improve the ranking for a specific query by using it as the clickable text that a user will see.

Moreover, on the business side, having the ability to recommend categories helps the shop owner improve the business relevancy of search by:

  • Prioritizing categories for a sales campaign.
  • Promoting certain products.
  • De-prioritizing categories containing out-of-stock products.

There are various examples of internal links on e-commerce (and non-e-commerce) websites; let’s review a few of them:

Amazon.com for gaming keyboard

Amazon uses a block of 6 elements that, as we can see, tend to broaden (keyboard, gaming pc), narrow (gaming keyboard 60 percent), or horizontally expand (gaming monitors, gaming mouse) the initial search.

Alibaba.com for Men’s Coats

On Alibaba, the textual relatedness is weaker. The semantic jump between men’s coats and dog coats is extreme. Besides the questionable association between men and dogs, the focal point remains clothing for men. The design is minimal.

Kijiji.ca for Outdoor & Garden 

Kijiji labels the block as “popular,” and its generation process cannot detect that lawn mower and lawnmower are synonyms. At least in this example, the widget tends to narrow the search intent. The terms used are keywords rather than proper category names.

Artsper.com for Pop Art Paintings

Artsper introduces the concept of search refinement by labeling the block “Refine your search”. The navigation elements help us move in multiple directions without clear sorting criteria. This is not a bad thing per se; quite the opposite: we perceive a sense of freedom and can quickly skim through the terms. Visually, the terms are presented as refinement chips.

Here is how things work on Google Search, Google Images, and Google Arts & Culture. This is a quick exploration of various types of widgets that should give us some ideas on how things can be implemented.

Google Images for apple pie 

Being an image-centric medium, Google Images uses thumbnails, in this specific occurrence, to broaden (cake) or horizontally expand (meat, pecan) the search.

Google Search for sunglasses
Google Arts & Culture for Giorgio de Chirico

Even from this limited selection of implementations, we can highlight the following:

  • Text relevance is essential and far from trivial. As seen in the Alibaba example, even advanced websites can fall into the trap of odd matches.
  • Most of the implementations are based on a horizontal design. If the complexity (i.e., the number of recommended links) is limited, this is an excellent way to provide options without interfering with the facets typically displayed vertically.
  • Refinement (or search) chips are a good design pattern to help users intuitively find what they need. Google uses them a lot across various surfaces.
  • Adding visual elements (a featured image for each category) and the number of items behind the category is an intelligent option to facilitate the discovery of different products (this is extremely valuable when products have a solid visual appeal).

Generative AI can help to create internal links. The workflow is simple and provides a base for understanding how things work behind the scenes. You can find the Google Colab here.

As a reference website I used fila.com, a sportswear manufacturer originally from Biella, in northern Italy. They are not clients of ours. Here is what we will do:

  1. Read the sitemap and extract the list of categories
  2. Parse all the text elements we need
  3. (Extract queries from Google Search Console – I have the code ready, but it will not run for fila.com as I don’t have access to their search data)
  4. Run Semantic Search
    • Extract top n matches (semantic similarity)
    • Re-rank results (additional business logic, if needed, would go here)
  5. Prepare the output file. This would be a JSON file containing a selection of similar categories for each category page.

The UX of the website is clean and the site lacks a related search widget.

1. Accessing the sitemap using Advertools

We will parse the sitemap and extract the list of category pages by removing every page that ends with “.html” (as this characterizes product pages), along with a series of other pages that don’t correspond to product listings (i.e., “news”, “about-” and so on), as sketched below.
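Here is a minimal sketch of this step with Advertools; the sitemap URL and the exclusion tokens are assumptions to adapt to the actual site:

```python
import advertools as adv

# The sitemap URL is an assumption; point this at the site's sitemap index.
sitemap_df = adv.sitemap_to_df("https://www.fila.com/sitemap.xml")

# Keep category pages: drop product pages (URLs ending in ".html") and
# other non-listing pages such as news or about pages.
EXCLUDED_TOKENS = ("news", "about-")
category_urls = [
    url
    for url in sitemap_df["loc"].dropna()
    if not url.endswith(".html")
    and not any(token in url for token in EXCLUDED_TOKENS)
]
```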

2. Extract textual elements from each page

We will then extract a minimum set of information from each page, including the short intro text below the page’s title, the breadcrumbs, the meta description, and the page’s title. In the snippet below, we run a custom extraction using XPath. If the intro text is missing, we can fall back on the other textual elements of the page. We will need to be very careful to remove oddities or other terms that might compromise the search.
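A sketch of the crawl, assuming hypothetical XPath expressions for the intro text and breadcrumbs (the real selectors depend on the site’s markup):

```python
import advertools as adv
import pandas as pd

# Crawl only the category URLs (no link following). The XPath expressions
# are hypothetical: adapt them to the site's actual intro-text and
# breadcrumb markup. Title and meta description are extracted by default.
adv.crawl(
    url_list=category_urls,
    output_file="fl_category_crawl.jl",
    follow_links=False,
    xpath_selectors={
        "intro_text": "//div[contains(@class, 'category-intro')]//text()",
        "breadcrumbs": "//nav[contains(@class, 'breadcrumb')]//a/text()",
    },
)

crawl_df = pd.read_json("fl_category_crawl.jl", lines=True)
```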

Advertools will store the captured data in fl_category_crawl.jl. We might keep this file so that the information can be reused for the next crawl. Here we can see the result of the extraction.

After extracting the title of the page and the short intro text, we will analyze the breadcrumbs and create a data frame. This helps us gain an understanding of the site structure. We might reuse this data frame while composing the final list of suggestions. We might, for example, decide to exclude a link already in the breadcrumbs for that page. Repeating the same link can be annoying, especially if the related search widget is displayed close to the breadcrumb trail.
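A possible sketch of that step, assuming Advertools’ default “@@” separator for multiple XPath matches:

```python
# Advertools joins multiple XPath matches with "@@": split the breadcrumb
# trail into one column per level to map the site structure.
breadcrumb_df = crawl_df["breadcrumbs"].str.split("@@", expand=True)
breadcrumb_df.columns = [f"level_{i}" for i in breadcrumb_df.columns]
breadcrumb_df["url"] = crawl_df["url"]
```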

To clean up the captured text, I used spaCy and a list of site-specific stopwords. We will also remove special characters, numbers, and other oddities. I decided to lemmatize terms, which means reducing each word to its base or dictionary form. We will create embeddings afterward, and I want consistency from the beginning. This helps because we have a limited amount of text available.
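A minimal sketch of the cleanup, assuming an English pipeline and a hypothetical list of site-specific stopwords:

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# The site-specific stopwords are hypothetical; build the list by
# inspecting the crawled text (brand names, boilerplate terms, etc.).
SITE_STOPWORDS = {"fila", "shop", "official"}

def clean_text(text: str) -> str:
    """Strip special characters and numbers, lemmatize, drop stopwords."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    return " ".join(
        token.lemma_
        for token in nlp(text)
        if not token.is_stop
        and not token.is_space
        and token.lemma_ not in SITE_STOPWORDS
    )

crawl_df["clean_text"] = crawl_df["intro_text"].fillna("").map(clean_text)
```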

3. Extract queries from GSC (optional)

Optionally, I have also prepared the code to capture data from Google Search Console. You can take advantage of the list of queries behind each page and the number of clicks. Queries can be extremely valuable, as you might decide to use them instead of the titles of the pages.

Let me give you an example. We might have a long title like “Men’s Casual Sneakers + Athletic Shoes | FILA”; in this case, it would be better to display something more compact like “Men’s Sneakers”. You will need to authenticate on GSC to extract the data. The information will be merged with the crawl dataset by looping over all the crawled URLs.
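Below is a hedged sketch of the extraction using the official Search Console API client; the property URL and date range are placeholders, and obtaining `credentials` via OAuth is elided:

```python
from googleapiclient.discovery import build

# `credentials` must come from an OAuth flow (e.g. google-auth-oauthlib)
# with access to the property; the site URL and dates are placeholders.
service = build("searchconsole", "v1", credentials=credentials)
response = (
    service.searchanalytics()
    .query(
        siteUrl="https://www.fila.com/",
        body={
            "startDate": "2022-01-01",
            "endDate": "2022-03-31",
            "dimensions": ["page", "query"],
            "rowLimit": 5000,
        },
    )
    .execute()
)

# Each row carries the page/query pair plus clicks and impressions.
rows = response.get("rows", [])
```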

4. Computing semantic similarity

Here comes the AI bit of this workflow. We are going to use the SentenceTransformers (SBERT) library. This open-source library allows us to swap the underlying model and choose the one that best fits our needs. Models are available on the Hugging Face Model Hub. We can also fine-tune our own model to improve performance further.

We will index the text extracted from each page and use the title as a query. We will use the native semantic search functionality of SBERT.

The idea behind it is as simple as encoding the text in the “clean text” column and comparing it, within the same vector space, with the embedding of the title.

Semantic Search using SBERT
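A minimal sketch of that search, reusing the crawl data frame built earlier; the model name is one reasonable default, and the top-k/filtering choices are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

# One reasonable default; any SBERT-compatible model from the Hub works.
model = SentenceTransformer("all-MiniLM-L6-v2")

titles = crawl_df["title"].tolist()
corpus_embeddings = model.encode(crawl_df["clean_text"].tolist(), convert_to_tensor=True)
query_embeddings = model.encode(titles, convert_to_tensor=True)

# For each title (the query), find the closest pages in the same space.
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=6)

# Drop the page itself (usually the first hit) and keep the top 5.
related = {
    titles[i]: [titles[h["corpus_id"]] for h in page_hits if h["corpus_id"] != i][:5]
    for i, page_hits in enumerate(hits)
}
```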

5. Preparing the output file

Once we run the same query over the complete list of category pages, we get a new data frame that provides, for each page, a list of recommended links.

Now, depending on the CMS, you can change the output format and get ready to publish it. In our case, we will write the data back into the Knowledge Graph and send it to the CMS using a REST interface (i.e., https://api.wordlift.io/data/https/www.example.com/en-us/category/my-category-page). In the Colab, the data is stored in a JSON file that you can explore directly from the notebook.
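A minimal sketch of the serialization, reusing the `related` dictionary computed above (the file name is arbitrary):

```python
import json

# One entry per category page with its recommended links.
with open("related_categories.json", "w") as f:
    json.dump(related, f, indent=2, ensure_ascii=False)
```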

The file output in JSON

The site navigation schema markup

We will import the data into the knowledge graph and present it to search engines using structured data markup. A related search widget is a site navigation element; we can use the schema.org markup for SiteNavigationElement, a subclass of WebPageElement.

This markup will help search engines understand how things are connected.

An excerpt of the markup where each link is presented as SiteNavigationElement
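For illustration, here is a hedged example of what such markup could look like, built with Python for consistency with the notebook; the category names and URLs are placeholders, not Fila’s actual markup:

```python
import json

# Names and URLs are placeholders for illustration only.
links = [
    ("Men's Sneakers", "https://www.example.com/en-us/men/shoes/sneakers"),
    ("Men's Running Shoes", "https://www.example.com/en-us/men/shoes/running"),
]
markup = {
    "@context": "https://schema.org",
    "@type": "ItemList",
    "itemListElement": [
        {
            "@type": "SiteNavigationElement",
            "position": i + 1,
            "name": name,
            "url": url,
        }
        for i, (name, url) in enumerate(links)
    ],
}
print(json.dumps(markup, indent=2))
```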

6. Scaling the workflow – a better AI lifecycle using NOW

One of the biggest challenges when adopting AI in SEO workflows is designing a lifecycle that will scale across sites of different sizes and with different characteristics.

Working in Colab helps me envision how things should work; I can easily experiment with new ideas, but at some point I will need to run inference on sites with potentially thousands of category pages. I also need the flexibility to replace the model or fine-tune it. Even more importantly, on e-commerce sites I want to be able to work with multiple modalities (text + images). On large properties like fila.com the textual content is very well optimized and I can easily rely on it, but on smaller sites I will need to combine features from text with features extracted from images.

now.jina.ai

To do that, we partnered with Jina AI. As a quick introduction, I have added the code to replicate the same neural search provided by SBERT using Jina NOW’s text-to-text search functionality. As you will see in the code, we connect to an endpoint on the Jina Cloud infrastructure and run queries there. This means having a dedicated pool of machines generating the embeddings and running the neural search. The load will be distributed, and we will be able to autoscale resources as the dataset grows.
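As a hedged sketch of what a query against such an endpoint could look like with the Jina client (the host is a placeholder printed by Jina NOW when you deploy a flow, and the `limit` parameter is an assumption):

```python
from jina import Client, Document

# The host is a placeholder: Jina NOW prints your flow's endpoint when
# you deploy it. The "limit" parameter is an assumption.
client = Client(host="grpcs://your-flow.wolf.jina.ai")

results = client.search(
    Document(text="men's sneakers"),
    parameters={"limit": 5},
)
for match in results[0].matches:
    print(match.text, match.scores)
```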

Conclusions And Future Work

On the SEO front, there are other essential analyses to be done. For proper link sculpting, we will need to prevent any form of cannibalization and evaluate how to distribute links evenly. Moreover, depending on the website, I want to take into account the most representative products for each category and add support for the analysis of product images and product descriptions.

On the tech side, Jina NOW has only recently launched, and we are still working with the team at Jina AI to improve how things work behind the scenes. We want to be able to control the re-ranking directly inside Jina’s flow.

Happy SEO-automation!

Additional Questions

What are PDPs and PLPs in e-commerce websites?

PDP stands for Product Detail Page and represents the webpage that describes a single product. PLP stands for Product Listing Page and refers to a page that lists a category of products.

What is Jina AI?

Jina AI is a neural search framework for building scalable deep-learning search applications. In this blog post, we use Jina NOW, the simplest way to run semantic search in a distributed environment.

