What are 404 and 410 HTTP errors? How do these error pages affect your SEO rankings, and how can we make them more user-friendly using artificial intelligence and structured data?
In this blog post, we focus on building a semantic search engine for our e-commerce demo site that recommends products on the 404 error page. We will also venture into designing the layout of the page using DALL·E 2 and crafting the error message with the help of GPT-3.
Table of contents:
- What is a 404 page?
- Handling 404 errors in SEO
- Why we need a custom 404 page
- How to Manage 404 pages for E-Commerce websites
- Build a multimodal search engine for product recommendations 🛒
- Collect product title, description and image
- Create the DocArray
- Run Jina NOW
- Create a 404 template using GPT-3 and DALL·E 2
- DALL·E 2 for the visuals
- GPT-3 for the copywriting
- Conclusions And Future Work
- Additional Questions
Believe it or not, this dreaded error page can be the most visited page on your website and represents an important point of contact for many website visitors.
I have always loved error pages because they can be very creative and, at their best, express the brand’s personality and core values: They bring a light sense of humor to otherwise frustrated users (see 404 from Salzburgerland.com, which I have always admired).
What is a 404 Page?
A 404 page is a landing page that informs your users that the requested page is not available or, in some cases, does not exist. The error message is displayed when a link is broken, the user has entered the URL incorrectly, or there is a configuration problem on the web server.
Handling 404 Errors in SEO
One thing must be said: Regardless of how much effort you put into editing your website, 404 errors are inevitable. However, there is a lot you can do to both increase brand awareness and make things easier for Google. A broken link is generally a bad experience for the user and bad for a search engine’s crawler. As the number of errors increases, a search engine will begin to lower the overall ranking of your website.
There are several reasons for a broken link:
- A page may have been deleted, such as a product page for a product that no longer exists, or some of the content may have been incorporated into another page.
- The page has been moved to a new address and no redirection has been configured.
- There is an incorrect URL somewhere on the web (or on the website itself).
- The web server does not handle trailing slashes correctly. Google crawls and evaluates individual URLs, which means that every URL must be unique. A URL with a trailing slash and the same URL without one are effectively two different URLs.
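To make the trailing-slash point concrete, here is a minimal Python sketch. It only shows that the two forms are different strings and how a site could settle on one canonical form before issuing redirects. The helper name with_trailing_slash and the example.com URLs are illustrative assumptions, and the rule is meant for page URLs, not for files such as images.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url: str) -> str:
    """Return the URL with exactly one trailing slash on its path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


a = "https://example.com/shoes"
b = "https://example.com/shoes/"

# For a crawler these are two distinct URLs.
print(a == b)
# After normalization both point to the same canonical address.
print(with_trailing_slash(a) == with_trailing_slash(b))
```

Whichever form you pick, the other one should answer with a redirect to it rather than with a 404.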
In general, it is good practice to redirect some of those 404 pages to relevant URLs on the site (I wrote about how to do that at scale here some time ago). It is never helpful to redirect all traffic to the homepage. Quite the opposite: this makes Google treat your homepage as a soft 404. The consequences are dramatic, yet the mistake is very common, and we should always avoid it.
An error is a signal, both for us (there is something that needs to be fixed) and for a crawler (this page does not exist anymore).
Why We Need A Custom 404 Page
It is important to send a clear message to the user: There is nothing here, the link you clicked on is broken. At the same time it is an opportunity to:
- Keep the user engaged on the website.
- Show them your brand identity.
- Reduce the bounce rate.
- Keep them in the funnel (this applies to e-commerce, but also to blog content designed for conversion).
- Get users to convert.
Design Patterns For A Good 404 Page
There’s always something new to learn from great websites, and when it comes to 404 pages, HubSpot has compiled a great list of examples. Among the most recurring patterns are the following:
- Links to the newest or most popular pages
- A search bar
- Creativity and humorous text to let visitors know they are in the wrong place
- Contact information
- Main navigation (header and footer)
- Optionally you can find:
  - One CTA
  - Links to social media
  - Video content, a doodle or something entertaining
  - A link to report the error
How To Manage 404 Pages For E-Commerce Websites
E-commerce websites follow a different logic: every time a catalog is updated, something may change in the overall website architecture. When a product is out of stock (either temporarily or permanently), you generally have three options to choose from:
- Remove the page (PDP) so that the user lands on a 404 page. We can mitigate the impact by replacing any links we have to web pages at the top of the funnel, such as product reviews or blog posts.
- Leave the PDP untouched and focus the CTA on driving sales of similar products. I like this option because it keeps the SEO value intact, but of course it negatively impacts the user experience if it gets out of hand and the recommendations are not good enough. Here are two important recommendations you should follow:
  - Use structured data to highlight the right product availability. There are several options to choose from, and even if Google shows items that are on backorder as out of stock, this is important information to have.
  - Remember to show product availability on PLP pages and internal search results as well. It is important to inform the user before they come to the PDP.
- Add a 301 redirect from the PDP to the page of a similar product or a related category. This is a good option, but also a risky one: the new product will eventually become outdated as well, and you will create a chain of redirects that wastes crawl budget.
You need to choose the solution that works best based on several factors: Is the unavailability temporary or permanent? Do you already have links leading to the site? Is the page well visited?
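The structured data recommendation under the second option can be sketched as follows. This is a minimal example, not the markup our demo site emits: it builds schema.org Product markup in Python and maps an internal stock status to one of the schema.org availability values. The function name product_jsonld, the status keys and the example product are assumptions made for illustration.

```python
import json

# A subset of the schema.org ItemAvailability values.
AVAILABILITY = {
    "in_stock": "https://schema.org/InStock",
    "out_of_stock": "https://schema.org/OutOfStock",
    "backorder": "https://schema.org/BackOrder",
    "discontinued": "https://schema.org/Discontinued",
}


def product_jsonld(name, url, price, currency, stock_status):
    """Build Product markup with an Offer that states the availability."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "url": url,
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
            "availability": AVAILABILITY[stock_status],
        },
    }


# Hypothetical product, used only to show the resulting JSON-LD.
markup = product_jsonld(
    "Pink sunglasses",
    "https://example.com/pink-sunglasses/",
    "49.00",
    "EUR",
    "backorder",
)
print(json.dumps(markup, indent=2))
```

The printed JSON would go inside a script tag of type application/ld+json on the PDP.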
Build A Multimodal Search Engine For Product Recommendations 🛒
Let us now focus on the first option and improve a customized 404 page using the product knowledge graph on our e-commerce demo site.
The workflow is simple and provides a foundation for understanding how things work behind the scenes. Here is the Google Colab we will use for the data preparation phase. Once the data is ready, we will install and run Jina NOW locally and deploy the search engine to the Jina Cloud to keep things super simple.
Here is what we will do:
- Gather product data using a GraphQL query
- Collect product title, description and image
- Push a DocArray – this is a key library in Jina designed for nested, unstructured, multimodal data, including text, image, audio, video, etc.
- Run Jina NOW Search
- Create a text-to-image search
- Index the DocArray created in the previous steps
- Run the queries:
- Extract top n matches given a text (semantic similarity)
- Re-rank results (additional business logic)
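The last two steps of the list above (top n matches, then re-ranking) can be illustrated with a toy Python sketch. In the real setup Jina NOW computes the embeddings with CLIP and runs the similarity search for us; here the two-dimensional vectors, the in_stock field and the penalty value are all made-up assumptions, used only to show the logic.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec, products, n=3):
    """Return the n (score, product) pairs most similar to the query."""
    scored = [(cosine(query_vec, p["embedding"]), p) for p in products]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def rerank(matches, out_of_stock_penalty=0.5):
    """Hypothetical business logic: push out-of-stock products down."""
    def adjusted(pair):
        score, product = pair
        return score if product["in_stock"] else score * out_of_stock_penalty
    return sorted(matches, key=adjusted, reverse=True)


# Toy catalog with made-up embeddings.
products = [
    {"name": "red hat", "embedding": [1.0, 0.0], "in_stock": False},
    {"name": "pink hat", "embedding": [0.9, 0.1], "in_stock": True},
    {"name": "leather bag", "embedding": [0.0, 1.0], "in_stock": True},
]

matches = top_n([1.0, 0.0], products, n=2)
print([p["name"] for _, p in matches])
print([p["name"] for _, p in rerank(matches)])
```

The re-ranking step is where availability, margin or popularity can override pure semantic similarity.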
In this tutorial, we will use a demo e-commerce website with fictional products called Fashion Therapy. We created the underlying product catalog by slicing and enriching the WDC Schema.org Table Corpus, which is available for public download provided by the University of Mannheim.
The original dataset included 18,521 rows. After data cleaning and enrichment with categories (accessories, shoes, bags and clothing) and colors using our NLP stack, we obtained more than 4,500 products.
The site runs on WooCommerce and uses WordLift. The product data is the bare minimum you would find on an e-commerce website.
The same data used for structured data is also used to develop the semantic search engine. As we can see in the diagram below, structured data is served by WordLift via GraphQL, a developer-friendly, open-source data query language.
If you run your own infrastructure, you may need to extract all structured data from your CMS or crawl the website (here is how to run web scraping for SEO tasks).
On the Jina side, we create a DocArray and push it to the Jina cloud. We then launch Jina NOW from our local computer and index the DocArray we transferred to the cloud earlier.
In this implementation we will create a text-to-image search. NOW will index images using CLIP and we will run queries using text strings.
1. Gather product data using a GraphQL query
We will run a query on the knowledge graph (KG), accessing the product properties through the schema vocabulary.
To do this, we use a function called wl_graphql that returns a data frame with the attributes we need. We currently use only one image per product and rename the column containing the image URLs to uri. This tells Jina NOW that these are the URLs of the images to be indexed.
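As a rough sketch of what this step involves, the Python code below prepares a GraphQL request and flattens the response, keeping one image per product under the name uri. It is not the wl_graphql function from the Colab: the query fields, the products root, the Key authorization scheme and the helper names are assumptions, and it returns plain dictionaries where the Colab works with a data frame.

```python
import json
import urllib.request

# Hypothetical query: the field names are illustrative, not the actual schema.
QUERY = """
{
  products {
    name
    description
    image
    url
  }
}
"""


def build_request(endpoint, api_key, query=QUERY):
    """Prepare (but do not send) a GraphQL POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Key {api_key}",  # assumed authorization scheme
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def to_rows(response, root="products"):
    """Flatten the response and keep one image per product as 'uri'."""
    rows = []
    for item in response["data"][root]:
        row = dict(item)
        images = row.pop("image", None)
        if isinstance(images, list):
            images = images[0] if images else None
        row["uri"] = images
        rows.append(row)
    return rows


# Made-up response, shaped like a typical GraphQL payload.
sample = {
    "data": {
        "products": [
            {
                "name": "Pink sunglasses",
                "description": "Retro frame",
                "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
                "url": "https://example.com/pink-sunglasses/",
            }
        ]
    }
}
rows = to_rows(sample)
print(rows[0]["uri"])
```

Sending the request with urllib.request.urlopen and parsing the JSON body would give to_rows its input.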
2. Collect product title, description and image
Next, we create a single column called full_text that contains both title and description, and apply two filters:
- One to select only recognized image formats (png, jpg, etc.).
- One to randomly select 300 products.
The full_text column is useful because, in some cases, we might prefer to create a text-to-text search engine instead of the image search we build in this tutorial.
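The preparation described above can be sketched in plain Python as follows. In the Colab the same steps run on a data frame; this dependency-free version works on a list of dictionaries. The list of accepted extensions and the helper name prepare are assumptions.

```python
import random

# Assumed list of image formats; adjust it to what your indexer supports.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def prepare(rows, sample_size=300, seed=None):
    """Add full_text, keep recognized image formats, sample the result."""
    prepared = []
    for row in rows:
        # Ignore the query string and the case when checking the extension.
        uri = (row.get("uri") or "").lower().split("?")[0]
        if not uri.endswith(IMAGE_EXTENSIONS):
            continue
        title = row.get("name") or ""
        description = row.get("description") or ""
        prepared.append({**row, "full_text": f"{title} {description}".strip()})
    rng = random.Random(seed)
    return rng.sample(prepared, min(sample_size, len(prepared)))


# Made-up rows: the second one is dropped because of its extension.
rows = [
    {"name": "Hat", "description": "Wool hat", "uri": "https://example.com/hat.JPG"},
    {"name": "Bag", "description": "Tote", "uri": "https://example.com/bag.svgz"},
]
small = prepare(rows, sample_size=300)
print(small)
```

Passing a seed makes the random selection reproducible between runs.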
Choosing What We Want To Index
In the real world, we would apply business logic here, since it is not necessary to index the entire catalog. A recommender system like the one we are building needs to be intelligent enough to show only the best products (e.g., the most popular products or the products with the most hits) or what we want to sell (e.g., a combination of popular and on-sale products). For simplicity, we will use .sample(300) to select the products to index.
Make Sure That Files Are Accessible
Jina NOW will analyze the data frame (small_df), download and index all images in the uri column. Before we do that, let us check if all images are available.
We have two options:
- Downloading the files locally and creating the DocArray from the local files.
- Checking the validity of each URL when creating a DocArray from the data frame.
We will choose the second option as it is the simplest. We added a simple function that checks the HTTP status of each image before creating the DocArray (this is optional, but might be helpful in some cases).
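A check of this kind could look like the sketch below. It is not the function from the Colab: it uses only the standard library, sends a HEAD request per image and treats anything other than a 200 response as unavailable. The names image_is_available and keep_available are assumptions, and some servers reject HEAD requests, in which case a GET would be needed.

```python
import urllib.request


def image_is_available(url, timeout=5.0):
    """Return True if a HEAD request to the URL answers with HTTP 200."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Covers HTTP errors (404, 410, ...), network failures and malformed URLs.
        return False


def keep_available(rows, check=image_is_available):
    """Keep only the rows whose image passes the availability check."""
    return [row for row in rows if check(row["uri"])]


# A malformed URL is rejected without any network traffic.
print(image_is_available("not-a-url"))
```

The check parameter makes it easy to swap in a cached or parallel implementation when the catalog is large.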
3. Create the DocArray
We can now create the DocArray with a simple line of code.
And now we can push it to the cloud with another command. We are going to generate a random name for it and use da.push().
We can now access the DocArray using its name (404_attvltyz in this case).
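For reference, the two commands could look like the sketch below. It assumes the pre-0.30 docarray API (DocumentArray.from_dataframe and push) and a logged-in Jina account; the helper names random_name and push_to_cloud are illustrative. The docarray import sits inside the function so that the name generation can be tried without installing the library.

```python
import random
import string


def random_name(prefix="404_", length=8):
    """Generate a name such as 404_attvltyz for the DocArray."""
    suffix = "".join(random.choices(string.ascii_lowercase, k=length))
    return prefix + suffix


def push_to_cloud(df, name):
    """Create a DocArray from the data frame and push it to the Jina cloud.

    Assumes docarray < 0.30 and that you are logged in to Jina.
    """
    from docarray import DocumentArray  # third-party, imported lazily

    da = DocumentArray.from_dataframe(df)
    da.push(name)
    return da


name = random_name()
print(name)
```

In the notebook, push_to_cloud(small_df, name) would then make the data available under that name.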
4. Run Jina NOW
This is where neural search comes into action. We are going to use the Jina NOW library, an open-source library that allows us to quickly set up a text-to-image search.
We have partnered with Jina AI to integrate vector-based search into the WordLift platform, and we are particularly focused on e-commerce and conversational user interfaces. Running Jina NOW is pretty straightforward, but I would recommend following the instructions right here as things are constantly changing.
Select the text-to-image search, set the quality to medium, and specify a custom dataset by passing the name of the DocArray we created with Colab. Once the data stream is created, you can access the APIs directly or use the Streamlit playground to test things out (it’s ready for you to test here).
Testing the API
You can interactively test the search endpoint using the Swagger UI at 👉https://nowrun.jina.ai/api/v1/text-to-image/docs#/Text-To-Image/search_search_post.
See below the settings to run the query on the text-to-image search that Jina NOW built for our e-commerce demo site.
From there, in WordPress, you can edit 404.php and make the required changes, as we did.
Let’s try a few examples on our demo site:
- https://product-finder.wordlift.io/the hat I like
- https://product-finder.wordlift.io/super duper leather dress
- https://product-finder.wordlift.io/pink sunglasses
As you can see, we use everything we get in the URL after the domain name as a text query to show the most similar results we have (links were not added to the results).
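The logic of turning the requested URL into a search query can be sketched in Python as follows (on the demo site it lives in 404.php). The function names are illustrative, and the request body with the text and limit fields is an assumption about the search API: check the Swagger page for the actual parameters before using it.

```python
import json
import urllib.request
from urllib.parse import unquote, urlsplit


def query_from_url(url):
    """Use everything after the domain name as the text query."""
    path = urlsplit(url).path
    return unquote(path).strip("/ ")


def build_search_request(endpoint, text, limit=9):
    """Prepare (but do not send) a search request. Body fields are assumed."""
    body = json.dumps({"text": text, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = query_from_url("https://product-finder.wordlift.io/pink%20sunglasses")
print(query)
```

The results returned by the search endpoint are then rendered as product cards on the error page.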
Create a 404 template using GPT-3 and DALL·E 2
In addition to the recommended product listing, I also ventured into creating the layout for the page using DALL·E 2 and the message copywriting using GPT-3.
DALL·E 2 for the visuals
We can start with something simple like this (prompt follows):
The layout of a creative 404 error page for a men and women fashion website.
When composing the prompt, we must remember a few essential rules, like adding men and women, to prevent the otherwise strong association between “fashion” and “woman” (an unwanted bias).
The layout for a 404 error page on a men and women fashion e-commerce.
Small variations can help. We can also get more creative and create one single photographic element that we will use on the page. Let’s try it.
Two models, a boy and a girl, depicted from the back in a 70’s discotheque. Neon style, 4k, hyper realistic photography.
Two models, a boy and a girl, in a 70’s discotheque. Neon style, 4k, hyper realistic photography. They look confused.
You can see how this could be further developed to create something unique for our website.
GPT-3 for the copywriting
Let’s quickly set up the prompt for the error message using GPT-3. Here follows a first zero-shot implementation.
Write a creative copy for the 404 error page aimed at fashion shoppers: Message: Ouch!
Here is the final output (in green).
Let’s make a small change to the prompt that would work well with our recommendations.
Write a creative copy for the 404 error page of a fashion e-commerce website. Suggest to website visitors that valid alternatives are available. Message:
This makes sense and I hope you got an idea of how all of this could work.
If you want to learn more about prompt engineering in SEO, I suggest reading our web story.
Conclusions And Future Work
As usual, we need to go into live testing and apply the right business logic to select the assets to be indexed. We want to make sure that there is a proper feedback loop to increase the quality of the results and therefore the sales on the website.
Jina NOW is progressing nicely, and while we have used text-to-text for automatic internal linking in e-commerce, we have experimented with text-to-image here. As far as implementation goes, things are similar. In reality, relevance always boils down to two crucial aspects:
- The quality of the data we send into the flow. The more we specialize the data set based on SEO best practices and business logic, the better it gets;
- The model used. We rely primarily on a fine-tuned model to create automated product descriptions. Also, in search and recommendations, it is crucial to improve the quality of the results by changing the parameters according to our dataset.
I finally want to thank the Jina AI team and particularly Joschka Braun, Mohammad Kalim Akram and Florian Hönicke leading the NOW team. Also special thanks to our Claudio Salatino for working on the PHP side of the project.
What’s the difference between 404 and 410?
A 404 tells the search engine that the page is not available at that location and has probably never been there (in other words, the server has no knowledge of the resource and sends the correct HTTP status code, 404 Not Found). A 410 indicates that a page used to exist but is no longer available. In general, if you want to discourage Google from re-crawling a URL, you should go for a 410. In an experiment conducted by the team at Reboot, 404 URLs were crawled 49.6% more frequently than 410 URLs.
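On the server side, the distinction can be expressed in a few lines. The Python sketch below assumes we keep a record of the URLs we removed on purpose, so that those answer with 410 Gone while unknown URLs answer with 404 Not Found. The function name status_for and the example paths are made up.

```python
from http import HTTPStatus


def status_for(path, live_paths, removed_paths):
    """Pick the status code for a requested path."""
    if path in live_paths:
        return HTTPStatus.OK
    if path in removed_paths:
        # The page existed and was removed on purpose.
        return HTTPStatus.GONE
    # The server knows nothing about this resource.
    return HTTPStatus.NOT_FOUND


live = {"/shoes/red-sneakers/"}
removed = {"/shoes/discontinued-boots/"}

for path in ("/shoes/red-sneakers/", "/shoes/discontinued-boots/", "/the hat I like"):
    status = status_for(path, live, removed)
    print(path, status.value, status.phrase)
```

The custom error page can be served with either status code. Only the signal sent to the crawler changes.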
What is a soft 404?
A soft 404 is when the search engine receives a 200 HTTP status code but it thinks the page actually should be considered a 404. It can happen when we automatically redirect traffic to the homepage or when we have a blank page on the website.
Does Google index 404 pages?
No, Google will not index 404 error pages. They are simply a signal that the resource is not present at that location. However, we can customize the error page to engage the website visitor.