By Andrea Volpini


Discover a free tool for pinpoint query-content matching with unrivaled precision. Elevate your web search experience today!

In the Search Engine Optimization (SEO) world, achieving relevance is a crucial goal driving strategic initiatives and tactical implementation.

A few weeks ago, Paul Thomas and a group of researchers from Microsoft caught the attention of Dawn Anderson, and subsequently mine, by publishing a revolutionary paper titled “Large language models can accurately predict searcher preferences” on how to use large language models (LLMs) to generate high-quality relevance labels and improve the alignment of search queries and content.

Both Google and Bing have heavily invested in relevance labeling to shape the perceived quality of search results. In doing so, over the years, they faced a dilemma – ensuring scalability in acquiring labels while guaranteeing these labels’ accuracy. Relevance labeling is a complex challenge for anyone developing a modern search engine, and the idea that part of this work can be fully automated using synthetic data (information artificially created) is simply transformative.

Before diving into the specifics of the research, let me introduce a new free tool to evaluate the match between a query and the content of a web page that takes advantage of Bing’s team insights.

I reverse-engineered the setup presented in the paper, as indicated by Victor Pan in this Twitter Thread.

How To Use The Search Intent Optimization Tool

  1. Add the URL of the webpage you wish to analyze.
  2. Provide the query the page aims to rank for.
  3. Enter the search intent: the narrative behind the information the user needs.

We provide a simple traffic light system to show how well your content matches the search intent. 

(M) Measures how well the content matches the intent of the query.

(T) Indicates how trustworthy the web page is.

(O) Considers the aspects above and the relative importance of each, providing the overall score as follows:

2 = highly relevant, very helpful for this query

1 = relevant, may be partly helpful but might contain other irrelevant content

0 = not relevant, should never be shown for this query
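The rubric above maps naturally onto a labeling prompt for an LLM. Here is a minimal sketch of how such a prompt could be assembled; the wording, function name, and truncation limit are illustrative assumptions, not the tool’s actual implementation:

```python
# Illustrative sketch of an M/T/O relevance-labeling prompt.
# The prompt wording is an assumption, not the tool's real prompt.

def build_label_prompt(query: str, intent: str, page_text: str) -> str:
    """Assemble a rater-style prompt for a 0/1/2 relevance score."""
    return f"""You are a search quality rater evaluating a web page.

Query: {query}
Search intent: {intent}

Page content:
{page_text[:4000]}

Rate the page step by step:
(M) How well does the content match the intent of the query?
(T) How trustworthy is the web page?
(O) Considering the aspects above and the relative importance of each,
give the final score:
2 = highly relevant, very helpful for this query
1 = relevant, may be partly helpful but might contain other irrelevant content
0 = not relevant, should never be shown for this query

Answer with the final score only: 0, 1, or 2."""
```

The resulting string would be sent as a single message to the model (GPT-4 in the paper’s setup), and the numeric answer parsed from the completion.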

Let’s Run A Quick Validation Test

While we are still working on conducting a more extensive validation test, here is how the experiment is set up:

  • We’re looking for top-ranking and lowest-ranking queries (along with their search intent) behind blog posts on our website;
  • We’re evaluating how the tool considers these two classes of queries;
  • We manually labeled the match between content and query (ground truth) and we are analyzing the gap between the human labels and the synthetic data. 
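A simple way to quantify the gap between the human labels and the synthetic ones is a chance-corrected agreement statistic such as Cohen’s kappa. The sketch below uses made-up labels purely for illustration (our actual test data is not shown here):

```python
from collections import Counter

def cohen_kappa(human: list[int], synthetic: list[int]) -> float:
    """Agreement between two raters on 0/1/2 labels, corrected for chance."""
    n = len(human)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(h == s for h, s in zip(human, synthetic)) / n
    # Expected agreement if the two raters labeled independently.
    h_counts, s_counts = Counter(human), Counter(synthetic)
    expected = sum(h_counts[k] * s_counts[k]
                   for k in set(human) | set(synthetic)) / n ** 2
    return (observed - expected) / (1 - expected)

# Illustrative labels only (hypothetical, not our experiment's data):
human_labels = [2, 2, 1, 0, 0, 2, 1, 0]
synthetic_labels = [2, 1, 1, 0, 0, 2, 1, 1]
print(round(cohen_kappa(human_labels, synthetic_labels), 2))
```

Values near 1 indicate that the synthetic labels track the human ground truth; values near 0 indicate agreement no better than chance.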

As expected, the page (a blog post on how to get a knowledge panel), while trustworthy, is obviously a good match for the query “how to get a knowledge panel” and doesn’t match the query “making carbonara” at all (OK, this one was easy).

Here is one more example. For the blog post on AI plagiarism, the tool finds the content relevant for the query “ai plagiarism checker” but only partially relevant for the query “turing test”.

Current Limitations

While this tool is free, its continued availability is not guaranteed. It operates using the WordLift Inspector API, which currently does not support JavaScript rendering; the tool will therefore not work on web pages rendered client-side with JavaScript. I meticulously replicated the configuration described in the paper (GPT-4 on Azure OpenAI), but the system currently runs on a single instance, so you will have to be patient while waiting for the final result.
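A quick way to check whether a page depends on client-side rendering before submitting it is to fetch the raw HTML and see how much visible text it actually contains. Here is a rough standard-library sketch; the 500-character threshold is an arbitrary assumption:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def looks_client_side_rendered(url: str, min_chars: int = 500) -> bool:
    """True if the server-delivered HTML carries almost no visible text."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    parser = TextExtractor()
    parser.feed(html)
    return len(" ".join(parser.chunks)) < min_chars
```

If the check returns True, the page is likely assembled in the browser by JavaScript, and the tool will not see its content.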

What We Learned From Microsoft’s Research

Relevance labels, crucial for assessing search systems, are traditionally sourced from third-party labelers. However, this can result in subpar quality when labelers fail to grasp user needs. The paper suggests that employing large language models (LLMs) enriched with direct user feedback can generate superior relevance labels. Trials on TREC-Robust data revealed that LLM-derived labels rival or surpass human accuracy.

When implemented at Bing, LLM labels outperformed trained human labelers, offering cost savings and expedited iterations. Moreover, integrating LLM labels into Bing’s ranking system boosted its relevance significantly. While LLM labeling presents challenges like bias, overfitting, and environmental concerns, it underscores the potential of LLMs in delivering high-quality relevance labeling.

This is incredibly valuable for SEOs when evaluating how the content on a web page matches a target search intent.

Google’s Quality Raters

Google utilizes a global team of approximately 16,000 Quality Raters to assess and enhance the quality of its search results, ensuring they align with user queries and provide value. This Quality Raters program, operational since at least 2005, employs individuals via short-term contracts to evaluate Google’s Search Engine Results Pages (SERPs) based on specific guidelines, focusing mainly on the quality and relevance of displayed results.

Google Quality Raters follow a meticulous process defined by Google’s guidelines to evaluate webpage quality and the alignment of page content with user queries. They evaluate the page’s ability to achieve its purpose using E-E-A-T parameters (Experience, Expertise, Authoritativeness, and Trustworthiness). They also ensure that the content effectively satisfies user needs and search intent.

Although Quality Raters do not directly influence Google’s rankings, their evaluations indirectly impact Google’s search algorithms. Their assessments, particularly regarding whether webpages meet specified quality and relevance criteria, guide algorithm adjustments to enhance user experience and satisfaction. This human analysis is crucial for identifying and mitigating issues, such as disinformation, that might slip through algorithmic filters, ensuring that SERPs uphold high standards of quality and relevance.

Moreover, the Quality Raters’ feedback, especially on the usefulness or non-usefulness of search results, also aids in training Google’s machine learning algorithms, enhancing the search engine’s ability to deliver increasingly relevant and high-quality results over time. This is pivotal for YMYL (Your Money or Your Life) topics, which require elevated scrutiny due to their potential impact on users’ health, finances, or safety. The feedback and evaluations from the Quality Raters, therefore, serve as a valuable resource for Google in its continual quest to refine and optimize its search algorithms and maintain the efficacy of its search results.

To learn more about Google’s quality raters, Cyrus Shepherd recently wrote about his experience as a quality rater for Google. Cyrus’s article is super interesting and informative, as always!

Conclusions And Future Work

We aim to continue enhancing our content creation tool by merging knowledge graphs with large language models. Research like the one presented in this article can significantly improve the process of output validation. In the coming weeks we plan to extend the validation tests and compare rankings from Google Search Console with results from the Search Intent Optimization Tool to assess its value in the realm of SEO across multiple verticals.

If you’re interested in producing engaging and informative content at scale, or in reviewing your SEO strategy, drop us an email!
