Dominate Black Friday & Cyber Monday with Strategic SEO Techniques

Table of contents:

  1. Why Black Friday and Cyber Monday SEO tactics matter for advanced ecommerce in 2024
  2. Black Friday & Cyber Monday SEO in the era of Generative AI
  3. SEO changed with generative AI, Google’s updates and economy shifts
  4. What Twitter and Independent Research Have Taught Us
  5. Underutilized Black Friday & Cyber Monday Schema Markups
  6. Selling more and going beyond prompting and schema markups

Why Black Friday And Cyber Monday SEO Tactics Matter For Advanced E-commerce In 2024

Black Friday and Cyber Monday SEO have transformed significantly in 2024 due to advancements in generative AI, Google’s updates and AI-driven SEO software. If your SEO team is executing strategies that are not in line with generative AI efforts, you should consider pivoting and bolstering your AI SEO initiatives. Competition during these two events is fierce, and it won’t be easy to stand out with average funnel optimization and subpar customer journeys. Your potential customers are more demanding than ever, so your mindset should shift too.

Why am I saying this? 

The significance of SEO tactics during Black Friday and Cyber Monday (BFCM) in 2024 for advanced e-commerce is multifaceted and pivotal for the success of any online business aiming to capitalize on these peak shopping periods. As competition reaches its zenith, I know that having a robust SEO strategy helps e-commerce sites stand out in search engine results, drawing in more organic traffic and potential sales. Black Friday and Cyber Monday represent some of the year’s highest revenue potential, making it critical for businesses to secure top rankings in search results where shoppers are most active. Moreover, optimizing SEO is not just about visibility; it also encompasses enhancing the user experience and tailoring content and promotions to specific market segments.

If I were you, I would want to be a business that engages more effectively with its target audience, leading to improved customer interactions and sales outcomes. Technical preparedness and content scaling are crucial aspects of this setup: they involve ensuring that websites can manage the surge in Black Friday and Cyber Monday traffic and adhere to SEO best practices that maintain visibility and functionality.
It’s easier said than done, believe me…but I promise to do my best to describe practical ways to prepare for the Black Friday and Cyber Monday period. I hope you’ll enjoy them and be patient with me, my dear reader.

Black Friday SEO In The Era Of Generative AI

Generative AI, such as ChatGPT, Google Bard, Bing Chat, Adobe Firefly, Perplexity AI, Midjourney, and so on, holds immense potential to disrupt various industries, including marketing and creative work. When it comes to Black Friday SEO, generative AI can play a pivotal role in content generation, automation, personalization, and efficiency. 

Even more, there have been some substantial changes in how merchant data is interacting with the Google shopping experience in the past year. We can observe that there are some tectonic shifts for merchants to consider.

Bing suggesting new products using AI

Bing is using AI to suggest products. According to its official statement, “Bing’s goal is to bring more joy to shopping—from the initial spark of inspiration to the exciting unboxing experience—by making the process easier and giving you confidence you’re getting the right item at the right price.”

What can you do to stay ahead of the curve? Our vision and recommended approach is to embrace semantic, ontology-based prompting, where structured data is fed into a large language model (LLM) and used to validate the output. WordLift is proud to be one of the companies pioneering ontology-based prompting, helping you construct data-informed prompts that use your data programmatically.

How Marketers Can Prepare Their Black Friday SEO Plans

WordLift has been at the forefront of genAI x SEO innovation, and we’ve perfected the AI snapshot for our clients before. I saw the mistakes, the grit and the innovative efforts as we poured our hearts into crafting stunning customer experiences that wowed users.

When it comes to Black Friday, Cyber Monday, and holiday sales in general, we have direct experience with the following:

  1. Merchant Feed + Structured Data are interconnected. Data out of sync will stop your ads campaign. Merchant metadata is richer at the moment (Google is trying to make the two converge, but there are still differences, such as the sale price). Instead of waiting, we can bring your data into your product knowledge graph (PKG) and run technical SEO optimizations on top of it.
  2. We can use sales (or campaign) data to regenerate product descriptions or Product Listing Page (PLP) intro text at scale and boost sales (a minimal sketch follows this list), by feeding a large language model (LLM) with:
    • the existing description
    • the sales price of the items we want to promote
    • a few examples of product descriptions or PLP intro text that would work effectively for SEO (or creative copy taken from the campaign).
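To make this concrete, here is a minimal sketch of how such a regeneration step could look. It is an illustration rather than WordLift’s production pipeline: the model name, the prompt wording, and the use of the OpenAI Python client are assumptions you would adapt to your own stack.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def regenerate_description(existing_description: str, sale_price: str, examples: list[str]) -> str:
    # Build a data-informed prompt from the three ingredients listed above.
    example_block = "\n".join(f"- {example}" for example in examples)
    prompt = (
        "Rewrite the product description below for a Black Friday campaign.\n"
        f"Highlight the sale price of {sale_price} and match the tone of the examples.\n\n"
        f"Examples of copy that works well:\n{example_block}\n\n"
        f"Current description:\n{existing_description}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content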

To sum up, the quality of your data and data curation workflows is crucial for this to perform in the best possible way.

There is nothing more powerful than utilizing what you have on your side in the first place. Do you want to learn how you can bring your business to the next level? Book a demo.

Here are some other ways SEO marketers can prepare for Black Friday in the era of generative AI:

  1. Streamline Content Creation: Generative AI empowers marketers to streamline content creation for marketing purposes. By leveraging AI models, marketers can effortlessly generate text and content that aligns with their brand’s style and tone. This automation saves valuable time and resources by handling the generation of product descriptions, promotional emails, blog posts, and landing page content specifically tailored for Black Friday & Cyber Monday.
  2. Personalization and Targeting: Generative AI enables marketers to analyze consumer behavior patterns and preferences, leading to the creation of personalized content. Leveraging AI and ML, marketers can deliver targeted marketing campaigns, tailored offers, and recommendations that resonate with individual customers. This personalized approach significantly boosts customer engagement and conversion rates during the Black Friday & Cyber Monday shopping season.
  3. SEO Optimization: Black Friday is an intensely competitive period for online retailers. Generative AI can aid marketers in optimizing their SEO strategies by generating relevant and keyword-rich content. AI models analyze search trends, identify popular keywords, and produce optimized meta descriptions, titles, and product descriptions, ultimately enhancing organic search visibility during Black Friday.
  4. Efficient Marketing Automation: Generative AI facilitates the automation of repetitive marketing tasks, freeing up marketers to focus on strategic initiatives. AI-powered chatbots can handle customer service inquiries, provide real-time support, and offer personalized recommendations based on user preferences. This automation elevates the customer experience and allows marketers to dedicate more time to other critical aspects of Black Friday campaigns.
  5. Embrace AI and Automation for Fulfillment: Black Friday triggers a surge in online orders, demanding efficient fulfillment from retailers. AI-driven tools, such as robotics and automated software, optimize warehouse operations, streamline order processing, and ensure round-the-clock fulfillment centers. By embracing AI and automation, retailers can steer clear of stockouts, enhance customer satisfaction, and capitalize on the growing trend of online shopping during Black Friday & Cyber Monday.

To effectively prepare for Black Friday in the era of generative AI, marketers must prioritize the adoption of AI technologies and leverage the benefits they offer. Staying updated with the latest advancements is crucial, as it allows marketers to explore how these technologies can enhance content creation, personalization, automation, and SEO optimization. By embracing generative AI and seamlessly integrating it into their Black Friday marketing strategies, marketers can deliver more efficient and impactful campaigns, gain a competitive edge, and achieve better results in the dynamic landscape of Black Friday & Cyber Monday sales.

The key to maximizing the potential of generative AI lies in integrating it with an intelligent content framework. This framework should incorporate carefully selected schema markups to facilitate content comprehension and harness the advantages of organic rankings.

SEO Changed With Generative AI, Google’s Updates And Economy Shifts

Try searching for the keyphrase “use generative AI for SEO” and you’ll see approximately 12 million results. Here’s the screenshot from German SERPs below for that keyphrase: if you thought this was just another trend like everything else you’ve seen in past years, I’d advise you to think again. Take my advice, I speak from experience.

Don’t believe me? From what I’ve observed in 2024 so far, the landscape of SEO has been reshaped significantly due to the advent of generative AI, Google’s latest updates, and the economic climate. Generative AI has revolutionized content generation, allowing for the crafting of engaging and highly targeted content more efficiently. This advancement aids in enhancing the visibility of websites on search engines by ensuring content is not only relevant but also highly personalized to meet user needs.
Last year, I wrote: “Generative AI, such as ChatGPT, Google Bard, Bing Chat, Adobe Firefly, Perplexity AI, Midjourney, and so on, holds immense potential to disrupt various industries, including marketing and creative work. When it comes to Black Friday SEO, generative AI can play a pivotal role in content generation, automation, personalization, and efficiency. Even more, there have been some substantial changes in how merchant data is interacting with the Google shopping experience in the past year. We can observe that there are two tectonic shifts for merchants to consider.” I was right. The changes did arrive, as did the studies estimating the reduction in organic traffic once the Search Generative Experience (SGE) is fully enabled.

The next big moment came in March 2024, when Google rolled out a major update focused on improving the quality of content surfaced in search results. This update prioritized high-quality, helpful content while penalizing sites that offer little value in terms of originality or usefulness. The intent is clear: to refine search results so that they provide more value, pushing marketers and content creators to invest in substantial and useful content. No negotiations on that one and, honestly, it was a hard ride.

Additionally, the economic conditions of 2024 have influenced SEO strategies. With budgets possibly tightened, there is a greater need for SEO efforts to be more strategic and effective. Companies are now more focused on optimizing their ROI, ensuring every dollar spent on SEO can be justified with tangible improvements in traffic and sales conversions. I know this and have witnessed it first-hand, so I’m trying to pass my wisdom on to you, my dear reader, so that you don’t get burned on the same things I did.

What Twitter And Independent Research Have Taught Us

The end-of-the-year holiday season is probably the most interesting time for buyers worldwide: everyone is waiting for Black Friday & Cyber Monday to come around and catch a good deal. Customer behavior has changed considerably, especially after Covid-19: businesses that were late to digital transformation and to understanding the importance of selling on the Internet started shifting their mindsets and preparing to build their business online.

Your success and profits depend, among other things, on how well you apply structured data best practices to support the experience of your online buyers. Users are eager for better experiences and want to find what they need faster and at scale. Therefore, your number one priority should be to present your best deals and offers on your website in a way that is easily reachable, understandable and stands out on the huge online battlefield. So, let’s go!

We ran a Twitter thread to ask the SEO community (special thanks go to Rich Tatum) about the least used structured data SEO markups for Black Friday. At the same time, we performed independent research in which we analyzed over 107 popular e-commerce stores (Black Friday pages only) in Switzerland, Germany, the UK, France, Spain, Italy, the Netherlands, Belgium, Sweden, Norway, Austria, Europe, and worldwide in general, ranked by popularity as defined by the profits they generate.

Some of the most notable e-commerce brands include Amazon, eBay, migros.ch, microspot.ch, digitec.ch, Mediamarkt & MediaWorld stores, bol.com, Decathlon, Tesco, Zalando, Otto, Carrefour, Next, Very, Argos, Wish, Asda, Asos, IKEA, Coop, H&M, Rewe, Lidl, Matas, Zara, Schein, idealo, Boulanger, GearBest, Privalia, Global Savings Group, Anibis.ch, Groupon, Alibaba, AliExpress, Flipkart, Walmart and many more. Here’s what we learned through our automated scraping and summarizing process:

  1. Most of the businesses do not even have structured data (over 54% of them), or they use only the basic schema markups for LocalBusiness, Website, and/or Organization;
  2. BreadcrumbList, Offer, and FAQPage are partially underutilized in the study;
  3. SaleEvent, CollectionPage, OpeningHoursSpecification, OfferCatalog, ImageObject and VideoObject, and discountCode were the most underused Black Friday schema markups.

Underutilized Black Friday & Cyber Monday Schema Markups – A List

Let us elaborate on the second and third findings separately, so that you can use these schema markups to their full power for your e-commerce SEO:

  • BreadcrumbList -> good to use when you have a collection of pages that are interlinked. The rule here is to place your top pages first in your breadcrumb list, with the subsequent pages developing from there, forming a list of ordered, sequential elements. ItemList can also be utilized here.
  • Offer -> This is reserved for both online and offline deals that you want to showcase on your website, like selling a ticket for an event, streaming occasions, and so on. It goes well in combination with paymentMethod, areaServed, aggregateRating, availabilityStarts, availabilityEnds, category, offeredBy, GTIN, and similar attributes that proved to be helpful in the process.
  • FAQPage -> Great deals come with many unanswered customer questions around the new prices in specific time periods like Black Friday and Cyber Monday. That is why it is important to include the most prominent questions in your webpage by using the FAQPage schema markup. This structured data type goes well with mainContentOfPage, speakable, abstract, about, author, and rating attributes when combined together.
  • SaleEvent -> This one is definitely the most underused across the ecommerce industry. This schema markup is probably the most appropriate for temporary deals (therefore, the use of the word event in its name is obvious). SaleEvent works perfectly with the audience, contributor, startDate, endDate, eventAttendanceMode, eventStatus (for postponing and rescheduling), location, offers, subEvent, and sameAs properties. Definitely worth checking out, and ideal when you want to showcase the commercial intent of a webpage, compared to the FAQPage schema markup, which is more informational. A minimal JSON-LD example follows this list.
  • CollectionPage -> This is a more specific use of the WebPage schema markup. It can also be used with ItemList, together with the mainEntity that references it. This way, it is clear that the webpage presents a collection of items. Example:

{
  "@context": "http://schema.org",
  "@type": "CollectionPage",
  "mainEntity": {
    "@type": "ItemList",
    "itemListElement": [
      { "@type": "ItemPage" },
      { "@type": "ItemPage" },
      { "@type": "ItemPage" }
    ]
  }
}

  • OpeningHoursSpecification -> It is always wise to update your opening hours during specific season periods like the black week so that you can properly inform your online visitors (potential customers) about the right time to reach out to you. Great when used in combination with dayOfWeek, opens, closes, validFrom, validThrough, description, and sameAs attributes.
  • OfferCatalog -> This is basically an ItemList but it refers to a list of products that are offered by the same provider. Very important to know the difference here. Works pretty well when used with itemListOrder, alternateName, description, disambiguatingDescription (useful for differentiating between similar product items), name, sameAs and identifier properties. 
  • discountCode -> often confused with the Offer schema markup, which usually refers to a product, while this one can refer to a service. It is still not part of Google’s search gallery, so you cannot expect to obtain a rich snippet on the SERPs by using it; however, it is still a good choice when providing discounts or working with a coupons website.
  • ImageObject and VideoObject are quite similar, so we will cover them together. They are particularly useful when you want to provide more images of your products or a video overview (e.g., gaming products). The 3DModel schema markup can also prove interesting for advanced ecommerce brands that use augmented reality to show off their products in a more interactive way during the Black Friday and Cyber Monday season.
  • Language schema can also be very interesting for you -> especially if you’re struggling to get buy-in to implement hreflang (which, from my experience, requires a somewhat more complex technical setup). I’ve used it as a workaround in the past to fix issues when we had multiple stores with similar languages interfering with each other’s organic rankings. I can confirm first-hand that I managed to decrease irrelevant traffic coming to our core pages, so this is something you can consider too.
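To make the SaleEvent markup concrete, here is a minimal, illustrative JSON-LD example that wraps an Offer. The name, dates, URLs and prices are placeholders to replace with your own campaign data:

{
  "@context": "https://schema.org",
  "@type": "SaleEvent",
  "name": "Black Friday Sale 2024",
  "startDate": "2024-11-29T00:00",
  "endDate": "2024-12-02T23:59",
  "eventAttendanceMode": "https://schema.org/OnlineEventAttendanceMode",
  "eventStatus": "https://schema.org/EventScheduled",
  "location": {
    "@type": "VirtualLocation",
    "url": "https://www.example.com/black-friday"
  },
  "offers": {
    "@type": "Offer",
    "price": "49.99",
    "priceCurrency": "EUR",
    "availabilityStarts": "2024-11-29T00:00",
    "availabilityEnds": "2024-12-02T23:59",
    "url": "https://www.example.com/black-friday/product-123"
  }
}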

Selling More And Going Beyond Prompting And Schema Markups

Customers’ expectations go beyond prompting and simple schema markup fixes. To help you position yourself as competitively and intelligently as possible, we developed our Business + E-commerce Plan, which uses WordLift and the Product Knowledge Graph Builder. By using both, you can easily import your data from your Merchant Feed and enrich it with structured data, streamlining your schema markup creation and building a basis for developing new customer experiences on top of that knowledge base.

FAQs

How can SEO help maximize Black Friday sales?

Search Engine Optimization (SEO) can play a critical role in maximizing Black Friday sales by driving targeted traffic to a company’s website and increasing visibility for key products and promotions. By optimizing website content, product pages, and landing pages for relevant keywords, companies can increase their chances of ranking higher in search engine results pages (SERPs) and attracting more potential customers to their sites. Additionally, SEO can help improve the user experience by ensuring that the website is easy to navigate and mobile-friendly, which can lead to higher conversion rates and increased revenue during the Black Friday sales event.

What are the top SEO techniques for Black Friday promotions?

The top SEO techniques for Black Friday promotions include:

  1. Keyword research and optimization: Identify relevant keywords and phrases related to Black Friday deals and promotions, and optimize your website content, product pages, and meta tags to rank higher in search engine results.
  2. On-page optimization: Ensure that your website is optimized for Black Friday, with clear navigation, fast loading times, and mobile responsiveness.
  3. Content creation and promotion: Create valuable and relevant content related to Black Friday promotions, and promote it through social media, email marketing, and other channels.
  4. Structured data markup: Use structured data markup, such as schema.org, to provide search engines with additional information about your Black Friday deals and promotions.
  5. Backlinks: Acquire high-quality backlinks from reputable websites to improve your website’s authority and relevance.
  6. Social media promotion: Use social media platforms to promote your Black Friday deals and promotions, and engage with customers to build brand awareness.
  7. Local SEO: Optimize your website for local search, including creating a Google My Business listing and obtaining customer reviews.
  8. Influencer marketing: Partner with influencers in your industry to promote your Black Friday deals and reach new audiences.
  9. A/B testing: Test different variations of your website and marketing campaigns to identify the most effective strategies for driving traffic and sales.
  10. Analytics and tracking: Track your website traffic, conversion rates, and other key metrics to measure the effectiveness of your SEO efforts and make data-driven decisions.

Are there any specific SEO tips for optimizing Black Friday landing pages?

Here are some SEO tips for optimizing Black Friday landing pages:

  1. Target relevant keywords throughout your landing page content.
  2. Craft compelling meta titles and descriptions.
  3. Optimize page speed for a better user experience.
  4. Ensure mobile optimization for smartphone users.
  5. Include a clear and persuasive call-to-action (CTA).
  6. Create high-quality and engaging content.
  7. Utilize internal and external linking strategies.
  8. Integrate social sharing buttons for increased visibility.
  9. Monitor and analyze landing page performance using web analytics tools.

What are the common SEO mistakes to avoid for Black Friday campaigns?

Here are some common SEO mistakes to avoid for Black Friday campaigns:

  1. Neglecting keyword research and targeting the wrong audience.
  2. Ignoring page speed optimization, which leads to high bounce rates.
  3. Overlooking mobile optimization, impacting rankings and user experience.
  4. Having poorly written or thin content that lacks quality and relevance.
  5. Missing or poorly optimized meta tags, reducing click-through opportunities.
  6. Neglecting internal and external linking strategies.
  7. Failing to incorporate social sharing buttons for increased visibility.
  8. Inadequate monitoring and analysis of key metrics for optimization.

Meeting DSPy: From Prompting to Programming Language Models

Are you exhausted from constant prompt engineering? Do you find the process fragile and tiresome? Are you involved in creating workflows using language models? If so, you might be interested in DSPy. This blog post provides a gentle introduction to its core concepts.

While building robust neuro-symbolic AI workflows, we’ll explore the synergy between LMs and graph knowledge bases within digital marketing and SEO tasks.

Table of contents:

  1. What is DSPy?
  2. Let’s Build Our First Agents
  3. Automated Optimization Using DSPy Compiler
  4. Creating a Learning Agent
  5. Implementing Multi-Hop Search with DSPy and WordLift
  6. Conclusion and Future Work

What is DSPy?

DSPy is an acronym that stands for Declarative Self-Improving Language Programs. It is a framework developed by the Stanford NLP team that aims to shift the focus from using LMs with orchestrating frameworks like LangChain, Llama Index, or Semantic Kernel to programming with foundational models. This approach addresses the need for structured and programming-first prompting that can improve itself over time.

A Real Machine Learning Workflow🤩

For those with experience working with PyTorch and machine learning in general, DSPy will feel like an excellent tool: it is designed around the same concepts you use when constructing neural networks. Let me explain: when starting a machine learning project, you typically begin by working on datasets, defining the model, running the training, configuring the evaluation, and testing.

DSPy simplifies this process by providing general-purpose modules like ChainOfThought and ReAct, which you can use instead of complex prompting structures. Most importantly, DSPy brings general optimizers (BootstrapFewShotWithRandomSearch or BayesianSignatureOptimizer), which are algorithms that will automatically update the parameters in your AI program.

You can recompile your program whenever you change your code, data, assertions, or metrics, and DSPy will generate new effective prompts that fit your modifications.

DSPy’s design philosophy is the opposite of thinking that “prompting is the future of programming”: you build modules to express the logic behind your task and let the framework deal with the language model.

Core Concepts Behind DSPy💡

Let’s review the fundamental components of the framework:

  • Signatures: Here is how you abstract both prompting and fine-tuning. Imagine signatures as the core directive of your program (e.g., read a question and answer it, do sentiment analysis, optimize title tags). This is where you define the inputs and outputs of your program; it’s the contract between you and the LM. Question answering, for example, is represented as question -> answer, sentiment analysis as sentence -> sentiment, title optimization as title, keyword -> optimized title, and so on.
  • Modules: They are the building blocks of your program. Here is where you define how things shall be done (e.g., use Chain of Thought or act as an SEO specialist, etc.). Using parameters here, you can encode your prompting strategies without hand-crafting prompts. DSPy comes with pre-defined modules (dspy.Predict, dspy.ChainOfThought, dspy.ProgramOfThought, dspy.ReAct, and so on). Modules use the Signature as an indication of what needs to be done.
  • Optimizers: To improve the accuracy of the LM, a DSPy Optimizer will automatically update the prompts or even the LM’s weights to improve on a given metric. To use an optimizer, you will need the following:
    • a DSPy program (like a simple dspy.Predict or a RAG),
    • a metric to assess the quality of the output,
    • a training dataset to identify valuable examples (even small batches like 10 or 20 would work here).
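To tie these three concepts together, here is a tiny sketch for the title-optimization signature mentioned above. It assumes a language model has already been configured with dspy.settings.configure(lm=...), and the field names are illustrative:

import dspy

class OptimizeTitle(dspy.Signature):
    """Rewrite a page title so that it targets the given keyword."""
    title = dspy.InputField()
    keyword = dspy.InputField()
    optimized_title = dspy.OutputField()

# The module encodes how the task gets done: Chain of Thought over the signature.
optimize_title = dspy.ChainOfThought(OptimizeTitle)

prediction = optimize_title(title="Black Friday deals", keyword="black friday laptop deals")
print(prediction.optimized_title)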

Let’s Build Our First Agents 🤖

Without further ado, let’s begin with a few examples and dive into the implementation by following a simple Colab Notebook.

We will:

  • Run a few basic examples for zero-shot prompting, entity extraction, and summarization. These are simple NLP tasks that we can run using LMs. We will execute them using DSPy to grasp its intrinsic logic.
  • After familiarizing ourselves with these concepts, we will implement our first Retrieval-Augmented Generation (RAG) pipeline. RAG allows LMs to tap into a large corpus of knowledge from sources such as a Knowledge Graph (KG). The RAG will query the KG behind this blog to find the relevant passages/content that can produce a well-refined response. Here, we will construct a DSPy retriever using WordLift’s Vector Search. It is our first time using this new functionality from our platform 😍.
  • We will then:
    a. Compile a program using the RAG previously created.
    b. Create a training dataset by extracting question-answer pairs from our KG (we will extract a set of schema:faqPages).
    c. Configure a DSPy Optimizer to improve our program.
    d. Evaluate the results.

A Simple Workflow For Content Summarization🗜️ 

Let’s create a signature to elaborate a summary from a long text. This can be handy for generating the meta description of blog posts or other similar tasks. We will simply instruct DSPy to use long_context -> tldr using the Chain Of Thought methodology.
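A minimal sketch of that step could look like the following; the variable holding the article text and the field names are assumptions:

import dspy

summarize = dspy.ChainOfThought("long_context -> tldr")

post_body = "... the full text of the blog post ..."
summary = summarize(long_context=post_body)
print(summary.tldr)  # a short summary, usable as a draft meta description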

Worth noting: we didn’t have to write any prompts!

WordLift DSPy Retriever🔎

The next step is to use WordLift’s Knowledge Graph and its new semantic search capabilities. DSPy supports various retrieval modules out of the box, such as ColBERTv2, AzureCognitiveSearch, Pinecone, Weaviate, and now also WordLift 😎. 

Here is how we’re creating the WordLiftRetriever, which, given a query, will provide the most relevant passages.
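A sketch of what such a retriever could look like is below. DSPy custom retrievers typically subclass dspy.Retrieve and return a dspy.Prediction with a list of passages; the endpoint, authentication header and response shape used here for WordLift’s Vector Search are illustrative assumptions rather than the documented API:

import requests
import dspy

class WordLiftRetriever(dspy.Retrieve):
    """Hypothetical retrieval module backed by WordLift Vector Search."""

    def __init__(self, api_key, k=3):
        super().__init__(k=k)
        self.api_key = api_key

    def forward(self, query, k=None):
        k = k or self.k
        response = requests.post(
            "https://api.wordlift.io/vector-search/queries",  # assumed endpoint
            headers={"Authorization": f"Key {self.api_key}"},
            json={"query": query, "similarity_top_k": k},
        )
        response.raise_for_status()
        passages = [item["text"] for item in response.json().get("items", [])]
        return dspy.Prediction(passages=passages)

wl_retriever = WordLiftRetriever(api_key="YOUR_WORDLIFT_KEY", k=3)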

Once we have a retriever, building a RAG using DSPy is quite straightforward. We begin by setting up both the language model and the new retriever with the following line:

dspy.settings.configure(lm=turbo, rm=wl_retriever)

The RAG comprises a signature made of a context (obtained from WordLift’s KG), a question, and an answer.
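In code, a bare-bones version of that module, in line with DSPy’s introductory RAG example and using the retriever configured above, could look like this:

import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question using the passages retrieved from the knowledge graph."""
    context = dspy.InputField(desc="relevant passages from WordLift's KG")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short, grounded answer")

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)  # uses the configured retriever (rm)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)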

Automated Optimization Using DSPy Compiler

Using the DSPy compiler, we can now optimize the performance or efficiency of an NLP pipeline by simulating different program versions and bootstrapping examples to construct effective few-shot prompts.

I like this about DSPy: not only are we moving away from chaining tasks, but we’re using programming and can rely on the framework to automate the prompt optimization process. 

The “optimizer” component, previously known as the teleprompter, helps refine a program’s modules by optimizing their prompts or fine-tuning them. This is the real magic of using the DSPy framework. As we feed more data into our Knowledge Graph, the AI agent we create using DSPy evolves to align its generation with the gold standard we have established.
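As a minimal sketch of what this looks like in practice, here is a compile step using the BootstrapFewShot optimizer over the RAG module sketched earlier. The question-answer pairs and the metric are placeholders; in a real setup they would come from your knowledge graph and your own validation logic:

import dspy
from dspy.teleprompt import BootstrapFewShot

# A tiny training set of question-answer pairs (e.g., extracted from FAQ pages).
trainset = [
    dspy.Example(question="What is ontology-based prompting?",
                 answer="Prompting that feeds structured data to an LLM.").with_inputs("question"),
    dspy.Example(question="What does WordLift's Vector Search return?",
                 answer="The most relevant passages from the knowledge graph.").with_inputs("question"),
]

def answer_match(example, pred, trace=None):
    # Placeholder metric: does the prediction contain the gold answer?
    return example.answer.lower() in pred.answer.lower()

teleprompter = BootstrapFewShot(metric=answer_match)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)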

Let’s Create a DSPy Program

DSPy programs like the one built in the Colab help us with tasks like question answering, information extraction, or content optimization.

As with traditional machine learning, the general workflow comprises these steps:

  • Get data. To train your program, you will need some training data. To do this, you should provide examples of the inputs and outputs that your program will use. For instance, collecting FAQs from your blog will give you a relevant set of question-answer pairs. Using at least 10 samples is recommended, but remember that the more data you have, the better your program will perform.
  • Write your program. Define your program’s modules (i.e., sub-tasks) and how they should interact to solve your task. We are using primarily a RAG with a Chain Of Thought. Imagine using control flows if/then statements and effectively using the data in our knowledge base and external APIs to accomplish more sophisticated tasks. 
  • Define some validation logic. What makes for a good run of your program? Maybe the answers we have already marked up as FAQs? Maybe the best descriptions for our products? Specify the logic that will validate that.
  • Compile! Ask DSPy to compile your program using your data. The compiler will use your data and validation logic to optimize the program (e.g., prompts and modules).
  • Iterate. Repeat the process by improving your data, program, and validation or using more advanced DSPy compiler features.
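To support the validation and iteration steps above, a small evaluation run could look like the sketch below, reusing the metric and the compiled program from the earlier examples; in practice the devset should contain examples not seen during compilation:

from dspy.evaluate import Evaluate

devset = trainset  # placeholder: use held-out question-answer pairs here

evaluate = Evaluate(devset=devset, metric=answer_match, num_threads=4, display_progress=True)
evaluate(compiled_rag)  # prints the score of the compiled program on the devset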

Creating a Learning Agent

By combining DSPy with curated data in a graph, we can create LM-based applications that are modular, easy to maintain, self-optimizing, and robust to changes in the underlying models and datasets. The synergies between semantic data and a declarative framework like DSPy enable a new paradigm of LLM programming, where high-level reasoning strategies (i.e., optimize the product description by reading all the latest reviews) can be automatically discovered, optimized, and integrated into efficient and interpretable pipelines. 

DSPy is brilliant, as it creates a new paradigm for AI agent development. Using the DSPy compiler, we can ground the generation in the information we store in our knowledge graph and have a system that is self-optimizing and easier to understand.

Here is DSPy’s teleprompter and compiler pipeline, which helps us create a modular, extensible, self-optimizing RAG system that adapts by leveraging human-annotated question-answer pairs on our website! 

When dealing with complex queries that combine multiple information needs, we can implement a sophisticated retrieval mechanism, Multi-Hop Search (“Baleen” – Khattab et al., 2021), to help us find different parts of the same query in different documents. 

Using DSPy, we can recreate such a system that will read the retrieved results and generate additional queries to gather further information when necessary. 

We can do it with only a few lines of code.

Let’s review this bare-bones implementation. The __init__ method defines a few key sub-modules:

  • generate_query: We use the Chain of Thought predictor within the GenerateSearchQuery signature for each turn. 
  • retrieve: This module uses WordLift Vector Search to do the actual search using the generated queries. 
  • generate_answer: This dspy.Predict module is used after all the search steps. It uses a GenerateAnswer signature to produce the final answer.

The forward method uses these sub-modules in a simple control flow. First, we loop up to self.max_hops times. During each iteration, we generate a search query using the predictor at self.generate_query[hop], retrieve the top-k passages using that query, and add the (deduplicated) passages to our accumulator of context. After the loop, we use self.generate_answer to produce the final answer and return a prediction with the retrieved context and the predicted answer.
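Here is what such a bare-bones module could look like. The sketch is close to the SimplifiedBaleen example in the DSPy documentation; it assumes the GenerateAnswer signature from the RAG sketch above and adds a GenerateSearchQuery signature along the lines of context, question -> query:

import dspy
from dsp.utils import deduplicate

class GenerateSearchQuery(dspy.Signature):
    """Write a search query that helps answer a complex question."""
    context = dspy.InputField(desc="passages gathered so far")
    question = dspy.InputField()
    query = dspy.OutputField()

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        # One query generator per hop, plus the shared retriever and the final answerer.
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.Predict(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            # Generate a new query from what we know so far, then search again.
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        answer = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=answer.answer)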

Quite interestingly, we can inspect the last calls to the LLM with a simple command: turbo.inspect_history(n=3). This is a practical way to examine the extensive optimization work done automatically with these very few lines of code. 

Conclusion and Future Work

As new language models emerge with advanced abilities, there is a trend to move away from fine-tuning and towards more sophisticated prompting techniques.

The combination of symbolic reasoning enabled by function calling and semantic data requires a robust AI development and validation strategy. 

While still at its earliest stage, DSPy represents a breakthrough in orchestration frameworks. It improves language programs with more refined semantic data, clean definitions, and a programming-first approach that best suits our neuro-symbolic thinking.

Diving deeper into DSPy will help us improve our tooling and Agent WordLift’s skills in providing more accurate responses. Evaluation in LLM applications remains a strategic goal, and DSPy brings the right approach to solving the problem. 

Imagine the potential advancements in generating product descriptions as we continuously enrich the knowledge graph (KG) with additional training data. Integrating data from Google Search Console will allow us to pinpoint and leverage the most effective samples to improve a DSPy program. 

Beyond SEOs and digital marketing, creating self-optimizing AI systems raises ethical implications for using these technologies. As we develop increasingly powerful and autonomous AI agents and workflows, it is vital that we do so responsibly and in a way that is fully aligned with human values.

Are you evaluating the integration of Generative AI within your organization to enhance marketing efforts? I am eager to hear your plans. Drop me a line with your thoughts.

References

Khattab, O., Potts, C., & Zaharia, M. (2021). Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval. Advances in Neural Information Processing Systems (NeurIPS 2021).

AI Content Protection: Understanding Watermarking Essentials

AI has transformed content creation, enabling the production of text, images, video, and music with unprecedented ease and speed. However, this remarkable progress also introduces significant ethical and transparency challenges in using AI-generated content.

This situation threatens the intellectual property rights of those who develop and train AI systems and the overall value and integrity of the content produced. To combat these problems, measures must be implemented to ensure that AI-generated works are used responsibly and that their creators are duly recognized.

The concept of AI watermarking, a mechanism designed to embed a unique and identifiable mark within AI-generated content, has been introduced. This makes the origin of content explicit, so users can easily tell what was created by AI and what was created by a human.

In this article, we will explore the importance of AI watermarking and the various methods available and discuss the challenges of implementing these protections.

In addition, we will examine the implications of the AI Act, standards for AI-generated content, Google’s position on AI-generated content recognition, and its efforts in AI watermarking. This comprehensive overview highlights the importance of ethical practices in creating AI content and the steps taken to ensure its responsible use.

In this blog, we’ll cover:

  1. What is AI watermarking?
  2. Why AI watermarking matters
  3. The AI Act and its relevance to AI watermarking
  4. AI watermarking methods
  5. Standards for AI-generated content
  6. Google’s approach to watermarking AI
  7. Potential challenges in AI watermarking

What is AI Watermarking?

AI watermarking is a method used to protect and identify AI-generated images and written content like blog posts. In simple terms, it involves embedding sophisticated watermarks and secret patterns into content created by AI tools.

This digital watermark isn’t just any random marker — it’s a specific identifier unique to the creator or model developer. It can take multiple forms (visible or invisible), depending on the needs of the content and its intended use.

The way AI watermarking works is quite fascinating. When AI produces content, a watermark — a series of data points, patterns, or codes — is integrated into the content.

These subtle patterns don’t alter the quality or appearance of the content for the end user. However, specific tools or techniques can detect and read this embedded data.

Suppose ‌AI-generated content is used without permission. In that case, the watermark traces the content back to its source, proving its origin and helping enforce intellectual property rights.

This mechanism is crucial where content can be easily copied and distributed, ensuring creators and model developers maintain control and recognition for their work.

Why AI Watermarking Matters

As generative AI models evolve and become more capable of creating diverse content, the need to safeguard these creations becomes critical. 

Without protective measures, AI-generated work is susceptible to various risks, the most concerning being theft and unauthorized use.

In an age where tools like Wordable help content teams publish and promote more digital content than ever, the absence of a watermark means that creators and AI developers may lose control over their work. This leads to potential revenue loss and dilutes the credit and recognition that creators rightfully deserve.

Moreover, unwatermarked AI work can be misused or misrepresented. As a result, it could harm the reputation of the creator of the AI system.

That’s why AI watermarking serves as a crucial tool to uphold the rights of creators and model developers and fosters a more responsible and ethical use of AI content.

The AI Act and Its Relevance to AI Watermarking

The European Parliament’s recent approval of the AI Act marks a significant milestone in the regulation of artificial intelligence technologies within the EU. This groundbreaking legislation aims to ensure that AI systems, including generative AI models such as ChatGPT, adhere to strict transparency requirements and comply with EU copyright law.

Among the key obligations outlined in the law is the need for AI-generated content to be clearly identifiable as such. This is particularly important when the content is intended to inform the public about matters of public interest, where it must be explicitly labeled as artificially generated. This directive includes not only text, but also audio and video content, highlighting the law’s comprehensive approach to AI regulation.

The law’s emphasis on the identifiability of AI-generated content underscores the growing importance of “watermarking for AI,” a practice that ensures that AI-created digital content can be distinguished from human-generated content. As the AI Act takes effect, watermarking for AI will play a key role in maintaining transparency and trust in the digital landscape by ensuring that consumers can easily recognize AI-generated content.

AI Watermarking Methods

AI watermarking can be categorized into visible and invisible (or hidden) watermarks.

Visible watermarks

These are overt markers that are easily perceptible to the viewer. They’re often used in images and videos to denote clear ownership or origin.

AI-powered visible watermarks come in various forms, each tailored to specific needs and applications.

Text-based watermarks

Here, the AI algorithm creates and embeds textual information like names, logos, or copyright notices directly onto the content. These can be customized in font, size, color, and placement to ensure visibility without detracting from the content’s aesthetics.

Graphic watermarks

Graphic watermarks embed symbols, logos, or other graphic elements. AI can adapt the watermark’s opacity and blending to match the content. The goal is to ensure it’s noticeable but not obtrusive. 

This type of AI watermark is particularly popular in visual media, such as photographs and videos.

Pattern-based watermarks

In pattern-based watermarks, AI creates a unique and secret pattern or a series of shapes integrated into the content. These patterns can be geometric shapes, abstract designs, or even QR codes. AI helps in seamlessly integrating these subtle patterns into the content, sometimes even using color-matching techniques to maintain the overall look and feel.

Dynamic watermarks

These are particularly useful in video content, where the watermark changes position, size, or appearance throughout the video to prevent removal. 

AI algorithms can analyze the video content in real time and decide the most effective placement and form for the watermark. Like graphic watermarks, the main goal is to remain effective and minimally intrusive throughout the video.

Invisible watermarks

Unlike their visible counterparts, invisible watermarks are hidden within the content. They’re undetectable to the naked eye. These are often used when the visual integrity of the content is paramount.

Digital watermarks

Digital watermarks are ideal for images or videos. 

Why? They subtly modify pixel values in images or video frames and are undetectable to the human eye. The only way to spot them is via specialized software.

That said, this type of AI watermarking is popular in visual media to protect copyright without impacting the visual experience.

For instance, Google DeepMind developed a watermarking tool for AI-generated images, which subtly modifies certain pixels in an image to create a hidden pattern. 

The naked eye can’t tell if an image is watermarked. Another neural network can then detect this pattern, confirming whether the image has a watermark. 


This method guarantees that the watermark can still be detected even after the image is edited or altered in some way, such as being screenshotted or resized.

Audio watermarking

In audio watermarking, information is embedded in an audio file at frequencies not detectable by the human ear. This method is preferred in the music industry to track and manage copyright in digital music distribution.


Amazon, for example, uses an audio watermarking algorithm to embed watermarks in the audio signal of their Alexa ads.

Text watermarking

Text watermarking can fall into both visible and invisible categories. In the invisible method, the AI subtly alters characters or spaces in a document. These alterations are indiscernible during casual reading but can be identified to prove authorship or origin.

Data watermarking

In data watermarking, AI algorithms embed unique identifiers within a dataset. This framework is particularly important in machine learning, where datasets are assets. 

The watermark doesn’t significantly change the dataset’s statistical properties, ensuring it remains useful for its intended purpose while embedding proof of ownership.

Cryptographic watermarks

Cryptographic methods involve encoding a digital signature or hash into the content. It’s one of the more secure forms of watermarking, as the embedded information is encrypted and can only be decoded or verified with the correct key. 

In other words, it adds an extra layer of security and authentication to the content. Implementing a DMARC policy further strengthens email security, safeguarding against unauthorized access and ensuring secure communication channels.
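As a toy illustration of the keyed-verification idea (not a production watermarking scheme, which embeds the signal inside the media itself), one could compute an HMAC of the content with a secret key and store it alongside the asset; only holders of the key can later verify it:

import hmac
import hashlib

SECRET_KEY = b"replace-with-a-real-secret"

def sign_content(content: bytes) -> str:
    # Keyed signature to store with the content's metadata.
    return hmac.new(SECRET_KEY, content, hashlib.sha256).hexdigest()

def verify_content(content: bytes, signature: str) -> bool:
    # Only someone holding SECRET_KEY can recompute and confirm the signature.
    expected = hmac.new(SECRET_KEY, content, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

tag = sign_content(b"AI-generated article body ...")
assert verify_content(b"AI-generated article body ...", tag)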

Model watermarking

Model watermarking embeds a unique identifier or pattern into a machine-learning model. This watermark isn’t directly visible in the model’s output or behavior under normal operation. As a result, it’s a covert method to assert ownership or authorship of the model.

The watermark in model watermarking is often embedded during the model’s training process, achieved by introducing specific patterns or data into the training dataset, which the model then learns and integrates into its internal parameters.

The embedded watermark doesn’t significantly alter the model’s performance but can be detected by applying specific tests or inputs. This allows the original creator to claim ownership or detect unauthorized copies of the model.

Standards for AI-Generated Content

Given the need to know clearly whether a piece of content was generated with AI, the International Press Telecommunications Council (IPTC) has taken a significant step forward by publishing a Photo Metadata User Guide. This guide provides comprehensive instructions on utilizing embedded metadata to mark content as “synthetic media,” explicitly indicating its creation by generative AI systems.

Further advancing the cause for transparency and authenticity in digital media, the Coalition for Content Provenance and Authenticity (C2PA) is at the forefront of developing technical standards. Through its C2PA Specification, the coalition aims to establish a robust framework for certifying media content’s source and history (or provenance). This initiative is crucial for ensuring the integrity of digital media and fostering trust in the digital ecosystem.

Google’s Approach to Watermarking AI

Google’s proactive measures to ensure the transparency and authenticity of AI-generated content through watermarking and metadata are commendable steps towards responsible AI usage. Sundar Pichai’s emphasis on embedding these features from the beginning highlights Google’s commitment to content authenticity. By advocating for the IPTC Digital Source Type property, Google aims to create a more transparent digital environment, although the implementation in Google Images search results is still a work in progress.

Despite these efforts, challenges remain in accurately recognizing AI-generated content and assessing its quality in terms of Expertise, Authoritativeness, Trustworthiness, and Experience (E-E-A-T). Google’s algorithms, while sophisticated, are not infallible and can sometimes struggle to differentiate between high-quality content and poorly crafted AI-generated material. An illustrative example provided by Andrea Volpini underscores this point vividly. He points out a glaring error in which the AI mistakenly suggested that Italy still has a dual currency, when in reality it switched to the euro some 25 years ago, an amusing but troubling demonstration of the potential of AI to spread inaccurate information.

This example not only showcases the limitations of AI in evaluating E-E-A-T but also underscores the importance of rigorous article fact-checking.

It ensures that information disseminated to the public is accurate, reliable, and trustworthy. Google’s initiatives, while forward-thinking, must be complemented by continuous improvements in AI’s ability to discern and evaluate the quality of content accurately. This includes enhancing AI’s understanding of context, historical facts, and the nuances of human knowledge to prevent the surfacing of misleading or incorrect information.

Potential challenges in AI Watermarking

Integrating watermarks into AI-generated content has emerged as a crucial strategy. This approach aims to provide clear indicators to users and search engines regarding the origins and production methods of digital content. However, implementing such a strategy demands a careful balance. The quality of the watermark, its robustness against tampering, and its detectability by humans and machines are all critical factors that must be meticulously managed.

A significant challenge in this domain, which also poses a considerable risk, is the dynamic nature of AI development. This is particularly evident in the trend towards utilizing synthetic data to train AI models. Recent research has shed light on a phenomenon known as Model Autophagy Disorder (MAD). MAD describes a cycle where an over-reliance on synthetic data, without incorporating sufficient real-world data, leads to a gradual decline in the quality and diversity of generative models. This issue underscores the complex interplay in AI content creation and raises important considerations for developing effective watermarking strategies.

In response to these challenges, there is a growing consensus on addressing these issues at the metadata level. One promising approach is introducing a new property within the Schema.org framework. This property would provide detailed information about the type of data utilized for content generation and the content generation process itself. This strategy aims to foster trust and credibility in AI-generated content by enhancing transparency and mitigating risks associated with synthetic data.

WordLift, operating at the intersection of AI and content creation, recognizes the significance of these developments. As a pioneer in the use of semantic technologies and AI to enhance digital content, WordLift is positioned to contribute to the discourse on watermarking AI-generated content. WordLift plays a pivotal role in shaping the future of ethical and transparent AI content creation by advocating for the adoption of advanced metadata strategies and supporting the integration of transparent content. Through its expertise in semantic web technologies and AI, WordLift is committed to promoting best practices that ensure the integrity and trustworthiness of digital content in the age of artificial intelligence.

Wrapping up

The rapid popularity of AI-generated content has created a pressing need for effective tools to safeguard intellectual property, verify authorship, and maintain the integrity of digital assets. Despite some hurdles in developing foolproof watermarking techniques, the benefits of AI watermarking can’t be overlooked.

These include:

  • Enhanced traceability of content to its source
  • Deterring unauthorized use
  • Plagiarism checking

It’s likely that, as AI continues to evolve, so too will the methods to protect and manage its outputs. AI watermarking methods will only become even more robust and secure.

Navigating the Future of AI Regulation: Insights from the WAICF 2024

Table of contents:

  1. The World AI Cannes Festival: Innovation, Strategic Partnerships and the Future of Humanity in the Age of AI
  2. Navigating the Global Wave of AI Regulation
  3. Ethical AI and Compliance: WordLift’s Proactive Approach
  4. Contextualising Corporate Strategies: Navigating Open Issues in AI Regulation within the Larger Landscape

The World AI Cannes Festival: Innovation, Strategic Partnerships and the Future of Humanity in the Age of AI

The World AI Cannes Festival (WAICF) stands as a premier event in Europe, attracting decision-makers, companies, and innovators at the forefront of developing groundbreaking AI strategies and applications. With an impressive attendance of 16,000 individuals, featuring 300 international speakers and 230 exhibitors, the festival transforms Cannes into the European hub of cutting-edge technologies, momentarily shifting focus from its renowned status as a global cinema stage.

This year marked WordLift’s inaugural participation in the festival, where we capitalised on the diverse opportunities the event offered. We were exposed to a myriad of disruptive applications, such as the palm-based identity solution showcased by Amazon to streamline the payment and buying experience for consumers. Furthermore, we observed the emergence of strategic partnerships among key market players, exemplified by the collaboration between AMD and Hugging Face. As Julian Simon, Chief Evangelist of Hugging Face, aptly stated, “There is a de facto monopoly on computers today, and the market is hungry for supply.”

Engaging in thought-provoking discussions surrounding the future intersections of humanity and AI was a highlight of the event. One of the most captivating keynotes was delivered by Yann LeCun, the chief AI scientist of Meta and a pioneer in Deep Learning. LeCun discussed the limitations of Large Language Models (LLMs), emphasising that their training is predominantly based on language, which constitutes only a fraction of human knowledge derived mostly from experience. One of his slides provocatively titled “Auto-regressive LLMs suck” underscored his message that while machines will eventually surpass human intelligence, current models are far from achieving this feat. LeCun also shared insights into his latest work aimed at bridging this gap.

Navigating the Global Wave of AI Regulation

Allowing the more technically equipped participants to delve into discussions about the technical advancements showcased in Cannes, I will instead focus on a topic that, while less glamorous, holds great relevance: the anticipated impact of forthcoming AI regulation on innovation and players in the digital markets. This theme was prominent during the festival, with several talks dedicated to it, and many discussions touching upon related aspects of this trend.

Although in Europe the conversation predominantly revolves around the finalisation of the AI Act (with its final text expected in April 2024, following the EU Parliament’s vote), it’s essential to recognize that this is now a global trend. Pam Dixon, executive director of the World Privacy Forum, presented compelling data illustrating the exponential rise in governmental activities concerning AI regulation, highlighting the considerable variations in responses across jurisdictions. While some initially speculated that AI regulation might follow a path similar to GDPR, establishing a quasi-global standard in data protection to which most entities would adapt, it’s becoming evident that this won’t be the case. The OECD AI Observatory, for instance, is compiling a database of national AI policy strategies and initiatives, currently documenting over 1,000 policy initiatives from 70 countries worldwide.

One audience question particularly resonated with me: ‘If you are a small company operating in this evolving ecosystem, facing the challenges of this emerging regulatory landscape, where should you begin?’ To be honest, there’s no definitive answer to this question at the moment. Although the AI Act has yet to become EU law, and its effective enforcement timelines are relatively lengthy, WordLift, like many others in this industry, is already fielding numerous requests from customers seeking reassurance on our compliance strategies. Luckily, WordLift has been committed to fostering a responsible approach to innovation since its establishment.

Ethical AI and Compliance: WordLift’s Proactive Approach

For those working at the intersection of AI and search engine optimization (SEO), ethical AI practices are a paramount concern. WordLift has conscientiously crafted an approach to AI aimed at empowering content creators and marketers while upholding fundamental human values and rights. Previous contributions on this blog have covered various aspects of ethical AI, including legal considerations, content creation, and the use of AI in SEO for enterprise settings, explaining in detail how WordLift translates the concept of trustworthy AI into company practices, ensuring that its AI-powered tools and services are ethical, fair, and aligned with the best interests of users and society at large.

While the AI Act mandates that only high-risk AI system providers undertake an impact assessment to identify the risks associated with their initiatives and apply suitable risk management strategies, at WordLift we have proactively seized this opportunity to enhance communication with stakeholders, developing a framework articulating our company’s principles across four main pillars:

  1. Embracing a ‘Human-in-the-loop’ approach to combine AI-based automation with human oversight, in order to guarantee content excellence.
  2. Ensuring Data Protection & IP through robust processes safeguarding client data, maintaining confidentiality, and upholding intellectual property rights.
  3. Prioritising Security with a focus on safeguarding against potential vulnerabilities in our generative AI services architecture.
  4. Promoting Economic and Environmental Sustainability by committing to open-source technologies and employing small-scale AI models to minimise our environmental footprint.

We are currently in the process of documenting each pillar in terms of the specific choices and workflows adopted. 

Contextualising Corporate Strategies: Navigating Open Issues in AI Regulation within the Larger Landscape

However, it’s essential to contextualise SMEs’ and startups’ compliance policies within the bigger picture, where mergers and partnerships between major players providing critical upstream inputs (such as cloud infrastructure and foundation models) and leading AI startups have become a trend.

This trend is exemplified by the recent investigation launched by the US Federal Trade Commission into generative AI partnerships, and it suggests that the market for Foundation Models (FM) may be moving towards a certain degree of consolidation. This potential consolidation in the upstream markets could have negative implications for the downstream markets where SMEs and startups operate. These downstream markets are mostly those in the red rectangle in the picture below, extracted from the UK CMA review of AI FMs. Less competition in the upstream markets may lead to a decrease in the diversity of business models, and reduce both the flexibility to use multiple FMs and the accountability of FM providers for the outputs produced.

An overview of foundation model development, training and deployment

As highlighted by LeCun in his keynote, we need diverse AI systems for the same reason we need diverse press, and for this the role of Open Source is critical. 

In this respect, EU policymakers have landed, after heated debates, on a two-tier approach to the regulation of FMs. The first tier entails a set of transparency obligations and a demonstration of compliance with copyright laws for all FM providers, with the exception of those used only in research or published under an open-source licence.

The exception does not apply to the second tier, which instead covers models classified as having high impact (or carrying systemic risks, art 52a), a classification presumed on the basis of the amount of compute used for training (expressed in floating-point operations, or FLOPs). According to the current text, today only models such as GPT-4 and Meta’s Llama 2 would fall into the second tier. While the tiering rationale has been criticised by part of the scientific community, the EU legislators seem to have accepted the proportional approach (treating different uses and development modalities distinctly) advocated by open-source ecosystems, and the compromise reached is viewed as promising by the OS community.

The broad exemption of free and open-source AI models from the Act, along with the adoption of the proportionality principle for SMEs (art 60), appears to be a reasonable compromise at this stage. The latter principle stipulates that in cases involving modification or fine-tuning of a model, providers’ obligations should be limited to those specific changes. For instance, this could involve updating existing technical documentation to include information on modifications, including new training data sources. This approach could be successful in regulating potential risks associated with AI technology without stifling innovation.

However, as the saying goes, the devil is in the details. The practical implications for the entire AI ecosystem will only become apparent in the coming months or years, especially when the newly established AI Office, tasked with implementing many provisions of the AI Act, begins its work. Among its many responsibilities, the AI Office will also oversee the adjustment of the FLOPs threshold over time to reflect technological and industrial changes.

In the best-case scenario, legislative clarity will be achieved in the coming months through a flood of recommendations, guidelines, implementing and delegated acts, and codes of conduct (such as the voluntary codes of conduct introduced by art 69 for the application of specific requirements). However, there is concern about the burden this may place on SMEs and startups active in the lower portion of the CMA chart, inundated with paperwork and facing relatively high compliance costs to navigate the new landscape.

The resources that companies like ours will need to allocate to stay abreast of enforcement may detract from other potential contributions to the framework governing AI technology development in the years ahead, such as participation in the standardisation development process. Lastly, a note on a broader yet relatively underdeveloped issue in the legislation: who within the supply chain will be held accountable for damages caused by high-risk AI products or systems? Legal clarity regarding liability is crucial for fostering productive conversations among stakeholders in the AI value chain, particularly between developers and deployers. 

Let’s hope that future iterations of the AI regulatory framework will effectively distribute responsibilities among them, ultimately leading to a fair allocation.

Questions to Guide the Reader

What is the significance of the World AI Cannes Festival (WAICF) for AI innovators and decision-makers?

The festival stands as a premier platform for the exhibition and discourse of AI advancements and strategies. Attending this event offers a unique opportunity to delve into cutting-edge applications, connect with key players across the AI value chain, gain insights into their business strategies, and participate in high-level discussions exploring the evolving intersections of humanity and AI.

How does the anticipated AI regulation in Europe impact innovation and the digital market landscape?

The latest version of the AI Act reflects over two years of negotiations involving political and business stakeholders in the field. The inclusion of broad exemptions for free and open-source AI models, coupled with the adoption of the proportionality principle for SMEs, presents a potential avenue for regulating AI technology’s risks without impeding innovation. However, the true impact will only become evident during implementation. Concerns arise regarding compliance costs, particularly for smaller entities, and the lack of legal clarity surrounding liability, which is vital for facilitating constructive dialogues among stakeholders in the AI value chain, particularly between developers and deployers.

What are WordLift’s strategies for aligning with ethical AI practices and upcoming regulations?

Since its inception, WordLift has adopted a proactive approach characterised by a commitment to ethical AI. Building upon this foundation, the company is now actively preparing for regulatory compliance by articulating a comprehensive framework based on four pillars.

How might the consolidation of the market for Foundation Models (FM) affect SMEs and startups in the AI sector?

As larger companies acquire dominance in the market for FMs, SMEs and startups may face greater hurdles in accessing these foundational technologies, potentially leading to increased dependency on them. This could pose a risk of stifling innovation over time. Regulators must closely monitor upstream markets to prevent a reduction in the diversity of business models, ensuring that smaller players retain flexibility in utilising multiple FMs and holding FM providers accountable for the outputs they generate.

Detecting AI-Generated Content: 6 Techniques to Distinguish Between AI vs. Human-Written Text

In the ever-evolving landscape of…

Just kidding. But seriously. If you’ve seen an article or blog post starting with similar verbiage, odds are artificial intelligence (AI) is the true author of the text.  

Undoubtedly, AI is disrupting nearly every industry in one way or another, driving the stock market to all-time highs.

Generative AI tools are helping companies, employees, and contractors streamline tedious processes. For example, freelance writers and bloggers who use AI say they spend 30% less time writing a blog post.

But this comes with a caveat. AI-generated content isn’t perfect. It often lacks style, personality, and emotion. And it can also get facts wrong and make things up, in a phenomenon now called AI hallucination.

While most would expect artificial intelligence to take the “word of the year” title in 2023, gee was the actual winner. Ironic, right? 

If you do decide to use AI to write your content, it’s important to make sure it doesn’t feature the classic hallmarks of AI. Otherwise, you give away your content strategy within the first few paragraphs (or words). 

So, knowing how to detect AI-generated content is the secret to striking the perfect balance between humans and machines. 

6 Ways to Detect AI-Generated Content

Here are six simple ways you can detect AI-generated content from a mile away. 

1. Proofread the Content

AI is highly efficient in producing content. But that comes at a cost. Often, the content is repetitive. There’s no real voice. And in some cases, the AI you use might make things up as it goes along.

That’s because generative AI tools like ChatGPT are trained on a huge amount of data with a fixed cutoff date. So, asking about current events or information after that date may produce inaccurate results.

That’s why it’s so important to proofread AI content. Seriously, don’t skip this step. All content should feature a natural, simple tone, avoid repetition, and provide accurate information.

Look out for obvious incorrect outputs, such as this attempt at an Amazon product description:

Instead of trying to save time using AI to write product descriptions, leverage WordLift to help you optimize your structured data to boost your chances of landing featured snippets. 

(Image Source)

Doing so will pay off in the long run and drive more traffic to your website. More traffic means more opportunities to convert leads into sales.

Now, let’s round out this first step. Your proofreading should go beyond just checking for grammatical mistakes. In fact, with AI, you’re likely to find zero typos and no grammatical errors at all.

Instead, you might find extremely fancy vocabulary with too much jargon. So, it’s important to look out for this, too. 

That’s where proofreading services, primarily driven by skilled human editors, become invaluable. These services excel in identifying and fixing errors or inconsistencies that novice editors (or AI tools) might overlook.

2. Look for a Flat, Robotic Tone

Because AI writers are powered by, well, artificial intelligence, they lack a human voice. As a result, there may be a lack of personal opinion or emotion. 

Let’s look at this example. Let’s say that you’re a digital marketing agency, and you ask ChatGPT to write two to three sentences on the importance of digital marketing. 

Here’s the response it generates.

Screenshot provided by the author.

When reading this content, can you detect a personality or unique voice?

Probably not. The AI provides some pretty good information on digital marketing. By reading it, the average person can understand the true value of digital marketing.

But if you’re a brand that wants to get your message across in a way that makes you memorable and relatable, this is probably not content you want to share with your audience. 

Why? It’s very monotone. Plus, it lacks emotion and depth. 

Now, let’s look at an example of content with some personality. It’s the same topic but written in a more upbeat, relatable tone:

“We work, shop, and play in a digital world. You can’t afford to not use digital marketing strategies to get noticed and build brand recognition. 

We’re talking strategies like social media promotion, search engine optimization, and email marketing to get your message across and let customers know your brand is here and here to stay.

And because you’re marketing your brand online, you can quickly adapt to changing customer preferences. How? Thanks to data-driven insights that help you continuously improve your strategies.”

See the difference? The brand is talking directly to the audience. It uses relatable language: “You can’t afford…”, “We’re talking,” and “Here to stay.” 

While we’re on the topic of emotionless writing, let’s use HRIS software as another example. This software handles payroll calculations and benefits packages, which may seem purely technical on the surface. 

Humans are the key to integrating that information with: 

  • Anecdotes about employee success stories
  • The use of irony or everyday jargon
  • Quotes from satisfied users

This personal touch, even in a technical context, goes beyond just conveying facts. It offers a human-centered picture of the software’s impact. 

Why is that so important? Connecting with readers on an emotional level makes a lasting impression. And that’s exactly what an HRIS software company wants when emphasizing the human benefits of streamlining HR processes.

In short, a skilled writer can transform a dry manual into a relatable narrative, showing the value of the human touch even in AI-generated content.

3. Use an AI Content Detector

As we’ve touched on, your first task is to manually go through your content to make sure it doesn’t scream, “I was written by AI.” 

Sometimes, it’s a little less obvious, and you need some help sniffing out that AI content. 

Thankfully, you can also use an AI content detector to identify areas that feature AI-generated content characteristics, like repetitiveness and lack of tone and voice.

So, how do AI content detectors work? 

They’re trained on human and AI-generated text to tell the difference between the two. But of course, they’re not always accurate.

(Image Source)

Nonetheless, here are some of the characteristics they look for when detecting AI-generated content:

  • Perplexity: This measures how predictable the content is. AI-generated content tends to have low perplexity, while human writing usually has higher perplexity, reflecting more creative and complex language choices.
  • Burstiness: This measures the variation in the length and structure of sentences. AI content usually has low burstiness, meaning there is little variation in sentence structure and length. Language models tend to predict the most likely next word, which makes sentence length and structure more predictable and is why AI output can sometimes read as monotone.
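
Neither signal requires a commercial detector to approximate. As a rough, hedged illustration (this is not how any particular detection product works), perplexity can be estimated with a small open language model such as GPT-2, and burstiness as the spread of sentence lengths; the sample text below is a placeholder for your own draft:

import math
import re
import statistics

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def estimate_perplexity(text: str, model_name: str = "gpt2") -> float:
    """Approximate the perplexity of `text` under a small open language model."""
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing the input ids as labels returns the average cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Standard deviation of sentence lengths in words: low values mean uniform, 'flat' prose."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

sample = "Paste a few paragraphs of the draft you want to screen here."
print(f"perplexity ≈ {estimate_perplexity(sample):.1f}, burstiness ≈ {burstiness(sample):.1f}")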

Of course, these traits aren’t always true for AI-generated content. Some AI writers are skilled at mimicking human language and tone. 

This makes it difficult to detect AI-generated content, which leaves us in a gray area where we may easily mistake human-crafted articles for AI-made content and vice versa.

Take Cruise America, a Phoenix RV rental company, in their article “13 Travel Goals to Check Off in 2024.” Its crisp simplicity and practical information could lead one reader to assume a human touch, while another might suspect AI.

It can be tough to tell the difference. But AI-detection tools like Undetectable (Forbes’ #1 pick) can help you crack the code. 

Screenshot provided by the author.

With a 90% accuracy score, according to Forbes, Cruise America passes the AI content test. The result? We can confidently say the text was written by a human. 

4. Fact-Check the Content

Distinguishing between AI-generated and human-written text is only getting more and more challenging as venture capital and investor money pour into this technology. 

(Image Source)

Now, advanced AI models can generate highly realistic and coherent content.

However, there are some simpler techniques to help make this distinction. We’ve already touched on basic proofreading. Now, it’s time to check for contextual understanding and unusual or inaccurate information.

For instance, we can apply some of these observations to this article on alternatives to Ozempic for effective weight management, which could be a candidate for AI-generated content due to the complex topic. 

For context, here’s a screenshot of the article.

(Image Source)

Here are some things to consider when trying to determine if the content is written by a human, using the above article as an example:

  • Specific information and details: The article details Ozempic, how it works, who it’s for, how to take it, potential side effects, and its cost. This depth of information is typically associated with human-generated content.
  • Use of citations: The article references percentages and information from clinical trials, suggesting a reliance on factual information. Proper citation is a common feature in human writing.
  • Contextual understanding: The text demonstrates a reasonable understanding of the subject, discussing Ozempic and its use in treating Type 2 Diabetes and weight loss, referencing the current interest in the drug. This suggests a level of contextual awareness.

Whether you’ve used an AI writing tool and want to check your own work or you want to see if someone else has used AI, do a quick fact-check.

If you’re not an expert on that particular topic, you can leverage the AI SEO Agent by WordLift. With its new ability to do fact-checking for you, you can validate claims and reduce the risk of incorporating hallucinations into your content. 

This feature is game-changing because publishing inaccurate content can make you appear less trustworthy and alienate your audience.

5. Look for Repetitive Patterns in the Text

If you’ve used AI writing tools like ChatGPT before, then you’re probably familiar with how AI tends to repeat itself, but in different wordings or phrasings.

Screenshot provided by the author.

Notice how, in the above example, when asked to write a paragraph about eating healthy, the output from ChatGPT repeats the word “offer” or “offering” throughout the text. 

Although the content is informative and shares some valuable tips, it repeats itself and doesn’t vary its word choices and sentence structure. 

Remember, AI models are designed to be cautious and neutral in their outputs, which may result in more conservative language patterns. And this is what makes AI content sometimes look repetitive.

6. Run a Plagiarism Check

AI-generated content lacks the creativity and originality of human writing. That’s because it’s trained on content written by people all over the web. 

As a result, AI writing may include identical or similar sentences from other publishers.

So, if you run a plagiarism check on a piece of content and it comes back with results, it’s possible that the content was AI-generated.

Screenshot provided by the author.
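
Commercial plagiarism checkers compare your draft against a web-scale index, which you cannot reproduce locally, but the underlying idea is easy to sketch: break the text into word n-grams (“shingles”) and measure how many appear verbatim in a reference document. The file names below are placeholders for your own draft and the sources you suspect it overlaps with:

import re

def shingles(text: str, n: int = 8) -> set:
    """Lowercased word n-grams ('shingles') of a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(draft: str, reference: str, n: int = 8) -> float:
    """Share of the draft's shingles that also appear verbatim in the reference text."""
    d, r = shingles(draft, n), shingles(reference, n)
    return len(d & r) / len(d) if d else 0.0

draft = open("draft.txt", encoding="utf-8").read()    # your article draft (placeholder path)
for path in ["source_a.txt", "source_b.txt"]:          # candidate source texts (placeholder paths)
    score = overlap_ratio(draft, open(path, encoding="utf-8").read())
    print(f"{path}: {score:.1%} verbatim 8-word overlap")

A high overlap ratio with any single source is the same signal a plagiarism tool surfaces, just at a much smaller scale.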

Learn to Detect AI-Generated Content to Build Brand Credibility

While AI content detectors are valuable tools, they aren’t 100% accurate. So, training the human eye to detect AI-generated content is crucial.

Key red flags are repetitiveness, lack of personality, and inaccurate information.

Sure, some AI-written text can pass as human writing. But you’ll become better at telling the difference when you know what to look for.

Use these tips and tricks we’ve shared today, and you’ll be able to detect AI-generated content from a mile away. Say goodbye to poorly written content and hello to engaging, human-written content that converts.

Happy editing!

Fact-Checking in the Age of AI: Navigating Truth, Entities, and SEO

“It is the mark of an educated mind to be able to entertain a thought without accepting it.”

Aristotle – True or False?

In an era where the veracity of information is constantly scrutinized, the quote above is an interesting example, as it is often misattributed to Aristotle. It illustrates the critical need for fact-checking. Fact-checking, I have to admit, is not on the trending side of AI, and most SEOs have, in general, little or no knowledge of its significance. Nevertheless, here at WordLift, we build knowledge graphs that aid machines in discerning facts, and as the new year begins, I decided to take a few steps in this direction. With the help of our fantastic team, I built an API to help publishers and e-commerce platforms semi-automate fact-checking. 

Here is the index for this article. Feel free to skim through.

  1. What is fact-checking?
  2. History of the ClaimReview Markup and Google’s Involvement
  3. ClaimReview markup explained
  4. Fact-Checking for E-commerce Websites
  5. Semi-Automating Fact-Checking with an AI Agent [code available]
  6. Epistemology, Ethics, and SEO
  7. Conclusions and future research work
  8. References

Let’s run a first test (and yes, I will also share the code behind it so that anyone can extend it and improve it). 

Ok, let’s take a step back first. As we navigate the murky waters of misinformation, the role of fact-checking becomes critical in preserving the integrity of discourse and knowledge. With the advent of AI-generated content, the challenge of discerning truth from falsehood has taken on completely new dimensions, demanding more sophisticated, scalable, and reliable verification methods. This intersection of technology and truth-seeking is reshaping the landscape of fact-checking, making it an essential tool for individuals, online publishers, shop owners, and search engines in the quest for accuracy and SEO relevance. 

What Is Fact-Checking?

Fact-checking is the rigorous process of verifying factual assertions in the text (or in media) to ensure the accuracy and credibility of the information presented. This practice has a storied history, tracing back to the early 20th century when magazines began employing fact-checkers to verify what journalists wrote. Over time, fact-checking has evolved from a publishing safeguard to a journalistic specialty, particularly in politics.

In journalism and digital media, fact-checking involves a meticulous process that includes cross-referencing information with credible sources, consulting databases, and sometimes conducting interviews with subject matter experts. Journalists and fact-checkers work in tandem to uphold the integrity of the content, a task that has become increasingly complex with the proliferation of digital platforms where anyone can publish information.

The impact of misinformation cannot be overstated. In our hyper-connected society, false information spreads rapidly, influencing public opinion and shaping political and social discourse. This alpine panorama, for example, doesn’t exist. It is inspired by the beautiful mountains of SalzburgerLand, but created using Midjourney with a simple prompt:

“Equirectangular photograph of a mountain landscape in SalzburgerLand”. 

The consequences of misinformation are far-reaching, affecting everything from public health to democratic processes. 

History of the ClaimReview Markup and Google’s Involvement

Google has played a pivotal role in developing the ClaimReview markup since its inception in 2015. The initiative started when Google, in collaboration with fact-checking leaders like Glenn Kessler from the Washington Post, Bill Adair from Duke Reporters’ Lab, and Dan Brickley 👏 from Schema.org, began to address the challenge of identifying and verifying factual information in the digital news environment.

The primary goal of Google has always been to enable the infrastructure to categorize and identify fact-checking content on the Web systematically.

2016: Introduction of the ‘Fact Check Tab’

Google introduced the ‘Fact Check Tab’ in 2016, a crucial year marked by the U.S. presidential election. This strategic move provided users easy access to fact-checked information during heightened political activity and information dissemination.

Early 2017: Enabling Publishers with Fact Check Tag

Google announced the integration of a fact-check tag in its search and news results. This feature was not about Google conducting its fact checks but aggregating and highlighting fact checks from authoritative sources like PolitiFact and Snopes.

Publishers wishing to feature in these fact-checked sections had to use the ClaimReview markup and adhere to the fact-check guidelines. Google emphasized even then that only publishers recognized algorithmically as authoritative sources of information would be eligible for inclusion. In 2017, Bing also started to feature the fact-check label on its SERP for articles containing the ClaimReview markup. 

Here is how the fact-check label now looks on Bing.

2019: Google’s Fact-Checked Articles: A Significant Reach and New Tools

The ClaimReview markup started to gain traction. As of late 2019, Google served over 11 million fact-checked articles per day, as highlighted on SEJ. This considerable content volume, including global search results and Google News in five countries, translates to approximately 4 billion impressions annually. Google’s efforts in this direction had already resulted, by 2019, in the creation of a publicly available search tool (the Fact Check Explorer) containing a database of over 40,000 fact checks. This tool became a significant resource for users seeking verified information. Along with the Fact Check Explorer, Google also introduced the Markup Tool, which lets publishers add a claim even without adding the markup to their pages. 

Here is an example of a claim made by Express Legal Funding that appears in the Fact Check Explorer. 

2021: One Page, One Claim. Google’s Eligibility Criteria for Fact-Check Rich Results

In July 2021, as spotted by Roger Montti on SEJ at that time, Google updated the eligibility criteria for pages to qualify for fact-check rich results using ClaimReview structured data. This change represented a fundamental shift in how Google displays fact checks in search results. As more data becomes available, Google’s commitment to clarity and simplicity impacts the eligibility criteria for the rich results. Previously, Google allowed multiple fact checks on a single page, meaning a webpage could cover multiple fact checks on different topics. 

Following the updated guideline, to be eligible for the fact check rich result, a page must only have one ClaimReview element. A page with multiple ClaimReview elements will no longer qualify for the rich result feature. The only exception is when the webpage hosts multiple fact-checks about the same topic from different reviewers. Also, in the same year, Google introduced support for MediaReview, a new taxonomy being developed by the fact-checking community to highlight if a video or image has been manipulated (more information on What is MediaReview).

The Fact-Check label on Google Image Search.

2022: Google introduces the Fact Check Tool API and continues investing in Fact-Checking

The new API tool allows users and developers to query the same Fact Check results via the Fact Check Explorer. You can call this API continuously to get the latest updates on a particular query or claim. 
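
For developers, the same data is reachable programmatically. Here is a minimal sketch, assuming the publicly documented claims:search endpoint of the Fact Check Tools API and a Google API key with that API enabled; the query string is just an example:

import requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # a key with the Fact Check Tools API enabled
ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def search_fact_checks(query: str, language: str = "en") -> list:
    """Return the fact checks Google has indexed for a given claim or query."""
    params = {"query": query, "languageCode": language, "key": API_KEY}
    response = requests.get(ENDPOINT, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("claims", [])

for claim in search_fact_checks("the James Webb Space Telescope took the first picture of an exoplanet"):
    for review in claim.get("claimReview", []):
        print(claim.get("text"), "→", review.get("textualRating"), review.get("url"))

Polling this endpoint on a schedule is the simplest way to “call this API continuously” for new fact checks on a topic you care about.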

More importantly, as YouTube fact-check panels begin to appear, both Google and YouTube commit $13.2 million to the International Fact-Checking Network for the Global Fact Check Fund to fight misinformation. Fact-checking becomes available on: 

  • Google Search (and GSE), 
  • Google News, 
  • Google Image Search, and
  • YouTube search results.

The “About this result” feature was introduced in late 2022 as part of the same initiative, to help users evaluate the context and helpfulness of a website.

2023: Fact-Checking Becomes Multimodal

In August 2023, Google introduced a beta version of its Image Search feature for approved testers. This feature allowed users to search for fact-checks related to a specific image. This advancement represented a significant step in Google’s efforts to combat misinformation, particularly in visual content. Google now also provides context and a timeline for images on the web, showing when they were first indexed by Google and the associated topics. We start to see (also on Google’s front-end interfaces) the interaction between topics and entities in the Knowledge Graph and fact-checking claims.

The entity Donald Trump is associated with a ClaimReview (and no, Michael Moore doesn’t support Trump’s 2024 election campaign).

Also, in 2023, Google added support for the ‘About this image’ feature to learn more about an image and its veracity. As part of the same update, the Fact Check Explorer became capable of displaying ClaimReview data behind image URLs.

Fact-checking in Google Search Generative Experience (SGE)

These new features have also been introduced as an integral part of Google’s SGE. We read on Google’s Blog about fact-checking images:

“For people who are opted-into Search Generative Experience (SGE) through Search Labs, you’ll now be able to see AI-generated descriptions of some sources, supported by information on high-quality sites that talk about that website. We’ll showcase links to these sites in the AI-generated description of the source.

These AI-generated descriptions of the source will show up in the “more about this page” section of About this result for some sources where there isn’t an existing overview from Wikipedia or the Google Knowledge Graph.”

Above is the SGE expansion of a fact-check article. The article acts as the primary source for the Claim Review, along with another piece from the same publisher.

ClaimReview Markup Explained

The markup allows us to share the review of a claim made by others. The key properties, based on Google’s guidelines, are:

  • claimReviewed: This is the core of the ClaimReview markup. It concisely describes the claim being evaluated. For example, a statement like “Beatrice Gamba is the Head of Innovation at WordLift.” (Big news by the way and congratulations to Bea and her team!).
  • Claim: It is the factually-oriented claim that could be the itemReviewed in a ClaimReview. The content of a claim can be summarized with the text property. Variations on well-known claims can have their common identity indicated via sameAs links and summarized with a name. It needs to be unambiguous.
  • itemReviewed: This property describes the manifestation of the claim (evidence where the Claim being reviewed appeared). It usually has its own set of nested properties:
    • @type: Typically “CreativeWork” such as a news article or blog post.
    • url: The url of the item reviewed.
    • author: The person or organization making the claim.
  • reviewRating: This property evaluates the claim. It includes several sub-properties:
    • @type: Always set to “Rating”.
    • ratingValue: A numerical score given to the claim, for example, “4”.
    • bestRating and worstRating: These define the rating scale (e.g., 1 to 5).
    • alternateName: A textual representation of the rating, such as “True,” “False,” or “Partially True”.
  • author: The entity responsible for the fact check. It usually has sub-properties like:
    • @type: Often “Organization” or “Person”.
    • name: The name of the organization or person conducting the fact check.
  • datePublished: The date when the fact-checking article or report was published.
  • url: The URL of the fact-checking article or report. 

Here is an example of how we can now automatically fact-check a claim such as “Beatrice Gamba is the Head of Innovation at WordLift.” using WordLift API (or directly Agent WordLift). The API will return the following snippet:

{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "claimReviewed": "Beatrice Gamba is the Head of Innovation at WordLift",
  "author": {
    "@type": "Organization",
    "name": "WordLift"
  },
  "datePublished": "2024-01-15",
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": "4",
    "alternateName": "Mostly true",
    "bestRating": "5",
    "worstRating": "1"
  },
  "url": "https://fact-check.wordlift.io/review/beatrice-gamba-title",
  "reviewBody": "Based on multiple sources, Beatrice Gamba does indeed work at WordLift. Her title is shown to be Head of Agency or SEO strategist in most records. However, one source stated her title as Head of Innovation. Therefore, the statement 'Beatrice Gamba is the Head of Innovation at WordLift' is mostly accurate, but her exact title may vary.",
  "itemReviewed": {
    "@type": "CreativeWork",
    "url": [
      "https://wordlift.io/blog/en/entity/beatrice-gamba/",
      "https://it.linkedin.com/in/beatrice-g",
"https://wordlift.io/entity/beatrice-gamba/",
      "https://theorg.com/org/wordlift/org-chart/beatrice-gamba",
      "https://www.knowledgegraph.tech/speakers/beatrice-gamba/"
    ]
  }
}

You can review it directly on the Rich Result Testing tool here. Because the news was still fresh when the check was done and had only propagated on WordLift’s website, the statement appears as “Mostly True,” as highlighted in the reviewBody above.

If you are interested in the original definition of the markup, I suggest reading Dan Brickley’s original description of the key concepts behind fact-checking on the schema.org GitHub.

Fact Check Eligibility Criteria

Here is how the rich results look on Google’s SERP when the ClaimReview markup is correctly applied and indexed. 

ClaimReview Rich Result on Google Search.

When the intent is specific, this might trigger a featured snippet (as in the example below).

Interestingly enough, we also expect premium visibility on Google’s Search Generative Experience. Here, the page containing our client’s claim is presented as a primary source of information in the generative response associated with the fact check.

What steps should we take to be featured? What are the essential requirements for eligibility to appear on Google’s ClaimReview rich result? Here is a brief summary of the criteria. 

Structured Data Requirements

  • Your site must have multiple pages marked with ClaimReview structured data.
  • Structured data must accurately reflect the content on the page (e.g., both structured data and content should agree on whether a claim is true or false).

Content and Website Standards

  • Compliance with standards for accountability, transparency, readability, and avoiding site misrepresentation as per Google News General Guidelines.
  • Presence of a corrections policy or a mechanism for users to report errors.
  • Political entities such as campaigns, parties, or elected officials are not eligible for this feature.
  • Clear identification of claims and checks in the article body, making it easy for readers to understand what was checked and the conclusions reached.
  • The specific claim being assessed must be clearly attributed to a distinct origin (e.g., another website, public statement, social media) separate from your website.
  • Fact check analysis must be transparent and traceable, with citations and references to primary sources.

Technical Guidelines

  • A page is eligible for a single fact check rich result and must contain only one ClaimReview element.
  • The page hosting ClaimReview must include at least a brief summary of the fact check and its evaluation, if not the full text.
  • A specific ClaimReview must only appear on one page of your site. Do not duplicate the same fact check on multiple pages, except for variations of the same page (e.g., mobile and desktop versions).
  • If aggregating fact-check articles, ensure all articles meet these criteria and provide an open, publicly available list of all fact-check websites you aggregate.

Structuring Claims in a Knowledge Graph

In examples like this one, in the context of linked data and knowledge graphs, we can also reference Beatrice Gamba in a more subject-oriented way, which can be particularly useful if the fact check is directly related to her as a person or to her role. We do this by leveraging the schema:about property. We also use the ‘@id’ property to uniquely identify Beatrice (an entity of type Person) within the JSON-LD structured data. It specifies the unique identifier of an entity, providing a clear reference to an external resource or a node in the knowledge graph. 

Here is the markup now with the addition of the schema:about property and here is how it renders on the Google Structured Data Testing Tool:

{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "claimReviewed": "Beatrice Gamba is the Head of Innovation at WordLift",
  "author": {
    "@type": "Organization",
    "name": "WordLift"
  },
  "datePublished": "2024-01-15",
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": "4",
    "alternateName": "Mostly true",
    "bestRating": "5",
    "worstRating": "1"
  },
  "url": "https://fact-check.wordlift.io/review/beatrice-gamba-title",
  "about": { 
       "@type": "Person", 
       "@id": "http://data.wordlift.io/wl0216/entity/b-23977", 
       "name": "Beatrice Gamba" },
  "reviewBody": "Based on multiple sources, Beatrice Gamba does indeed work at WordLift. Her title is shown to be Head of Agency or SEO strategist in most records. However, one source stated her title as Head of Innovation. Therefore, the statement 'Beatrice Gamba is the Head of Innovation at WordLift' is mostly accurate, but her exact title may vary.",
  "itemReviewed": {
    "@type": "CreativeWork",
    "url": [
      "https://wordlift.io/blog/en/entity/beatrice-gamba/",
      "https://it.linkedin.com/in/beatrice-g",
      "https://wordlift.io/entity/beatrice-gamba/",
      "https://theorg.com/org/wordlift/org-chart/beatrice-gamba",
      "https://www.knowledgegraph.tech/speakers/beatrice-gamba/"
    ]
  }
}

Adding the “about” property in ClaimReview markup, especially when combined with an “@id” attribute that links to a unique entity in a knowledge graph, can significantly enhance data querying and retrieval, particularly with technologies like GraphQL. We can, for example, take an entity (whether it is a Person, an Organization, or a Product) and collect all the available ClaimReviews about it with a single query, as sketched below.
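
As a concrete, hedged sketch of what this enables, the snippet below uses rdflib and a local copy of the JSON-LD above (the file name is a placeholder, not a WordLift endpoint), together with the entity IRI from the markup:

from rdflib import Graph

graph = Graph()
# Load the ClaimReview markup shown above; rdflib parses JSON-LD natively.
graph.parse("claim-review-beatrice.jsonld", format="json-ld")

# Collect every ClaimReview "about" a given entity, with its rating and review body.
query = """
PREFIX schema: <http://schema.org/>
SELECT ?claim ?rating ?body WHERE {
  ?review a schema:ClaimReview ;
          schema:about <http://data.wordlift.io/wl0216/entity/b-23977> ;
          schema:claimReviewed ?claim ;
          schema:reviewBody ?body ;
          schema:reviewRating/schema:alternateName ?rating .
}
"""

for claim, rating, body in graph.query(query):
    print(f"{claim} → {rating}\n{body}\n")

The same pattern carries over to GraphQL or a remote SPARQL endpoint once the graph lives in a triple store: the “@id” is what turns loose claims into queryable nodes.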

Fact-Checking for E-commerce Websites

Fact-checking, while prominently used in the context of news and information, is not limited to these areas alone, and it can also be applied to products. This is particularly relevant when building Product Knowledge Graphs. We can foresee value on multiple sides:

  • Products often come with various claims regarding their effectiveness, ingredients, environmental impact, health benefits, and more. Fact-checking these claims is a crucial way to ensure that they are accurate and not misleading to consumers.
  • Conversely, there are instances where media reports about products may contain misleading information with limited substantiation. In such cases, fact-checking becomes essential to validate the veracity of these media claims.

AI Fact-Checking for Product Descriptions

At WordLift, while seeding product data for creating Product Knowledge Graphs, we’re using fact-checking to verify the accuracy of product descriptions and specifications listed on our clients’ e-commerce platforms. Here are a couple of practical examples: 

Example 1

Product description: Ready for all those travel adventures you have planned? Don’t forget to pack a pair of Gucci shoes. Crafted from ECONYL® – a nylon fabric made with recycled yarn – this GG Supreme canvas pair not only adds style to your journey but also lets you extend environmentally friendly practices to your travels. Step forward with your best foot in these Italian-made shoes.

Claim: Women’s Gucci Off The Grid sneaker uses recycled materials.

AI fact-checking:

{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "claimReviewed": "Women's Gucci Off The Grid sneaker is made from recycled materials",
  "author": {
    "@type": "Organization",
    "name": "WordLift"
  },
  "datePublished": "2024-01-18",
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": "5",
    "alternateName": "True",
    "bestRating": "5",
    "worstRating": "1"
  },
  "url": "https://fact-check.wordlift.io/review/womens-gucci-off-the-grid-sneaker-recycled-materials",
  "reviewBody": "Gucci has indeed launched a product line called 'Off The Grid' which includes the Women's Gucci Off The Grid sneaker, and it is promoted as being made from recycled materials. This is confirmed by multiple reliable sources. The company states that the product line uses recycled, organic, bio-based, and sustainably sourced materials.",
  "itemReviewed": {
    "@type": "CreativeWork",
    "url": [
      "https://wwd.com/feature/gucci-launches-off-the-grid-sustainable-collection-1203652671/#!",
      "https://www.gucci.com/us/en/st/capsule/circular-line-off-the-grid",
      "https://the-ethos.co/gucci-sustainable-run-sneakers/",
      "https://www.gucci.com/us/en/st/stories/article/off-the-grid-collection-shoppable",
      "https://www.elle.com/fashion/a33014463/gucci-off-the-grid-sustainable-collection/"
    ]
  }
}

Example 2

Product description: A 16-hour-wear foundation boosted by the power of radiance, combining medium-to-full coverage with a natural finish. Long wear goes hand in hand with lightness. High coverage and an ultra-natural result. Its breathable, resistant formula is infused with raspberry, apple and watermelon extracts to help smooth and improve the appearance of the skin instantly and over time, revealing a luminous, radiant complexion. It wears better, for longer, and looks ever more beautiful as the hours pass. The special “perfect match” technology ensures a shade as close as possible to your skin tone.

Claim: NARS Natural Radiant Longwear Foundation Mali contains raspberry, apple and watermelon extracts.

AI fact-checking:

{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "claimReviewed": "NARS Natural Radiant Longwear Foundation Mali ingredients",
  "author": {
    "@type": "Organization",
    "name": "WordLift"
  },
  "datePublished": "2024-01-19",
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": "5",
    "alternateName": "True",
    "bestRating": "5",
    "worstRating": "1"
  },
  "url": "https://fact-check.wordlift.io/review/nars-natural-radiant-longwear-foundation-mali-ingredients",
  "reviewBody": "The ingredients of NARS Natural Radiant Longwear Foundation in the shade Mali include Dimethicone Crosspolymer, Bis-Butyldimethicone, Polyglyceryl-3, Stearic Acid, among others. It is indeed a liquid foundation with a natural finish and full coverage. Thus, the claim about its ingredients is correct.",
  "itemReviewed": {
    "@type": "CreativeWork",
    "url": [
      "https://www.narscosmetics.com/USA/natural-radiant-longwear-foundation/999NAC0000065.html",
      "https://www.temptalia.com/product/nars-natural-radiant-longwear-foundation/mali/",
      "https://incidecoder.com/products/nars-natural-radiant-longwear-foundation",
      "https://www.narscosmetics.co.uk/en/mali-natural-radiant-longwear-foundation/0607845066323.html",
      "https://www.skincarisma.com/products/nars/natural-radiant-longwear-foundation/ingredient_list"
    ]
  }
}

We use AI to analyze product information on an attribute-by-attribute level and create a single source of truth for our clients’ organizations. This allows them to control any generative AI workflow (product descriptions, chatbots, content recommendations, and more).

Semi-Automating Fact-Checking with an AI Agent [code available]

Open In Colab

In this section I will share the code to set up your own AI Agent for fact-checking 🎉. This section is for Python developers only (sorry); feel free to jump right to the end of it to read the conclusions, or stick with me to understand the workflow.

Here is the flow implemented in the Colab, apart from the SPARQL/GraphQL tool.

The code creates an AI agent for fact-checking, leveraging OpenAI’s function calling. The notebook begins by installing essential libraries such as llama-index (I love LlamaIndex), llama-hub, and tavily-python (an interesting gem for this type of project), which are integral to building the agent’s capabilities. You will need:

  • An OpenAI key 
  • A Tavily API key (here)

Following this, it imports various modules necessary for JSON data handling, defining types, and interacting with OpenAI’s API, among other functionalities. This setup is crucial for the agent to process and verify information efficiently. 

The agent is designed to receive queries in the form of claims, process them, and return fact-checked information. 

By using advanced LLM models, the agent can analyze text data, cross-reference information with reliable sources, and provide validated answers. We are planning to extend the capabilities of the agent by giving it access (when available) to data inside the knowledge base. We can chain another tool and run a SPARQL query on the RDF graph to bring first-party information into the evaluation process, as sketched below. 
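
To give a flavour of the workflow without reproducing the whole notebook, here is a condensed, hedged sketch of the agent wiring. It assumes the llama-index 0.9.x import layout and the tavily-python client; the keys, tool and prompt wording are placeholders, and the actual Colab adds the ClaimReview JSON-LD assembly on top:

import os
from llama_index.llms import OpenAI
from llama_index.agent import OpenAIAgent
from llama_index.tools import FunctionTool
from tavily import TavilyClient

os.environ["OPENAI_API_KEY"] = "sk-..."    # your OpenAI key
tavily = TavilyClient(api_key="tvly-...")  # your Tavily key

def web_evidence(claim: str) -> str:
    """Search the web for evidence supporting or contradicting a claim."""
    results = tavily.search(query=claim, search_depth="advanced", max_results=5)
    return "\n".join(f"{r['url']}: {r['content']}" for r in results["results"])

evidence_tool = FunctionTool.from_defaults(fn=web_evidence)

agent = OpenAIAgent.from_tools(
    [evidence_tool],
    llm=OpenAI(model="gpt-4"),
    system_prompt=(
        "You are a fact-checking assistant. Given a claim, gather evidence with the "
        "available tool, rate the claim on a 1-5 scale and answer as schema.org "
        "ClaimReview JSON-LD, citing every source URL you relied on."
    ),
)

print(agent.chat("Beatrice Gamba is the Head of Innovation at WordLift"))

Swapping the web-search tool for (or adding) a SPARQL tool over your own knowledge graph is what brings first-party data into the loop.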

Epistemology, Ethics, and SEO

Epistemology, the philosophical study of knowledge, intersects with SEO in the quest for understanding and optimizing the acquisition and dissemination of information on the web. 

When doing SEO, we streamline the workflow of a search engine to increase the visibility of a piece of content; by doing so, we act on what people see and understand of the world. Ethically, we want to choose where and how to direct that power. We want to keep things balanced and society as healthy as possible. Fact-checking powered by AI assists in the pursuit of truth and understanding. It can also limit the use of bad marketing techniques and the publishing of low-quality content.

As professionals in the Search Engine Optimization (SEO) field, we are always adapting to the changing landscape of information. We focus on understanding the context and relationships between entities in the real world. ClaimReview, especially when associated with entities, helps search engines determine the truthfulness of information. We can also use fact-checking to assess the accuracy of statements we promote.

In the example above about Beatrice becoming Head of Innovation at WordLift, we can see how this methodology helps us review the extent to which Google has perceived this change in its top search results. 

Consider another scenario: with the increase in AI-generated content, it becomes imperative to bolster our infrastructure to evaluate the truthfulness of such information. Many may recall the incident following the debut of Google’s Bard through a promotional video, which resulted in a staggering $100 billion market value loss for Google. This drop occurred after researchers on Twitter pointed out that Bard had disseminated incorrect information, claiming that the James Webb Space Telescope (JWST) was the first to photograph an exoplanet. Here’s how Agent WordLift, an AI SEO Agent, can identify the erroneous statement about the JWST that tripped up Bard.

Agent WordLift (AI SEO Agent) and its new ability to do fact-checking. We can now validate claims and reduce the risk of hallucinations.

Conclusions And Future Research Work

The landscape of fact-checking is rapidly changing with the advent of AI and the proliferation of misinformation. The need for robust, scalable, and reliable fact-checking methods is more crucial than ever in journalism and across various digital platforms. 

Agent WordLift (AI SEO Agent) is in action with fact-checking.

Fact-checking is not just about ascertaining truth; it’s increasingly intertwined with SEO. The credibility and authority of content (E-E-A-T), vital for SEO success, can be enhanced through meticulous fact-checking in various ways. Google and Bing continue investing in fact-checking tools and structured data to improve the quality of information presented in search results across multiple platforms like Google Search, Google News, Google Images, Google Generative Search Experience, and YouTube.

AI plays a crucial role in automating and improving the accuracy of fact-checking processes, and its applications extend to other areas such as SEO, where autonomous AI agents are beginning to make an impact.

With the help of AI Agents like the one presented in this article, we can structure claims in knowledge graphs to bolster content integrity and improve SEO. 

On the technical side, I envisage the future evolution of our tooling in this area as follows:

  • Firstly, enhancing the versatility of linking ClaimReviews to entities is a pivotal challenge. This involves refining how our tools identify and associate claims with relevant entities in a more dynamic and context-aware manner. 
  • Secondly, there’s a need for more sophisticated mechanisms to determine the trustworthiness of websites. This requires restricting and evaluating the list of sources when analyzing information. 
  • Another critical aspect is the integration of data from Knowledge Graphs. Our agent should be adept at extracting and comparing this data with web search results, enabling a more comprehensive and accurate verification process. 
  • Lastly, the aspect of multimodality cannot be overlooked. The ability to validate not only text but also images and other media files is essential in a digital landscape increasingly dominated by diverse content formats. 

Let’s raise the open web standards together and make the information in our knowledge graphs more trustworthy!   

References