Table of contents:
- The rise of AI content and the need for knowledge, common sense, and generative technologies
- A more cost-effective approach to Generative AI: augmenting LLMs with knowledge graphs
- The new regulatory frameworks on AI: more (secure) data and fewer parameters!
- Large Language Models as Reasoners: How we can use LLMs to enrich and expand Knowledge Graphs
- Combining LLMs and KGs: a couple of SEO use cases
- Conclusion: Unfolding the Blueprint of WordLift as a Generative AI SEO Platform
The Rise Of AI Content And The Need For Knowledge, Common Sense, And Generative Technologies
Returning from New York, where I attended the Knowledge Graph Conference, I had time to think introspectively about the recent developments in generative artificial intelligence, information extraction, and search.
Modern generative search engines are becoming a reality as Google rolls out a richer user experience that supercharges search with a dialogic experience, additional context, and sophisticated semantic personalization. The way we access and use information has changed since the introduction of ChatGPT, Bing Chat, Google Bard, and a superabundance of conversational agents powered by large language models.
At the same time, as computer scientist Yejin Choi recently voiced during her talk (Why AI Is Incredibly Smart and Shockingly Stupid | Yejin Choi | TED), “AI is a Goliath.” Today’s applications depend on third-party APIs that leverage massive transformer-based models trained on trillions of words, with huge investments and a considerable environmental impact (according to Bloomberg, training GPT-3 required 1.287 gigawatt-hours, the equivalent of the yearly electrical consumption of about 120 houses in the US).
As we progress, Google’s Search Generative Experience will mainly feature AI-generated content. Our company started automating and scaling content production for large brands during the Transformers era, which began in 2020. While we prioritize maintaining a good relationship between humans and technology, it’s evident that user expectations have evolved, and content creation has fundamentally changed already.
Additionally, there is a growing trend in the content industry toward creating interactive conversational applications that prioritize content quality and engagement rather than producing static content.
Achieving interactive quality content at scale requires deep integration between neural networks and knowledge representation systems. Yejin Choi suggests distilling symbolic knowledge to infuse language models with common sense and, like many others in the industry (unexpectedly including OpenAI’s CEO Sam Altman), promotes the adoption of smaller models that do not have to encapsulate the entire world’s knowledge.
It is also becoming evident that responsible AI systems cannot be developed by a limited number of AI labs worldwide with little scrutiny from the research community. Thomas Wolf from the HuggingFace team recently noted that pivotal changes in the AI sector had been accomplished thanks to continuous open knowledge sharing.
Generative AI is a powerful tool for good as long as we keep a broader community involved and invert the ongoing trend of building extreme-scale AI models that are difficult to inspect and in the hands of a few labs.
A More Cost-Effective Approach To Generative AI: Augmenting LLMs With Knowledge Graphs
Augmented data retrieval is a new approach to generative AI that combines the power of deep learning with the traditional methods of information extraction and retrieval. Using language models to understand the context of a user’s query in conjunction with semantic knowledge bases and neural search can provide more relevant and accurate results.
Computer scientist Zdenko “Denny” Vrandečić, co-founder of Wikidata and considered the godfather of Google’s Knowledge Graph, clearly demonstrated how differently ChatGPT, Google, and Wikidata perform on a simple query (he used “Who created the School of Athens?”).
There is a difference in computational resources and, therefore, in the cost of executing the query. As Denny explained, OpenAI and Google rely on the world’s fastest and most expensive hardware, while Wikidata runs on a “semi-abandoned” single server. From a technical point of view this is unsurprising: a model must push each token through many layers and hundreds of billions of parameters, generating one token after the other. By contrast, a query on a KG (even one as large as Wikidata) is a simple lookup: a logarithmic operation across 100+ million entities.
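To make the contrast concrete, here is a minimal sketch of the kind of Wikidata lookup Denny refers to. The SPARQL is illustrative: it resolves the painting by its English label and follows Wikidata's "creator" property (P170) in a single graph lookup, instead of generating an answer token by token (the actual HTTP call to the query.wikidata.org endpoint is omitted).

```python
# Illustrative SPARQL: answering "Who created the School of Athens?"
# as a graph lookup rather than an LLM completion.
query = """
SELECT ?creatorLabel WHERE {
  ?work rdfs:label "The School of Athens"@en ;
        wdt:P170 ?creator .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
# To run it for real, POST this string to https://query.wikidata.org/sparql
```

The whole operation touches a handful of index entries, which is why it runs comfortably on a single server.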
The high costs associated with generative AI are related to the training (from millions to billions of dollars) of the model and its use, making LLMs inaccessible for many and unfriendly to the environment. Smaller models like PaLM 2 (the small version), Alpaca, BloomZ, and Vicuna can still be very effective when coupled with well-structured knowledge and neural search.
The emergence of relatively small models opens a new opportunity for enterprises to lower the cost of fine-tuning and inference in production. It helps create a broader and safer AI ecosystem as we become less dependent on OpenAI and other prominent tech players.
In one of my latest experiments, I used Bard (based on PaLM 2) to analyze the semantic markup of a webpage. On the left, we see the analysis in a zero-shot mode without external knowledge, and on the right, we see the same model with data injected in the prompt (in context learning).
LLMs, by design, hallucinate: Bard alone (left side) made five factual mistakes in the completion. It only committed one minor inaccuracy (right side) when provided with the correct data. This simple experiment tells us that we possibly don’t need the large version of Bard; we can use a smaller (and less expensive) language model to “narrate” the same piece of information well.
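The setup behind the right-hand side of the experiment can be sketched as follows: facts retrieved from a knowledge graph are injected into the prompt so the model "narrates" data instead of recalling it from its weights. The schema markup and author name below are illustrative placeholders, not the page analyzed in the experiment.

```python
import json

def build_grounded_prompt(question: str, facts: dict) -> str:
    """Assemble a prompt that grounds the model on retrieved KG facts."""
    context = json.dumps(facts, indent=2)
    return (
        "Answer using ONLY the JSON-LD context below. "
        "If the context lacks the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Toy facts standing in for data extracted from the knowledge graph.
facts = {
    "@type": "Article",
    "headline": "Knowledge Graphs and LLMs",
    "author": {"@type": "Person", "name": "Andrea Volpini"},
}
prompt = build_grounded_prompt("Who is the author of this page?", facts)
```

Because the answer is constrained to the injected context, a smaller model can narrate it reliably.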
Clement Delangue, CEO and co-founder of HuggingFace, recently said: “More companies would be better served focusing on smaller, specific models that are cheaper to train and run.”
Extracting the right piece of content, of course, remains challenging: where does the information come from?
When creating semantically related links on e-commerce websites, we first query the knowledge graph to get all the candidates (semantic recommendations). We then use vectors to assess similarity and re-rank the options, and finally use a language model to write the best anchor text. While this is a relatively simple SEO task, we can immediately see the benefits of neuro-symbolic AI compared to throwing sensitive data at an external API.
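The three-step flow above can be sketched with toy data standing in for the knowledge graph, the embeddings, and the LLM call (which is stubbed here with a template; product names and vectors are invented for illustration):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# 1. Candidates returned by a (hypothetical) KG query for related products.
candidates = [
    {"name": "Trail running shoes", "vector": [0.9, 0.1, 0.2]},
    {"name": "Espresso machine",    "vector": [0.1, 0.9, 0.3]},
]
page_vector = [0.8, 0.2, 0.1]  # embedding of the current product page

# 2. Re-rank the candidates by vector similarity to the page.
ranked = sorted(candidates,
                key=lambda c: cosine(c["vector"], page_vector),
                reverse=True)

# 3. Hand the winner to a language model to phrase the anchor text
#    (stubbed with an f-string instead of a real completion call).
best = ranked[0]
anchor_text = f"Discover our {best['name'].lower()}"
```

Only the final phrasing step needs a language model; the retrieval and ranking stay inside the organization's own data.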
The New Regulatory Frameworks On AI: More (Secure) Data And Fewer Parameters!
AI technologies are transforming the way we access and use information. However, AI must be used responsibly and ethically if we want to create a safe and healthy environment.
That is why I support the need for a global AI regulatory framework. Such a framework would help ensure that AI is used in a way that benefits society and does not pose undue risks.
This brings attention back to the AI value chain, from the pile of data behind a model to the applications that use it. It is hard to improve safety when things happen behind closed doors. As much as new models push the boundaries of what is possible, the natural moat for every organization is the quality of its datasets and its governance structure (where the data comes from, and how it is produced, enriched, and validated).
An early overview of the proposals coming from both the US and the EU demonstrates the importance for any organization of keeping control over security measures, data, and the responsible use of AI technologies. In other words, I expect compliance with the upcoming regulations to mean less dependence on external APIs and stronger support for open-source technologies. Organizations with a semantic representation of their data will therefore have stronger foundations for developing their generative AI strategy and complying with the new rules.
Large Language Models As Reasoners: How We Can Use LLMs To Enrich And Expand Knowledge Graphs
Capabilities of LLMs as Reasoners
Large language models (LLMs) have been trained on massive datasets of text, code, and structured data. This training allows them to learn the statistical relationships between words and phrases, which in turn allows them to generate text, translate languages, write code, and answer questions of all kinds.
In recent years, LLMs have also been shown to be capable of reasoning. This means that they can be used as a backbone of intelligent agents that understand and apply information from multiple sources. For example, we are using LLMs already on our website to:
- Audit and analyze the structured data from the homepage of any website (here, the LLM interacts with our APIs for structured data extraction and analysis),
- Answer questions about web pages (AI question answering),
- Power a chat-based search for our documentation (docs.wordlift.io).
How LLMs Can Be Used To Extract And Organize Knowledge From Unstructured Data
Unstructured data is any type of data that does not have a predefined structure, such as text, images, and videos. This data type can be difficult to understand and process using traditional methods. However, LLMs can be used to extract and organize knowledge from unstructured data in a number of ways.
For example, in the AI question-answering tool, an LLM is used to extract and identify entities and relationships in web pages.
Using entities, we can sift through QAs from different web pages.
By using LLMs to extract and organize knowledge from unstructured data, we can enrich the data in a knowledge graph and bring additional insights to our automated SEO workflows. As noted by the brilliant Tony Seale, since GPT models are trained on a vast amount of structured data, they can be used to analyze content and turn it into structured data.
Here is a quick example of how things work behind the scenes: a simple question is automatically answered and turned into a JSON-LD object.
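A minimal sketch of that serialization step, using schema.org's Question and Answer types (the question and answer strings are illustrative placeholders, not the tool's actual output):

```python
import json

def qa_to_jsonld(question: str, answer: str) -> str:
    """Serialize an answered question as a schema.org JSON-LD object."""
    data = {
        "@context": "https://schema.org",
        "@type": "Question",
        "name": question,
        "acceptedAnswer": {"@type": "Answer", "text": answer},
    }
    return json.dumps(data, indent=2)

doc = qa_to_jsonld(
    "What is a knowledge graph?",
    "A graph-structured knowledge base of entities and their relationships.",
)
```

The resulting markup can be embedded in the page and fed back into the knowledge graph, closing the loop between unstructured and structured content.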
Combining LLMs And Knowledge Graphs: A Couple Of SEO Use Cases
In practical terms, we can construct a neuro-symbolic system by fine-tuning a language model with data from a knowledge graph or by guiding its predictions using in-context learning (data is extracted in real time from a knowledge graph and incorporated into the prompt).
Let’s dive deeper into these two scenarios. Content is a dynamic entity capable of catering to long-tail intents such as “What are the opening hours of the park in front of restaurant ABC?” or “What are the celiac options for breakfast at restaurant ABC?” while adjusting to different interfaces. We are poised to revolutionize traditional publishing workflows in a world governed by generative AI. However, if the content is spontaneously generated, how do we retain control over the tone of voice? How do we nurture meaningful interactions with our audience when the AI is the author? As Denny Vrandečić eloquently puts it, the answer is that “In a world of infinite content, knowledge becomes valuable.”
In layman’s terms, this implies that by employing semantically rich data, we can monitor and validate the predictions of large language models while ensuring consistency with our brand values. Google hasn’t stopped investing in its knowledge graph since introducing Bard and its generative AI Search Experience; quite the opposite.
Publishers, store owners, and digital marketers should do the same and ask themselves: where is the data to power our next generative application? How can we optimize our content to train a better model? What is the source of this data? How can we validate the generated content against our editorial guidelines?
Creating personalized content demands a wide range of data, starting with training data. To fine-tune a model, we need high-quality content and data points that can be utilized within a prompt. Each prompt should comprise a set of attributes and a completion that we can rely on.
In other scenarios, such as an e-commerce shopping assistant, we can leverage product metadata and frequently asked questions to provide the language model with the appropriate information for interacting with the end user. Whether we opt for fine-tuning, in-context feeding, or a blend of both, the true competitive advantage will not lie in the language model but in the data and its ontology (or shared vocabulary).
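The fine-tuning path can be sketched as follows: product metadata from the knowledge graph is turned into a prompt/completion pair, one line of a JSONL training set. The field names and record format here are illustrative assumptions, not WordLift's actual training schema.

```python
import json

def product_to_training_record(product: dict) -> dict:
    """Build one prompt/completion training pair from product metadata."""
    prompt = (
        "Write a product description.\n"
        + "\n".join(f"{k}: {v}" for k, v in product["attributes"].items())
        + "\nDescription:"
    )
    # Leading space in the completion is a common fine-tuning convention.
    return {"prompt": prompt, "completion": " " + product["description"]}

# Toy product standing in for an entity in the Product Knowledge Graph.
product = {
    "attributes": {"name": "Trail Shoe X", "color": "red", "material": "mesh"},
    "description": "A lightweight red trail shoe with a breathable mesh upper.",
}
record = product_to_training_record(product)
line = json.dumps(record)  # one JSONL line of the training set
```

The same attribute set can later be used at inference time to validate that the generated description does not contradict the graph.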
Creating product descriptions for product variants is a successful application of our neuro-symbolic approach to SEO. Data from the Product Knowledge Graph is utilized to fine-tune dedicated models and helps us validate the outcomes. Although we maintain a human-in-the-loop system to handle edge cases and continually refine the model, we’re paving the way for content teams worldwide, offering them an innovative tool to interact and connect with their users.
Editors now discuss training datasets and validation techniques that can be applied to both new and existing content at an unprecedented scale. There is no coming back, and it is fascinating. Yet, while the underlying technology is similar, it is not like using ChatGPT from the OpenAI website, simply because the brand owns the model and controls the data used across the entire workflow. It is about finding the correct prompt while dealing with hundreds of possible variations. I always share internally with our team the idea that the content editor (or the SEO) using our generation tool is like a deejay or an electronic music artist who constantly interacts with knobs, pedals, and plugins to find the right sound and craft the perfect experience for the audience.
No one has ever arrived at the prompt that will be used in the final application (or content) on the first attempt; we need a process and a strong understanding of the data behind it.
We are currently exploring various AI-driven experiences designed to assist news and media publishers and eCommerce shop owners. These experiences leverage data from a knowledge graph and employ LLMs with in-context transfer learning. This article serves as a practical demonstration of this innovative concept and offers a sneak peek into the future of agentive SEO in the era of generative AI.
As the author of this article, I invite you to interact with “AskMe,” a feature powered by the data in the knowledge graph integrated into this blog. Feel free to ask questions such as “What is this article about?” or “What are Andrea’s thoughts on structured data?” This development represents an initial stride toward empowering authors by placing them at the center of the creative process while maintaining complete control.
Conclusion: Unfolding the Blueprint of WordLift as a Generative AI SEO Platform
I usually take time to look at our roadmap as the end of the year approaches. AI is accelerating everything, including my schedule, and right after New York, I started to review our way forward. SEO in 2023 is something different, and it is tremendously exciting to create its future (or at least contribute to it).
WordLift will become an intelligent orchestrator for the company’s online presence. It builds a comprehensive Knowledge Graph, the pulsing heart of the platform. To enrich the data, the platform’s Data Collection & Integration Layer constantly assimilates and improves data from the company’s website, social media channels, and other sources (the product information management system, the CRM, and so on).
WordLift leverages a Generative AI Layer to create engaging, SEO-optimized content. We want to further extend its creativity to visuals (Image and Video AI subsystem), enhancing any multimedia asset and creating an immersive user experience. WordLift employs a Linked Data subsystem to market metadata to search engines, improving content visibility and user engagement directly on third-party channels. We are adding a new Chatbot AI subsystem to let users engage with their audience and offer real-time assistance to end customers.
Peering through the lens of the Data Analysis & Insights Layer, WordLift needs to provide clients with critical insights and actionable recommendations, effectively acting as an SEO consultant. We are already integrating data from the KG into reporting platforms like Microsoft Power BI and Google Looker Studio. A user-friendly interface (Dashboard) ensures that SEO teams can navigate smoothly through its functionalities. In the background, a Security and Compliance Layer will be added to keep your data safe and in line with upcoming AI regulations (are we watermarking the content? Are we fact-checking the information generated?). The platform also features a Neural Search Engine, serving as the website’s guide and helping users navigate and find content seamlessly. Thanks to content embedding, it translates existing content into a representation that an LLM can understand.
As an architect, it designs and expands the website’s structure through Ontology Import and Design. It has assimilated and extended the Dialogic Principles and Schema.org affordances developed by Teodora Petkova.
With WordLift, as I envision it, you’re not just building a website but creating an AI-powered, data-driven universe.