Transformer-based text generation
In recent years, interest in open-ended natural language generation (NLG) has grown steadily, beginning with the release of OpenAI’s famous GPT-2 model. I have been following the various approaches and network architectures with the same excitement I had when I used to play with my favourite Lima model trains. In this blog post, I will highlight some of the limitations of transformer-based text generation in the context of digital marketing tasks and present data-to-text as a possible solution to these pitfalls.
The blog post contains a link to the code and the instructions to train your model using Google’s T5 on generic data-to-text generation tasks that you can adapt to your own needs 🎉.
Table of contents:
- Transformer-based text generation
- What is NLG?
- Introducing GPT-Neo
- Limits and pitfalls of automated text generation
- Creating a simple model for data to text content generation using Google’s T5
- Lessons learned and future work
What is NLG?
Natural language generation (NLG) is a subfield of natural language processing, which is in turn a subfield of artificial intelligence research. Below I will share the code that automatically transforms data coming from a knowledge graph into plain-English content. NLG can be used, in conjunction with your editorial team, to write sentences, paragraphs or even entire articles for you.
Thanks to sizeable transformer-based language models and libraries like Transformers by HuggingFace, state-of-the-art content generation has become as simple as writing two lines of code. The example below has been composed using GPT-Neo, a set of transformer-based language models that have been designed around the GPT architecture.
You can play with the code I used for generating this example using the Google Colab notebook linked here. Here comes another example – using GPT-Neo – with an answer to my beloved question: “Is SEO dead in 2021?”
This approach of generating content is genuinely transformative for content creators, marketers, SEOs and has indeed generated tremendous hype.
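To make the "two lines of code" claim concrete, here is a minimal sketch of what such a call can look like with the Transformers `pipeline` API. It is not the code from the notebook: the checkpoint name `EleutherAI/gpt-neo-125M` (the smallest GPT-Neo model), the `max_length` value and the helper names `load_generator` and `complete` are assumptions chosen for illustration.

```python
def load_generator(model_name="EleutherAI/gpt-neo-125M"):
    """Build a text-generation pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so that `complete` can be used with any compatible callable.
    from transformers import pipeline
    return pipeline("text-generation", model=model_name)


def complete(prompt, generator, max_length=60):
    """Return the prompt followed by the model's continuation as one string."""
    # Sampling is enabled, so every call can produce a different continuation.
    outputs = generator(prompt, max_length=max_length, do_sample=True)
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]
```

Calling `complete("Is SEO dead in 2021?", load_generator())` should return the question followed by a sampled continuation; larger GPT-Neo checkpoints can be swapped in through `model_name`.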
I began working on writing meta descriptions using AI a while ago, as I realised that this is one of those somewhat tedious tasks that can be (semi) automated. Our content team logged tens of hours creating these text snippets for our precious VIP clients. No kidding. Cutting down that effort by 60% has been an outstanding achievement.
Limits and pitfalls of automated text generation
At the same time, and despite the hype, even advanced model architectures such as OpenAI’s GPT-3 have severe limitations that need to be taken into account when accomplishing even simple SEO tasks.
1. The proliferation of fake news and disinformation
One of the risks is that even advanced language models have no understanding of the content that they produce. Disinformation is likely to get a boost as it becomes straightforward to produce false or misleading content that sounds credible (because of its style and choice of words).
2. Unwanted prejudices and biases
Another highly critical aspect is ethics. GPT-3, like most language models, is trained on human texts from various sources, and unfortunately, this also means that it reflects some of humanity’s worst tendencies.
3. Inability to count and do math
Language models work on statistical patterns and can predict very well what the next word or sentence will be. Unfortunately, this doesn’t mean that they can truly understand the question or the answer that they can provide. This also means that a language model cannot count or do any math operations that even a five-year-old would be able to do.
If I prompt GPT-3 with a simple mathematical operation such as (100*500)-2, it gets the answer wrong and goes off topic: it starts blathering about the power consumption of the light inside a fridge 😟. Google (or any other search engine), by contrast, will answer in the blink of an eye.
While a language model shall not be compared to a search engine, it is clear that GPT-3’s large brain will not pass the 1st Grade Math test.
In this other example, we see T5 brilliantly summarising web pages to answer a question about Modern SEO in 2020. Unfortunately, while it talks about 17 things to keep in mind (quoting the awesome Britney Muller), the list only includes 13 factors.
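A mismatch like this one (17 announced, 13 listed) is the kind of error that a small deterministic check can flag before an editor even reads the text. The sketch below is only a heuristic: the function name `claimed_vs_listed` and both regular expressions are assumptions, and they will miss counts written as words or lists formatted differently.

```python
import re


def claimed_vs_listed(text):
    """Compare the number announced in the text with the list items that follow."""
    # First integer followed by a plural noun, e.g. "17 things" (rough heuristic).
    claim = re.search(r"\b(\d+)\s+\w+s\b", text)
    # Lines starting with "1.", "2)", "-" or "*" are counted as list items.
    items = re.findall(r"^[ \t]*(?:\d+[.)]|[-*])[ \t]+\S", text, flags=re.MULTILINE)
    claimed = int(claim.group(1)) if claim else None
    return claimed, len(items)


def is_count_consistent(text):
    """True when no count is announced or when it matches the number of items."""
    claimed, listed = claimed_vs_listed(text)
    return claimed is None or claimed == listed
```

A generated summary for which `is_count_consistent` returns `False` can be routed to a human reviewer instead of being published as is.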
4. Limited generalization abilities
When we fine-tune a model, even language models like T5 or GPT-3 trained on a massive amount of content, we specialize it for a given use case (e.g. generating introductory text about companies in our database or product descriptions for our beautiful silk ties). In other words, the model that we obtain will only be valuable within that context and will not perform optimally when used on other knowledge domains.
In the example below, we can try to decrease the number of epochs to increase the generalizability of our model (in machine learning, an epoch is one complete pass of the training algorithm over the entire training dataset). Decreasing the number of epochs might harm the quality of the generated text, though.
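To see what changing the number of epochs means in practice, the short sketch below counts how many weight updates a training run performs. The function name `optimizer_steps` and the figures in the usage note are hypothetical, and the calculation assumes one update per batch (no gradient accumulation).

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Number of weight updates when every epoch is one full pass over the data."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs
```

With a made-up dataset of 1,000 examples and a batch size of 8, going from 4 epochs to 2 halves the number of updates from 500 to 250, which is why fewer epochs leave the model closer to its pre-trained, more general state.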
Creating a simple model for data to text content generation using Google’s T5
When working on SEO with automatically fabricated texts, we need to be even more intelligent and critical than usual. We want to create workflows where we collaboratively work with machines to create tangible and sustainable business value. This requires controlling content quality and semantic accuracy while limiting unwanted bias and prejudices.
Due to the existing limitations of the technology, we cannot (yet) let GPT or other transformer-based language models run wild. We need a human-in-the-loop approach.
For this reason, and as deep learning is becoming more accessible, I began experimenting with data-to-text generation approaches. Using a text-to-text transfer transformer like Google’s T5, we will now fine-tune a model that links structured information (e.g. metadata available in a knowledge graph about hospitals in Texas or product details for a merchant in New Zealand) with snippets of text.
Jump directly to the code: T5 Data-to-Text
In a machine learning project, we use data for training and fine-tuning to convey the “rules” of our system and to control the outcome.
The project in Colab is self-explanatory but I will highlight some passages in the following paragraphs, and I encourage you to make a copy of the code and use it for your projects by replacing the training data. Mathew Alexander did the original work, and all the credit goes to him.
Let’s get started: the dataset (WebNLG Challenge)
In the code I provided, we will use the data from the WebNLG Challenge 2020 as a starting point.
The dataset comprises RDF triples (facts extracted from DBpedia) and their related textual description. In the Colab we will first download the dataset and then create a Pandas data frame with a prefix (webNLG), the input_text (these are the facts separated by “|”), and a target_text (this is the text that we expect to produce for that input_text).
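As a rough sketch of the row structure just described, the helper below turns a list of triples into the three fields. It is not the notebook's code: the function name `triples_to_row`, the use of " | " inside a fact and " && " between facts, and the sample triple (written in the style of the dataset) are assumptions of this sketch.

```python
def triples_to_row(triples, target_text, prefix="webNLG"):
    """Turn a list of (subject, predicate, object) facts into one training row."""
    # Assumption: " | " separates the parts of a fact and " && " separates facts.
    facts = " && ".join(" | ".join(triple) for triple in triples)
    return {"prefix": prefix, "input_text": facts, "target_text": target_text}


# Illustrative example: one fact and the sentence we expect the model to produce.
row = triples_to_row(
    [("Aarhus_Airport", "cityServed", "Aarhus, Denmark")],
    "Aarhus Airport serves the city of Aarhus, Denmark.",
)
```

A list of such rows can be passed to `pandas.DataFrame(rows)` to obtain a data frame with the `prefix`, `input_text` and `target_text` columns.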
In the English corpus (there is also a Russian one that we will not consider), the data-text pairs come from 16 different DBpedia categories (or entity types): Airport, Astronaut, Building, City, ComicsCharacter, Food, Monument, SportsTeam, University, WrittenWork, Athlete, Artist, CelestialBody, MeanOfTransportation, Politician and Company.
This is important because it defines the perimeter of the content that we will be able to generate. Also, we shall keep in mind that we’re training a model using text from Wikipedia; therefore the tone of voice will be the one used by the encyclopedia (not necessarily a good fit for your website).
Fine-tuning the model
Here I mount Google Drive, which will be used to store the data and, once the training is complete, both the model and the tokenizer. This requires you to authenticate and let Colab access Google Drive with your credentials. We will use Google T5-base as a starting point, and I have set 4 epochs for training.
The training is done using PyTorch and, on a GPU machine, it will only take 50 minutes or so. The optimizer is Adafactor, and here you can find some helpful tips to experiment with the parameters.
Time to automatically generate text 🎉
Once the model is trained, it will be stored in Google Drive, and you will be able to load it directly from there the second time. Cell 4.1 of the Colab (as explained in the code) shall be run only if you have created the model in a previous session.
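For readers who want to see the shape of the reload-and-generate step outside the notebook, here is a hedged sketch. It assumes the model and tokenizer were written to a folder with `save_pretrained()` (the Colab may store them differently), and the helper names, the `webNLG` prefix format, `max_length` and `num_beams` values are illustrative choices rather than the notebook's settings.

```python
def load_model(model_dir):
    """Load a fine-tuned T5 model and its tokenizer from a local folder."""
    # Lazy import: transformers (with torch and sentencepiece) is only needed here.
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    tokenizer = T5Tokenizer.from_pretrained(model_dir)
    model = T5ForConditionalGeneration.from_pretrained(model_dir)
    return model, tokenizer


def facts_to_text(facts, model, tokenizer, prefix="webNLG", max_length=64):
    """Generate a sentence for a string of '|'-separated facts."""
    # The task prefix must match the one used in the training data.
    encoded = tokenizer(f"{prefix}: {facts}", return_tensors="pt")
    output_ids = model.generate(
        encoded["input_ids"], max_length=max_length, num_beams=4
    )
    # Decode the best beam and drop padding and end-of-sequence tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

After `model, tokenizer = load_model(path_to_your_drive_folder)`, a call such as `facts_to_text("Aarhus_Airport | cityServed | Aarhus, Denmark", model, tokenizer)` returns the generated sentence as a plain string.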
Here we see the first example, on the SEMRush CEO.
Lessons learned and future work
The model is far from perfect, and clearly, we would need a less generic dataset to obtain results that we can actually use for SEO purposes. Nevertheless, as simple as this implementation is, it helps us understand that we can “control” a language model and how the content is generated as long as we have the correct data to train it on. We also need to focus on automatically validating the generated text to reduce the risk of hallucinations (false statements and errors). Once again, if we have data organised in a knowledge graph, we can use this data to check the generated text for semantic accuracy, mistakes and omissions.
In general, AI is a real game-changer in content marketing as long as we understand its limits. We need to create collaborative systems where humans are focused on high-level thinking (data preparation, training and validation) and AI does the heavy lifting (producing content at scale).
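One simple way to picture validation against the knowledge graph is to check that every fact given to the model actually shows up in the output. The sketch below is a deliberately naive version of that idea: the function name `missing_facts` and the made-up Acme triples are hypothetical, and plain substring matching will not catch paraphrases or statements the model invented on its own.

```python
def missing_facts(triples, generated_text):
    """Return the triples whose object value does not appear in the text."""
    text = generated_text.casefold()
    missing = []
    for subject, predicate, obj in triples:
        # Knowledge-graph labels often use underscores; compare on plain words.
        value = obj.replace("_", " ").casefold()
        if value not in text:
            missing.append((subject, predicate, obj))
    return missing


# Made-up facts and output, for illustration only.
facts = [
    ("Acme_Corp", "headquarters", "Rome"),
    ("Acme_Corp", "foundingYear", "1999"),
]
omitted = missing_facts(facts, "Acme Corp is a company based in Rome.")
```

Here `omitted` contains the founding-year triple, which signals an omission that an editor (or a second generation attempt) should fix before the text goes live.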
What SEO tasks can we automate using automatic content generation?
We are currently working on the following areas:
- Article first draft: writing content pieces that use text summarisation to provide editors with an initial text on a given topic.
- FAQ generation: finding the relevant questions for a given query and attempting to write the answer to these questions.
- Product description: writing targeted descriptions for products or categories of products for e-commerce websites.
- Introductory paragraphs: writing an introduction for category pages.
Are you ready to automate content creation using Natural Language Generation?
Get in contact with our SEO management service team now!
Credits and special thanks
None of this would have happened without the continuous support from our community of clients and partners experimenting with WordLift to automate their SEO. A big thank you also goes to:
- Mathew Alexander and his first blog post on data-to-text generation
- T5 by Google Research Team made super easy to use by @HuggingFace 🤗
- Luciano Floridi & Massimo Chiriatti for writing GPT-3: Its Nature, Scope, Limits, and Consequences, shedding light on the limitations of GPT-3
- The fantastic team working on WordLift Next Generation, the R&D project that we are conducting with the financial support received from Eurostars H2020, along with Redlink GmbH, the Semantic Technology Institute at the Department of Computer Science of the University of Innsbruck, and SalzburgerLand Tourismus GmbH
- Teodora Petkova and the M.A. Program in Content Strategy at FH JOANNEUM University of Applied Sciences Graz for being part of this experiment (@ContentGraz) 🤩
- Paolo Benanti who opened my eyes with his long-term thinking about Algor-ethics: Artificial intelligence and ethical reflection
- My friend Hamlet Batista who would have explained all of this in a much simpler way and the amazing team at Ranksense
- OpenAI, creators of GPT-3, for releasing CLIP and DALL-E (the encoder and decoder parts, specifically) https://github.com/openai/DALL-E/ that I used to generate the featured image of this blog post “a self-writing book”
Quick note: this blog post has been digitally processed and humanly reviewed but some parts of it have been produced using T5 and GPT-3.