In this article, I will share my findings while attempting to use neural networks to describe the content of images. Images greatly contribute to a website’s SEO and improve the overall user experience. Fully optimizing images is about helping users, and search engines, better understand the content of an article.
The SEO community has always been keen on recommending that publishers invest in visual elements, and this has become even more important in 2019 as Google keeps revamping Google Image Search by adding new filters and functionalities.
There are several aspects that Google mentions in its list of best practices for images but the work I’ve been focusing on, for this article, is about providing alt text and captions in a semi-automated way. Alt text and captions, in general, improve accessibility for people that use screen-readers or have limited connectivity and help search engines understand what the content of an article is about.
“Google Images and Video search is often overlooked, but they have massive potential.”
“We simply know that media search is way too ignored for what it’s capable doing for publishers so we’re throwing more engineers at it as well as more outreach.”
– Gary Illyes, Google’s Chief of Sunshine and Happiness & trends analyst
Let’s start with the basics of image SEO with this historical video from Matt Cutts who, back in 2007, explained to webmasters worldwide the importance of descriptive alt text in images.
Agentive SEO: AI that works for webmasters…sort of
The work we do at WordLift with our partner WooRank aims at building agentive technologies for digital marketers. I had the pleasure of meeting Christopher Noessel in San Francisco and learned from him the principles of agentive technology (Chris has written a terrific book that I recommend reading called Designing Agentive Technology). One of the most important aspects of designing agentive tech is to focus on efficient workflows that augment human intelligence with the power of machines, taking into account the strengths and limitations of today’s AI.
The workflow to enrich image metadata in WordPress
In this experiment we proceed as follows:
- we start by downloading the XML export feed for media files using the WordPress Export tool
- we send a request to the Microsoft Vision APIs
- we store the results in a CSV file that we can later use to check and validate the outcome of the analysis with Google Sheets (or Excel) using the power of our natural intelligence
- we add the descriptions back into the CMS with an importer (I have not developed this part yet, but there are already plugins that import data stored in CSV files into the WordPress database).
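The first step of the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the experiment: it assumes the export is a standard WordPress WXR file in which media items carry a `wp:post_type` of `attachment` and a `wp:attachment_url`. The namespace version, the function name `list_attachments` and the sample data are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordPress export (WXR) files. The version suffix is an
# assumption and may differ in your export (check the root element).
WP_NS = {"wp": "http://wordpress.org/export/1.2/"}


def list_attachments(wxr_xml):
    """Return (title, url) pairs for every media attachment in a WXR export."""
    root = ET.fromstring(wxr_xml)
    attachments = []
    for item in root.iter("item"):
        post_type = item.findtext("wp:post_type", default="", namespaces=WP_NS)
        if post_type != "attachment":
            continue  # skip posts, pages and other content types
        title = item.findtext("title", default="")
        url = item.findtext("wp:attachment_url", default="", namespaces=WP_NS)
        if url:
            attachments.append((title, url.strip()))
    return attachments


# Tiny hand-written sample standing in for a real export file.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>giraffes</title>
      <wp:post_type>attachment</wp:post_type>
      <wp:attachment_url>https://example.com/uploads/giraffes.jpg</wp:attachment_url>
    </item>
    <item>
      <title>Hello world</title>
      <wp:post_type>post</wp:post_type>
    </item>
  </channel>
</rss>"""

for title, url in list_attachments(SAMPLE):
    print(title, url)
```

Each URL returned this way can then be sent to the image description service, one request per image.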
Purely relying on machines is not really an option to improve your image SEO and I will show you why. Nevertheless, a strong-willed editor with the code described in this article can curate hundreds of images in a few hours.
Keep on reading if you are interested in ML experiments, or simply jump to the end of the article to get the code I finally used to enrich the media library of one of the clients of our SEO managed services.
Get Comfortable with experiments
Machine learning requires a new mindset, very different from the one we have in traditional programming. You tend to write less code and to focus most of your attention on the data being used for training the model but … in the end, will the model you are building be usable in a real-world environment? Can you really rely on it to improve your search rankings? Hard to say from the start.
The advantages of setting up your own pipeline for training an ML model are obvious – especially if, like us, you are building a product that thousands of people will use:
- You are totally independent of external providers (this usually means you keep control of the costs)
- You can fine-tune the data as well as the model for the needs of your users
Armed with passion and enthusiasm, I set up a model for image captioning roughly following the architecture outlined in the article “Automatic Image Captioning using Deep Learning (CNN and LSTM) in PyTorch”, which is based on the results published in the “Show and Tell: A Neural Image Caption Generator” paper by Vinyals et al., 2014.
The implementation is based on a combination of two different networks:
- A pre-trained ResNet-152 model that acts as an encoder. It transforms the image into a vector of features that is sent to the decoder
- A decoder that uses an LSTM network (LSTM stands for long short-term memory, a type of recurrent neural network) to compose the phrase that describes the feature vector received from the encoder. LSTMs, I learned along the way, are used by Google and Alexa for speech recognition; Google also uses them in the Google Assistant and in Google Translate.
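To make the decoder's job more concrete, here is a toy sketch of how a caption is composed one word at a time. It is not the PyTorch model: the lookup table `TOY_SCORES` is a made-up stand-in for the trained LSTM, which in reality scores the next word from the encoder's feature vector and everything generated so far, not just the previous word. Only the greedy word-by-word loop is meant to be representative.

```python
# Toy stand-in for the trained decoder: maps the previous word to scores for
# candidate next words. All words and scores here are invented for illustration.
TOY_SCORES = {
    "<start>": {"a": 0.9, "the": 0.1},
    "a": {"giraffe": 0.7, "group": 0.3},
    "giraffe": {"standing": 0.6, "<end>": 0.4},
    "standing": {"<end>": 0.8, "a": 0.2},
}


def greedy_caption(next_word_scores, max_len=20):
    """Build a caption word by word, always taking the best-scoring next word."""
    words = []
    previous = "<start>"
    for _ in range(max_len):
        scores = next_word_scores(previous)
        if not scores:
            break  # the stand-in model has nothing to propose
        best = max(scores, key=scores.get)
        if best == "<end>":
            break  # the model signals that the sentence is complete
        words.append(best)
        previous = best
    return " ".join(words)


caption = greedy_caption(lambda word: TOY_SCORES.get(word, {}))
print(caption)
```

The `max_len` cap matters in practice: without it, a model that never emits the end token would loop forever.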
One of the main datasets used for training image captioning models is called COCO. It is made of a vast number of images, and each image has 5 different captions that describe it.
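The relationship between images and captions is easy to see in the annotation files. The sketch below assumes the usual COCO captions layout, with an `images` list and an `annotations` list linked by `image_id`. The sample record is hand-written and shortened to two captions for brevity; it is not taken from the dataset.

```python
import json
from collections import defaultdict


def captions_by_image(annotation_json):
    """Group COCO-style caption annotations by the image file they describe."""
    data = json.loads(annotation_json)
    file_names = {img["id"]: img["file_name"] for img in data["images"]}
    grouped = defaultdict(list)
    for ann in data["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann["caption"])
    return dict(grouped)


# Hand-written sample in the assumed COCO captions layout (not real data).
SAMPLE_COCO = """{
  "images": [{"id": 1, "file_name": "giraffes.jpg"}],
  "annotations": [
    {"image_id": 1, "id": 10, "caption": "Two giraffes standing in a field."},
    {"image_id": 1, "id": 11, "caption": "A pair of giraffes near some trees."}
  ]
}"""

for file_name, captions in captions_by_image(SAMPLE_COCO).items():
    print(file_name, len(captions), "captions")
```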
I quickly realized that training the model on my laptop would have required almost 17 days non-stop with the CPU running at full throttle. I had to be realistic, so I downloaded the pre-trained model that was available.
RNNs are certainly not hardware friendly and use an enormous amount of resources for training.
Needless to say, I remained speechless as soon as everything was in place and I was ready to make the model talk for the first time. When I provided the image below, the result was encouraging.
Unfortunately, as I moved forward with the experiments and went from giraffes to more mundane scenery (the team in the office), the results were bizarre, to use a euphemism, and far from being usable in our competitive SEO landscape.
Don’t settle for less than the best model
As I kept experimenting with different images, while happy that I was now able to fully control all the parameters, I had to accept that this implementation of the Show and Tell paper was not good enough for our users. Great for generative poetry, perhaps, but no good for SEO.
While I am still evaluating new alternatives (there is a very promising attention model implementation in TensorFlow that I would love to test), I had to focus on what the industry considers state-of-the-art for this specific task: the Microsoft Vision API. You can play with it directly online on the http://captionbot.com website, and you will see that the results are significantly different from those of my homebrewed image captioning model in PyTorch.
Microsoft wisely offers a freemium model, and you have up to 5,000 API calls per month to get started without opening your wallet.
Fasten your seatbelts and run the analysis
In order to optimize the description of images for anyone running WordPress, I prepared a script in Python that uses the Microsoft Computer Vision API and that you can find on GitHub.
You will need an API key from Microsoft and the export of your WordPress Media Library in XML that can be generated using the WordPress Export Tool.
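For readers who want to see what a single request looks like, here is a minimal sketch using only the Python standard library. It is not the script from GitHub. It assumes the "describe" operation of the Computer Vision REST API, authenticated with the `Ocp-Apim-Subscription-Key` header; the region and API version in `ENDPOINT` are placeholders, and the function names and the sample response are invented for the example.

```python
import json
import urllib.request

# Assumptions: the region ("westeurope") and API version ("v2.0") are
# placeholders; use the endpoint shown for your own Azure resource.
ENDPOINT = "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/describe"


def build_describe_request(image_url, api_key, endpoint=ENDPOINT):
    """Prepare a POST request asking the service to describe one image URL."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def best_caption(response_json):
    """Pick the highest-confidence caption from a describe response."""
    captions = json.loads(response_json).get("description", {}).get("captions", [])
    if not captions:
        return None, 0.0
    top = max(captions, key=lambda c: c["confidence"])
    return top["text"], top["confidence"]


def describe_image(image_url, api_key):
    """Call the API and return (caption, confidence). Needs network and a valid key."""
    request = build_describe_request(image_url, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return best_caption(response.read().decode("utf-8"))


# Hand-written response in the assumed shape, used to exercise the parser offline.
SAMPLE_RESPONSE = (
    '{"description": {"tags": ["giraffe"], "captions": '
    '[{"text": "a giraffe standing in a field", "confidence": 0.8}]}}'
)
print(best_caption(SAMPLE_RESPONSE))
```

Keeping the request building and the response parsing in separate functions makes it possible to check both without spending any of the monthly API calls.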
The result of running the script is a CSV file that contains the URL of the image, the title of the image, the proposed description of the image and a confidence score. This confidence score is very useful to quickly filter the results and to focus your attention where it is needed most (as you can see from the image below, there is a big difference between the first image, which has a score of 0.5, and the image right after it, which has a score of 0.8).
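The confidence-based review described above could be sketched as follows. The column names, the 0.7 threshold and the sample rows are assumptions chosen for illustration; the actual script may name and order its columns differently.

```python
import csv
import io

# Column names are assumptions; the script on GitHub may use different ones.
FIELDS = ["url", "title", "description", "confidence"]


def write_results(rows, handle):
    """Write one CSV row per analysed image, with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def needs_review(handle, threshold=0.7):
    """Return rows whose confidence is below the threshold, lowest first."""
    reader = csv.DictReader(handle)
    flagged = [row for row in reader if float(row["confidence"]) < threshold]
    return sorted(flagged, key=lambda row: float(row["confidence"]))


# Made-up rows mirroring the 0.5 and 0.8 scores mentioned in the text.
SAMPLE_ROWS = [
    {"url": "https://example.com/team.jpg", "title": "team",
     "description": "a group of people", "confidence": 0.5},
    {"url": "https://example.com/giraffes.jpg", "title": "giraffes",
     "description": "a giraffe standing in a field", "confidence": 0.8},
]

buffer = io.StringIO()  # stands in for a file on disk
write_results(SAMPLE_ROWS, buffer)
buffer.seek(0)
for row in needs_review(buffer):
    print(row["url"], row["confidence"])
```

Sorting the flagged rows from lowest confidence upward lets an editor start with the descriptions most likely to be wrong.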
Once the data is validated by an editor using Excel or Google Sheets, it can be imported back into WordPress using any plugin that imports CSV files into the database or a custom script (which I still need to write).
Follow the instructions on GitHub or write me an email if you are interested in doing image SEO with the help of machine learning. The code is far from perfect and has only been tested on a couple of websites (please use it at your own risk).
Experimenting in ML is essential. A great wealth of resources, including pre-trained machine learning models, is available and can encode knowledge to help us in SEO tasks.
While the state-of-the-art neural network from Microsoft still interprets a young Bill Slawski (alongside an even younger Neil Patel) as … yes, a woman, with a proper workflow you can still get very useful results to scale up your SEO productivity for image tagging.
In the coming weeks, we will keep on testing this approach and will hopefully measure some positive impact in terms of organic traffic (this blog post is still really a work in progress). It is also worth continuing to test new ML networks that take advantage of hierarchical neural attention; these new approaches are superseding models based on RNN / LSTM (here is a good article on the topic).