
How Google Advances In Image And Language Understanding

What Challenges Do Multimodal Models Face?

Image-text pairs scraped in bulk from the Internet have proven to be a powerful resource for training artificial intelligence. We have already witnessed the rise of multimodal search and of prominent models like OpenAI's CLIP and DALL-E. These self-supervised AI models have one big advantage: they learn a much more robust representation of visual categories since they do not have to rely on human-defined classifications. In plain English, that means these models can perform image analysis tasks, such as zero-shot classification, without additional training.
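
To make this concrete, here is a minimal sketch of zero-shot classification with the openly available CLIP model via the Hugging Face transformers library; the checkpoint name, image path, and candidate labels are our own illustrative choices, not anything from Google's work.

```python
# Minimal sketch: zero-shot image classification with CLIP, no task-specific training.
# Requires `pip install transformers torch pillow`; image path and labels are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_photo.jpg")  # any local image
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores turned into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")
```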

However, multimodal models still face challenges: they are trained on image data benchmarked against ImageNet and perform poorly on specialist and expert topics. Similar biases show up when CLIP is used in conjunction with a diffusion model, and they are also reflected in Google Image Search. That is why we study and use these models: it helps us do SEO in a multimodal-first world, the direction Google is pivoting toward with MUM.

Introducing The Locked-image Tuning (LiT) Method By Google

Google’s computer scientists created a new method for image analysis that combines the best of both worlds: a multimodal model with powerful image analysis capabilities that needs no retraining for new tasks, yet achieves the precision of specialized models. The key difference is that LiT trains only the text encoder. This departs from the previous multimodal approach, where the image encoder learns image representations while the text encoder simultaneously learns representations of the corresponding text.

Google is changing the game with LiT. An image encoder pre-trained on roughly three billion images serves as the locked tower: its parameters are frozen during the multimodal contrastive training, which ensures that the image encoder and its learned representations are not modified. To tune the text encoder, the AI team used a private dataset of around four billion images with associated text that Google had collected over previous years.
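
To make the "locked" part concrete, here is a minimal PyTorch sketch of contrastive tuning with a frozen image tower. The tiny linear encoders, the dummy batch, and the hyperparameters are stand-ins we made up for illustration; they are not Google's actual models or training code.

```python
# Illustrative LiT-style setup: the image tower is frozen ("locked") and only the
# text tower is trained with a contrastive objective. Encoders here are tiny stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128
image_encoder = nn.Linear(2048, embed_dim)  # stand-in for a pre-trained image tower
text_encoder = nn.Linear(512, embed_dim)    # stand-in for the trainable text tower

# Lock the image tower: its parameters receive no gradient updates.
for p in image_encoder.parameters():
    p.requires_grad = False
image_encoder.eval()

optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)

def contrastive_step(image_feats, text_feats, temperature=0.07):
    """One training step: matching image-text pairs sit on the diagonal of the logits."""
    with torch.no_grad():                              # image representations stay fixed
        img_emb = F.normalize(image_encoder(image_feats), dim=-1)
    txt_emb = F.normalize(text_encoder(text_feats), dim=-1)

    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch of pre-extracted features, just to show the call.
loss = contrastive_step(torch.randn(8, 2048), torch.randn(8, 512))
print(f"contrastive loss: {loss:.3f}")
```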

Does CLIP Perform Better Than The Locked-image Tuning (LiT) Method?

The industry benchmark for computer vision is usually ImageNet. A model trained with the new LiT method achieves 84.5% zero-shot accuracy on ImageNet and, at the same time, 81.1% accuracy on the ObjectNet benchmark, without any additional training.

For comparison, the best fully supervised models reach around 91% on ImageNet, while CLIP achieved around 76% without additional training. On the ObjectNet benchmark, CLIP reached 72.3% accuracy.

In any case, it is worth noting that CLIP has been a true turning point in this area. Ultimately, the power of both CLIP and LiT lies in the same ability: assessing the similarity between an image and a piece of text.

Another Google Advancement: Pix2Seq – A New Language Interface For Object Detection

Object detection is useful for content moderation and image understanding. We wrote about our object detection experiments in our previous post about metaverse SEO. The need to understand varied scenes while avoiding duplicate detections adds to the complexity of localizing only the relevant object instances.

Another challenge is that current approaches based on Faster R-CNN and DETR are highly specialized, which limits the model's ability to generalize to other tasks. The need for a redesign is clear, so the Google team proposed a new approach at ICLR 2022, the Tenth International Conference on Learning Representations, called Pix2Seq.

Pix2Seq casts object detection as a language task on pixel inputs. The new model achieves impressive results on the popular large-scale COCO dataset. The idea is that if a neural network knows where the objects in a given image are located, one simply needs to teach it how to read them out. By learning to "describe" objects, the model can acquire useful object representations based on pixel observations alone.

In simple terms, if we feed an image to the Pix2Seq model, it outputs a sequence of object descriptions, where each object is described by the coordinates of its bounding box corners and a class label, just like in the picture below.

Source: Google blog
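
To give a feel for this language-like interface, the sketch below decodes such a token sequence back into boxes and labels. The five-tokens-per-object layout follows the paper's description, but the bin count, class vocabulary, and example tokens are simplified assumptions of ours, not Google's released code.

```python
# Simplified sketch of a Pix2Seq-style output: each object is 5 tokens,
# [ymin, xmin, ymax, xmax, class]. Coordinates are quantized into `num_bins`
# discrete tokens; the class names below are made-up examples.
num_bins = 1000                       # coordinate vocabulary size (assumption)
class_names = {0: "person", 1: "dog", 2: "bicycle"}

def decode_sequence(tokens, image_width, image_height):
    """Turn a flat token sequence into (box, label) pairs."""
    objects = []
    for i in range(0, len(tokens) - len(tokens) % 5, 5):
        ymin, xmin, ymax, xmax, cls = tokens[i:i + 5]
        # De-quantize coordinate tokens back to pixel values.
        box = (
            xmin / (num_bins - 1) * image_width,
            ymin / (num_bins - 1) * image_height,
            xmax / (num_bins - 1) * image_width,
            ymax / (num_bins - 1) * image_height,
        )
        objects.append((box, class_names.get(cls, f"class_{cls}")))
    return objects

# Example: two predicted objects for a 640x480 image.
tokens = [100, 150, 500, 600, 1, 50, 50, 900, 950, 0]
for box, label in decode_sequence(tokens, 640, 480):
    print(label, [round(v, 1) for v in box])
```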

What Do Pix2Seq And LiT Mean For The SEO World?

Object detection in images will provide an extra layer of ranking signals that cannot easily be obtained or reverse-engineered, which will make things harder for SEOs already juggling more than 200 ranking factors. At the same time, it could also be abused as a shady technique for "object stuffing" in images or metaverse-like environments, which will pose a challenge for Google in recognizing quality visual environments at scale.

It is worth noting that current SEO software cannot analyze images and videos, both because of the complexity of the analysis and because of the sheer computing power it requires.

Looking on the positive side, images (and videos) can finally "talk" much like text already does: these new models and methods can be integrated into Big G's products such as Google Photos, Google Image Search, YouTube, and self-driving cars. The implications of these advancements for SEO are significant, and so is the importance of having semantically rich data to train these models. Today we can quickly train CLIP (as we already do for multimodal search) to help us detect features from a product image, and this is something that we SEOs should not underestimate.
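
As a hedged sketch of what that could look like in practice, the snippet below ranks a handful of attribute prompts against a product image using CLIP embeddings, for example to suggest alt-text or structured-data candidates; the prompts, file name, and scoring are our own illustrative choices, not a standard SEO workflow.

```python
# Illustrative only: rank made-up attribute prompts against a product image with CLIP
# embeddings, e.g. to suggest alt-text or structured-data candidates for SEO.
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

attributes = ["leather upper", "rubber sole", "bright red color",
              "laced closure", "waterproof fabric"]
image = Image.open("product.jpg")  # hypothetical product photo

# Embed image and texts separately, then compare with cosine similarity.
img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
txt_emb = model.get_text_features(**processor(text=attributes, return_tensors="pt", padding=True))
img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)

scores = (img_emb @ txt_emb.t())[0]
for attr, score in sorted(zip(attributes, scores.tolist()), key=lambda x: -x[1]):
    print(f"{attr}: {score:.3f}")
```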

To sum it up, the future is already here – organic results will not look the way they do today: quality images, well-defined objects, and themed imagery are likely to become central.