What type of Structured Data do I need for the Homepage? 👨‍🏫

Some Schema.org types are beneficial for most businesses out there. If you have a website, you want to help search engines index its content in the simplest and most effective way, and to do that you can start from…well, the most important page: your homepage. Technical SEO experts like Cindy Krum describe schema markup (as well as XML feeds like the one you can provide to Google Shopping via the Google Merchant Center) as your new sitemap. And it is true: when crawling a website (whether you are Google or any other automated crawler you might think of), getting the right information about it is a goldmine.

Let’s get started with our homepage. We want to let Google know from our homepage the following:

  • The organization behind the website (Publisher)
    • The logo of this organization 
    • The URL of the organization 
    • The contact information of the organization 
  • The name of the website 
  • The tagline of the website
  • The URL of the website
  • How to use the internal search engine of the website
  • The Sitelinks (the main links of the website)

We can do all of this by implementing the WebSite structured data type on the homepage of our website. A few more indications from Google on this front:

  • Add this markup only to the homepage, not to any other pages
    • 🚨 Very important 🚨: unfortunately, on a lot of websites you still find this markup on every single page. This should not happen: it is unnecessary.
  • Add one SearchAction for the website, and optionally another if you support app search (if you have a mobile app, this will help users searching from a mobile device continue their journey in the app).

Let’s have a quick look at a couple of examples:
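As a reference, here is a minimal sketch of what the WebSite markup could look like, built with a few lines of Python (every name and URL below is a placeholder for your own site’s values):

```python
import json

# A minimal sketch of the WebSite markup described above.
# Every name and URL is a placeholder for your own site.
website_markup = {
    "@context": "https://schema.org",
    "@type": "WebSite",
    "name": "Example Site",
    "alternateName": "An example tagline",  # the tagline of the website
    "url": "https://www.example.com/",
    "publisher": {  # the organization behind the website
        "@type": "Organization",
        "name": "Example Organization",
        "url": "https://www.example.com/",
        "logo": {
            "@type": "ImageObject",
            "url": "https://www.example.com/logo.png",
        },
        "contactPoint": {
            "@type": "ContactPoint",
            "telephone": "+1-000-000-0000",
            "contactType": "customer support",
        },
    },
    "potentialAction": {  # how to use the internal search engine
        "@type": "SearchAction",
        "target": "https://www.example.com/?s={search_term_string}",
        "query-input": "required name=search_term_string",
    },
}

# Print the JSON-LD to embed in a <script type="application/ld+json"> tag.
print(json.dumps(website_markup, indent=2))
```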

Ready to automate the structured data markup on your website? Contact us!

Google’s core June 2019 algorithm change: The AI told us what’s happening [Interview]

None of us is able to make sense of the algorithm changes that are happening with Google’s latest core update, yet we have dozens of clients asking us about it. So we decided to ask the AI what’s happening with this update. If humans aren’t able to keep up with the machine, let the machine tell us what’s going on.

Surprisingly we got some interesting answers (questions are from humans – answers come from a neural network we trained)!

When was the update released? 

AI: Monday, June 3, here is a summary of what we know so far. Sources say the update was pre-announced and rolled out on June 3rd. It is still very early, but Google is tweaking its algorithm and is processing the data.

What are the core updates?

AI: Core Updates are Google Updates that do not have a clear specific focus on a certain kind of search query or particular website characteristics. Instead, they make more subtle changes “under the hood”. While other major, well-known Google Updates like Panda or Penguin specifically target things like content quality or backlinks, Core Updates tend to impact websites in a wide range of areas. These updates to Google’s core algorithm are released several times a year. The first Google Core Update was recorded by Searchmetrics in January 2016.

What can I do if my rankings collapse after a Google Update?

AI: At the end of 2018, Google provided some explanation of its Core Updates, stating that there aren’t any specific errors that webmasters can correct to recover lost rankings after a Core Update. Instead, Google recommends offering users the best-possible content – this is the best way of improving rankings. Google’s recommended first step towards achieving this is to study the Google Search Quality Rater Guidelines. This document is designed to help quality raters understand, in not inconsiderable detail, how to classify and evaluate content.

Can you tell us more about the DailyMail?

AI: The SEO director of the website even seems to have taken to Google’s help forums to report a huge drop, as spotted by the blog Search Engine Roundtable. He added that they saw their “Discover traffic drop by 90%” and it “has not improved.”

What sites have been impacted the most?

AI: As of now, the update’s made some major waves. As reported by Search Engine Land, early data from Sistrix, RankRanger, SearchMetrics, and Moz suggests there have been some noticeable fluctuations in the rankings. Sistrex reports changes from 05.06. on 06.06 in the Google UK Index. They also note that the range of websites affected seems to be wider than previous updates. While the most recent updates were thought to have the most effect on YMYL sites, these ones appear to have widened the net to news sites, retail sites, etc.

Is Google evil? 

AI: Google says the new core update is “definitely noticeable.”

And since beginning its official rollout, that certainly seems to be the case.

How we did this

We trained a neural network using TensorFlow and GPT-2, a recently released large transformer-based language model from the team at OpenAI.
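The post doesn’t include the training code, but as a rough sketch, fine-tuning GPT-2 on a corpus of SEO news can be done with Max Woolf’s gpt-2-simple library (the corpus file name and the number of steps below are illustrative assumptions, not our exact setup):

```python
import gpt_2_simple as gpt2

# Download the small GPT-2 model released by OpenAI.
gpt2.download_gpt2(model_name="124M")

# Fine-tune on a plain-text corpus of SEO news and Q&As.
# "seo_corpus.txt" and the step count are illustrative assumptions.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "seo_corpus.txt", model_name="124M", steps=1000)

# Ask the model a question and let it complete the answer.
gpt2.generate(sess, prefix="When was the update released?", length=100)
```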

Are you ready to let the AI drive the traffic of your website?

Start your WordLift trial today!

SEO is a fantastic field to work in. There is always a new challenge to cope with and new things we can learn to keep our traffic steady and find the right audience. Core updates from Google are events that shake the entire publishing and SEO industry, as they can have a tectonic impact on traffic and search rankings; yet the dynamics of these updates remain obscure and can only be decoded after several weeks, on a case-by-case basis.

Introducing Semantic Web Analytics

We constantly work for content-rich websites where sometimes hundreds of new articles are published on a daily basis. Analyzing traffic trends on these large properties and creating actionable reports is still time-consuming and inefficient. This is also very true for businesses investing in content marketing that need to dissect their traffic and evaluate their marketing efforts against concrete business goals (i.e. increasing subscriptions, improving e-commerce sales and so on).

As a result of this experience, I am happy to share with you a Google Data Studio report that you can copy and personalize for your own needs.

Jump directly to the dashboard for Google Data Studio: Semantic Analytics by WordLift

Data is meant to help transform organizations by providing them with answers to pressing business questions and uncovering previously unseen trends. This is particularly true when your biggest asset is the content that you produce.

With the ongoing growth of digitized data and the explosion of web metrics, organizations usually face two challenges:

  1. Finding what is truly relevant to unlock a new business opportunity.
  2. Making it simpler for business users to prepare and share the data, without being data scientists.

Semantic Web Analytics is about delivering on these promises: empowering business users and letting them uncover new insights from the analysis of their website’s traffic.

We are super lucky to have a community of fantastic clients that help us shape our product and keep pushing us ahead of the curve.

Before enabling this feature, both the team at Salzburgerland Tourismus and the team at TheNextWeb had already improved their Google Analytics tracking code to store entity data as events. This allowed us to experiment, ahead of time, with this functionality before making it available to all other subscribers.

What is Semantic Web Analytics?

Semantic Web Analytics is the use of named entities and linked vocabularies such as schema.org to analyze the traffic of a website.

The natural language processing that WordLift uses to mark up the content with linked entities enables us to classify articles and pages in Google Analytics with real-world objects, events, situations or even abstract concepts.

How to activate Semantic Web Analytics?

Starting with WordLift 3.20, entities annotated in webpages can also be sent to Google Analytics by enabling the feature in WordLift’s Settings panel.

WordLift Settings

Here is how this feature can be enabled.

You can also define the dimensions in Google Analytics to store entity data; this is particularly useful if you are already using custom dimensions.

As soon as the data starts flowing you will see a new category under Behaviour > Events in your Google Analytics.

Events in Google Analytics about named entities.

WordLift will trigger an event labeled with the title of the entity every time a page containing an annotation with that entity is opened.
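To make this concrete, here is a sketch of what such an event could look like if sent through the Google Analytics Measurement Protocol (the event category and action below are illustrative assumptions; WordLift’s actual labels may differ):

```python
import requests

# A sketch of an entity event sent via the (Universal Analytics)
# Measurement Protocol. Category and action names are assumptions.
payload = {
    "v": "1",                        # protocol version
    "tid": "UA-XXXXXXX-1",           # your GA tracking ID (placeholder)
    "cid": "555",                    # anonymous client ID
    "t": "event",                    # hit type
    "ec": "Mentions",                # event category (assumption)
    "ea": "view",                    # event action (assumption)
    "el": "Artificial Intelligence", # event label: the entity title
}

response = requests.post("https://www.google-analytics.com/collect", data=payload)
print(response.status_code)  # 200 means the hit was accepted
```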

Using these new events we can look at how content is consumed not only in terms of URLs and site categories but also in terms of entities. Moreover, we can investigate how articles are connected with entities and how entities are connected with articles.

Show me how this can impact my business

Making sense of data for a business user is about unlocking its power with interactive dashboards and beautiful reports. To inspire our clients, and once again with the help of online marketing ninjas like Martin Reichhart and Rainer Edlinger from Salzburgerland, we have built a dashboard using Google Data Studio – a free tool that helps you create comprehensive reports using data from different sources.

Using this dashboard we can immediately see, for each section of the website, which concepts are driving the traffic, which articles are associated with these concepts and where the traffic is coming from.

An overview of the entities that drive the traffic on our website.

We can also see the entities associated with a given article. Here below you can see the entities mentioned in the article Implementing Structured Data for SEO with Bill Slawski.

Entities associated with an article about structured data.

This helps publishers and business owners analyze the value behind a given topic. It can be invaluable for analyzing the behaviors and interests of a specific user group. For example, on travel websites we can immediately see the most relevant topics for, let’s say, Italian-speaking and German-speaking travelers.

WordLift’s clients in the news and media sector are also using this data to build new relationships with advertisers and affiliated businesses. They can finally bring into meetings the exact traffic volumes they have for, let’s say, content that mentions a specific product or a category of products. This helps them calculate in advance how this traffic can be monetized.

Are you ready to make sense of your Google Analytics data? Contact us and let’s get started!

Here is the recipe for a Semantic Web Analytics dashboard in Google Data Studio 

With unlimited, free reports, it’s time to start playing immediately with Data Studio and entity data and see if and how it meets your organization’s needs.

To help with that, you can use the report I have just created as a starting point. Create your own interactive report and share it with colleagues and partners (even if they don’t have direct access to your Google Analytics).

Simply take this report, make a copy, and plug in your own data!

Instructions

1. Make a Copy of this file

Go to the File menu and click to make a copy of the report. If you have never used Data Studio before, click to accept the terms and conditions, and then redo this step.

2. Do Not Request Access

Click “Maybe Later” when Data Studio warns you that data sources are not attached. If you click “Resolve” by mistake, do not click to request access – instead, click “Done”.

3. Switch Edit Toggle On

Make sure the “Edit” toggle is switched on. Click the text link to view the current page settings. The GA Demo Account data will appear as an “Unknown” data source there.

4. Create A New Data Source

If you have not created any data sources yet, you’ll see only sample data under “Available Data Sources” – in that case, scroll down and click “Create New Data Source” to add your own GA data to the available list.

5. Select Your Google Analytics View

Choose the Google Analytics connector, and authorize access if you aren’t signed in to GA already. Then select your desired GA account, property, and the view from each column.

6. Connect to Your GA Data

Name your data source (at the top left), or let it default to the name of the GA view. Click the blue “Connect” button at the top right.

Are you ready to build your first Semantic Dashboard? Add me on LinkedIn and let’s get started!

Read more about WordLift’s new Content Dashboard that combines entities with search rankings.

We take on a small handful of client projects each year to help them boost their qualified traffic via our SEO Management Service.

Do you want to be part of it?

Yes, send me a quote!

The Ultimate Checklist to Optimize Content for Google Discover

The shift from keyword search to a queryless way to get information has arrived

Google Discover is an AI-driven content recommendation tool included with the Google Search app. Here is what we learned from the data available in the Google Search Console.

Google introduced Discover in 2017 and claims that there are already 800M active users consuming content through this new application. A few days back, Google added to the Google Search Console statistical data on the traffic generated by Discover. This is meant to help webmasters, and publishers in general, understand what content ranks best on this new platform and how it might differ from the content ranking on Google Search.

 

What was very shocking for me to see, on some of the large websites we work on with our SEO management service, is that between 25% and 42% of the total number of organic clicks are already generated by this new recommendation tool. I did expect Discover to drive a significant amount of organic traffic, but I totally underestimated its true potential.

A snapshot from GSC on a news and media site

In Google’s AI-first approach, organic traffic is no longer solely dependent on queries typed by users in the search bar.

This has a tremendous impact on content publishers, business owners and the SEO industry as a whole.

Machine learning is working behind the scenes to harvest data about users’ behaviors, to learn from this data and to suggest what is relevant for them at a specific point in time and space.

Let’s have a look at how Google explains how Discover works.

From www.blog.google

[…] We’ve taken our existing Knowledge Graph—which understands connections between people, places, things and facts about them—and added a new layer, called the Topic Layer, engineered to deeply understand a topic space and how interests can develop over time as familiarity and expertise grow. The Topic Layer is built by analyzing all the content that exists on the web for a given topic and develops hundreds and thousands of subtopics. For these subtopics, we can identify the most relevant articles and videos—the ones that have shown themselves to be evergreen and continually useful, as well as fresh content on the topic. We then look at patterns to understand how these subtopics relate to each other, so we can more intelligently surface the type of content you might want to explore next.

Embrace Semantics and publish data that can help machines be trained.

Once again, the data that we produce sustains and nurtures this entire process. Here is an overview of the contextual data, besides the Knowledge Graph and the Topic Layer, that Google uses to train the system:

To learn more about Google’s work on query prediction, I would suggest you read an article by Bill Slawski titled “How Google Might Predict Query Intent Using Contextual Histories“.

What I learned by analyzing the data in GSC

This research is limited to data gathered from three websites only; while the sample was small, a few patterns emerged:

  1. Google tends to distribute content between Google Search and Google Discover (the highest overlap I found was 13.5% – these are pages that, since Discover data has been collected in GSC, have received traffic from both channels; a quick way to compute this overlap is sketched after this list).
  2. Pages in Discover do not show the highest engagement in terms of bounce rate or average time on page when compared to all other pages on a website. They are relevant for a specific intent and well curated, but I didn’t see any correlation with social metrics.
  3. Traffic seems to arrive in a 48-hour or 72-hour spike, as already seen for top stories.
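Here is the quick overlap computation mentioned in the first point, as a minimal sketch assuming two page-level CSV exports (file and column names are placeholders):

```python
import pandas as pd

# Two CSV exports from Google Search Console: pages with Discover
# traffic and pages with Search traffic. Names are placeholders.
discover = pd.read_csv("discover_pages.csv")  # e.g. columns: Page, Clicks
search = pd.read_csv("search_pages.csv")

discover_pages = set(discover["Page"])
search_pages = set(search["Page"])

both = discover_pages & search_pages
overlap = len(both) / len(discover_pages) * 100

print(f"{overlap:.1f}% of Discover pages also received Search traffic")
```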

To optimize your content for Google Discover, here is what you should do.

1. Make sure you have an entity in the Google Knowledge Graph or an account on Google My Business

Entities in the Google Knowledge Graph need to be created in order for Discover to be able to recognize them.

Results for WordLift

For business owners

Either your business (or product) is already in the Google Knowledge Graph, or it is not. If it is not, there is no chance that the content you write about your company or product will appear in Discover (unless this content is bound to other, broader topics). I am able to read articles about WordLift in my Discover stream since WordLift has an entity in the Google Knowledge Graph. From the configuration screenshot above we can actually see there are indeed more entities when I search for “WordLift”:

  • one related to Google My Business (WordLift Software Company in Rome is the label we use on GMB),
  • one from the Google Knowledge Graph (WordLift Company)
  • one presumably about the product (without any tagline)
  • one about myself as CEO of the company

So, get into the graph and make sure to curate your presence on Google My Business. Very interestingly, the relationship between myself and WordLift is such that, when looking for WordLift, Google also shows Andrea Volpini as a potential topic of interest.

In these examples, we see how, from Google Search, I can start following people that are already in the Google Knowledge Graph, and what the user experience in Discover looks like for content related to the entity WordLift.

2. Focus on high-quality content and a great user experience

It is also good to remember that quality, in terms of both the content you write (alignment with Google’s content quality policies) and the user experience on your website, is essential. A website that loads on a mobile connection in 10 seconds or more is not going to be featured in Discover. A clickbait article, with more ads than content, is not going to be featured in Discover. An article written by copying other websites and patently infringing copyright laws is not likely to be featured in Discover either.

3. Be relevant and write content that truly helps people by responding to their specific information need

Recommendation tools like Discover only succeed when they are capable of enticing the user to click on the suggested content. To do so effectively, they need to work with content designed to answer a specific request. Let’s see a few examples: “I am interested in SEO” (entity “Search Engine Optimization“), or “I want to learn more about business models” (entity “Business Model”).

The more we can match the intent of the user, in a specific context (or micro-moment if you like), the more we are likely to be chosen by a recommendation tool like Discover.

4. Always use an appealing hi-res image and a great title

Images play a very important role in Google‘s card-based UI as well as in Discover. Whether you are presenting a cookie recipe or an article, the image you choose will be presented to the user and will play its part in enticing the click. Besides the editorial quality of the image, I also suggest you follow the AMP requirements for images (the smallest side of the featured image should be at least 1.200 px). Similarly, a good title, much like in the traditional SERP, is super helpful in driving clicks.
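As a quick sanity check, a few lines of Python with the Pillow library can verify that a featured image meets that size requirement (the file name is a placeholder):

```python
from PIL import Image

# Check that the smallest side of the featured image is at least 1.200 px,
# as per the AMP image requirement mentioned above.
img = Image.open("featured-image.jpg")  # placeholder path
smallest_side = min(img.size)  # img.size is (width, height)

if smallest_side >= 1200:
    print(f"OK: smallest side is {smallest_side}px")
else:
    print(f"Too small: {smallest_side}px (at least 1200px needed)")
```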

5. Organize your content semantically

Much like Google does, using tools like WordLift, you can organize content with semantic networks and entities. This allows you to: a) help Google (and other search engines) gather more data about “your” entities; b) organize your content the same way Google does (and therefore measure its performance by looking at topics and not just pages and keywords); c) train your own ML models to help you make better decisions for your business.

Let me give you a few examples. If I provide, let’s say, information about our company and the industry we work in using entities that Google can crawl, Google‘s AI will be able to connect content related to our business with people interested in “startups”, “seo” and “artificial intelligence“. Machine learning, as we usually say, is hungry for data, and semantically rich data is what platforms like Discover use to learn how to be relevant.

If I look at the traffic I generate on my website, not only in terms of pages and keywords but using entities (as we do with our new search rankings dashboard or the Google Analytics integration), I can quickly see what content is relevant for a given topic and improve it.

WordLift Dashboard

Use entities to analyze how your content is performing on organic search

Here below is a list of pages we have annotated with the entity “Artificial Intelligence“. Are these pages relevant for someone interested in AI? Can we do a better job of helping these people learn more about this topic?

A detail of the WordLift dashboard

A few of the articles tagged with the entity “Artificial Intelligence” and their respective queries

Learn more about Google Discover – Questions & Answers

Below is a list of questions that I have answered over these past days, as data from Discover was made available in GSC. I hope you’ll find it useful too.

How does Discover work from the end-user perspective?

The suggestions in Discover are entity-based. Google groups content that it believes relevant using entities in its Knowledge Graph (i.e. “WordLift”, “Andrea Volpini”, “Business” or “Search Engine Optimization“). Entities are called topics. The content-based user filtering algorithm behind Discover can be configured from a menu in the application (“Customize Discover”) and fine-tuned over time by providing direct feedback on the recommended content in the form of “Yes, I want more of this” or “No, I am not interested”. Using Reinforcement Learning (a specific branch of Machine Learning) and Neural Matching (different ways of understanding what the content is about), the algorithm is capable of creating a personalized feed of information from the web. New topics can be followed by clicking on the “+” sign.

Topics are organized in a hierarchy of categories and subcategories (such as “Sport”, “Technology”). Read more here on how to customize Google Discover.

How can I access Discover?

On most Android devices, accessing Discover is as simple as swiping from the home screen to the right.

Is Google Discover available only in the US?

No, Google Discover is already available worldwide and in multiple languages; it is part of the core search experience on all Android devices and on any iOS device with the Google Search app installed. Discover is also available in Google Chrome.

Do I have to be on Google News to be featured in Discover?

No, Google Discover also uses content that is not published on Google News. It is more likely that a news site will appear on Google Discover due to the amount of content published every day and the different topics that a news site usually covers.

Is evergreen content eligible for Discover, or only freshly published articles?

Evergreen content that fits a specific information need is as important as newsworthy content. I spotted an article from FourWeekMBA.com (Gennaro’s blog on business administration and management) that was published 9 months ago under the entity “business”.

FourWeekMBA on Discover

Does a page need to rank high on Google Search to be featured in Discover?

Quite interestingly, on a news website where I analyzed the GSC data, only 13.5% of the pages featured in Discover had received traffic on Google Search. Pages that received traffic on both channels had a position on Google Search <=8.

Correlation of Google Discover Clicks and Google Search Position

How can I measure the impact of Discover from Google Analytics?

A simple way is to download the .csv file containing all the pages listed in the Discover report in GSC and create an advanced filter in Google Analytics under Behaviour > Site Content > All pages with the following combination of parameters:

Filtering all pages that have received traffic from Discover in Google Analytics
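If the export is long, a small script can assemble the filter expression for you. Here is a sketch, assuming the GSC export has a “Page” column with full URLs (note that GA caps the length of a regex filter, so long lists need to be split into chunks):

```python
import pandas as pd
from urllib.parse import urlparse

# Turn the GSC Discover export into a regex for a GA advanced filter
# ("Page" matching RegExp). File and column names are placeholders.
df = pd.read_csv("discover.csv")

# GA reports page paths, so keep only the path part of each URL.
paths = [urlparse(url).path for url in df["Page"]]

# Join the paths into one alternation; split into chunks if GA
# rejects the filter for being too long.
print("|".join(paths))
```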

Discover is yet another important step in the evolution of search engines into answer and discovery machines that help us sift through today’s content multiverse.

Keep following us, and give WordLift a spin with our free trial!

How to build a keyword suggestion tool using TensorFlow

One of the most fascinating features of deep neural networks applied to NLP is that, provided with enough examples of human language, they can generate text and help us discover many of the subtle variations in meaning. In a recent blog post by Google research scientist Brian Strope and engineering director Ray Kurzweil, we read:

“The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc.”

Following this hierarchical structure, new computational language models aim at simplifying the way we communicate and have silently entered our daily lives: from Gmail’s “Smart Reply” feature to the keyboard in our smartphones, recurrent neural networks and character- and word-level prediction using LSTMs (Long Short-Term Memory networks) have paved the way for a new generation of agentive applications.

From keyword research to keyword generation

As usual with my AI-powered SEO experiments, I started with a concrete use case. One of our strongest publishers in the tech sector was asking us for new, unexplored search intents to invest in with articles and how-to guides. Search marketers, copywriters and SEOs have spent the last 20 years scouting for the right keyword to connect with their audience. While there is a large number of tools available for doing keyword research, I thought: wouldn’t it be better if our client could have a smart auto-complete to generate any number of keywords in their semantic domain, rather than keyword data curated by us? The way a search intent (or query) can be generated, I also thought, is quite similar to the way a title could be suggested during the editing phase of an article. And titles (or SEO titles), with a trained language model that takes into account what people search for, could help us find the audience we’re looking for in a simpler way.

Jump directly to the code: Interactive textgenrnn Demo w/ GPU for keyword generation

The unfair advantage of Recurrent Neural Networks

What makes RNNs “more intelligent” compared to feed-forward networks is that, rather than working on a fixed number of steps, they compute over sequences of vectors. They are not limited to processing only the current input: they also take into account everything they have perceived previously in time.

A diagram of a Simple Recurrent Network by Jeff Elman

This characteristic makes them particularly efficient in processing human language (a sequence of letters, words, sentences, and paragraphs) as well as music (a sequence of notes, measures, and phrases) or videos (a sequence of images).

RNNs, as I learned from Andrej Karpathy’s seminal blog post on their effectiveness, are considered Turing-complete: this basically means that they can potentially simulate arbitrary programs.

RNN vs FFNN

Here above you can see the difference between a recurrent neural network and a feed-forward neural network. Basically, RNNs have a short-term memory that allows them to store the information processed at previous steps: the hidden state is looped back as part of the input. LSTMs are an extension of RNNs whose goal is to “prolong” or “extend” this internal memory, allowing them to remember previous words, previous sentences or any other value from the beginning of a long sequence.

The LSTM cell where each gate works like a perceptron.

Imagine a long article where I explain at the beginning that I am Italian, and this information is then followed by, let’s say, another 2.000 words. An LSTM is designed in such a way that it can “recall” that piece of information while processing the last sentence of the article and use it to infer, for example, that I speak Italian. A common LSTM cell is made of an input gate, an output gate and a forget gate. The cell remembers values over a time interval, and the three gates regulate the flow of information into and out of the cell, each working much like a mini neural network. In this way, LSTMs overcome the vanishing gradient problem of traditional RNNs.
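To make the three gates tangible, here is a minimal NumPy sketch of a single LSTM time step (dimensions and weights are random placeholders, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: the input, forget and output gates regulate
    what enters, stays in and leaves the cell state c."""
    z = W @ x + U @ h_prev + b  # all four pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # the three gates
    g = np.tanh(g)            # candidate cell value
    c = f * c_prev + i * g    # forget part of the old memory, add the new
    h = o * np.tanh(c)        # new hidden state
    return h, c

# Illustration with random weights: 16-dim input, 32-dim hidden state.
n_in, n_hid = 16, 32
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
print(h.shape, c.shape)  # (32,) (32,)
```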

If you want to dig deeper into the mathematics behind recurrent neural networks and LSTMs, go ahead and read this article by Christopher Olah.

Let’s get started: “Io sono un compleanno!”

After reading Andrej Karpathy’s blog post I found a terrific Python library called textgenrnn by Max Woolf. This library is built on top of TensorFlow and makes it super easy to experiment with Recurrent Neural Networks for text generation.

Before looking at generating keywords for our client I decided to learn text generation and how to tune the hyperparameters in textgenrnn by doing a few experiments.

AI is interdisciplinary by definition: the goal of every project is to bridge the gap between computer science and human intelligence.

I started my tests by throwing into the process a large text file in English that I found on Peter Norvig’s website (https://norvig.com/big.txt) and I ended up, thanks to the help of Priscilla (a clever content writer collaborating with us), “resurrecting” David Foster Wallace with his monumental Infinite Jest (provided in Italian from Priscilla’s ebook library and spiced up with some of her random writings).

At the beginning of the training process – in a character-by-character configuration – you can see exactly what the network sees: a nonsensical sequence of characters that, a few epochs (training iteration cycles) later, will transform into proper words.

As I became more accustomed to the training process I was able to generate the following phrase:

“Io sono un compleanno. Io non voglio temere niente? Come no, ancora per Lenz.”

I’m a birthday. I don’t want to fear anything? And, of course, still for Lenz.

David Foster Wallace

Unquestionably a great piece of literature 😅 that gave me the confidence to move ahead in creating a smart keyword suggestion tool for our tech magazine.

The dataset used to train the model

As soon as I was confident enough to get things working (this basically means being able to find a configuration that – with the given dataset – could produce a language model with a loss value equal to or below 1.0), I asked Doreid, our SEO expert, to work on WooRank’s API and prepare a list of 100.000 search queries that could be relevant for the website.

To scale up the numbers, we began by querying Wikidata to get a list of software for Windows that our readers might be interested in reading about. As for any ML project, data is the most strategic asset. So while we want to be able to generate never-seen-before queries, we also want to train the machine on something that is unquestionably good from the start.

The best way to connect words to concepts is to define a context for these words. In our specific use case, the context is primarily represented by software applications that run on the Microsoft Windows operating system. We began by slicing the Wikidata graph with a simple query that provided us with a list of 3.780+ software apps that run on Windows and 470+ related software categories. By expanding this list of keywords and categories, Doreid came up with a CSV file containing the training dataset for our generator.
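For reference, here is a sketch of the kind of slicing query we ran against the Wikidata endpoint (P306 is the “operating system” property, Q1406 is Microsoft Windows, Q7397 is software):

```python
import requests

# List software that runs on Microsoft Windows, with English labels.
# P31/P279* = instance/subclass of, P306 = operating system.
query = """
SELECT ?app ?appLabel WHERE {
  ?app wdt:P31/wdt:P279* wd:Q7397 ;
       wdt:P306 wd:Q1406 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "keyword-research-sketch/0.1"},  # placeholder UA
)
for row in response.json()["results"]["bindings"]:
    print(row["appLabel"]["value"])
```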

 

The first rows in the training dataset.

After several iterations, I was able to define the top-performing configuration by applying the values below. I moved from character-level to word-level prediction, and this greatly increased the speed of the training. As you can see, I have 6 layers with 128 cells on each layer, and I am running the training for 100 epochs. This is limited, depending on the size of the dataset, by the fact that Google Colab stops the session after 4 hours of training (also a gentle reminder that it might be the right time to move from Google Colab to Cloud Datalab, the paid version in Google Cloud).

Textgenrnn configuration
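In plain TensorFlow/Keras terms, the configuration above corresponds roughly to the following sketch (the vocabulary size and sequence length are placeholders; textgenrnn’s real model also concatenates layer outputs and adds an attention layer on top):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Word-level model with 6 recurrent layers of 128 cells each.
# vocab_size and max_length are placeholders for the real dataset values.
vocab_size, max_length = 20000, 10

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
for _ in range(5):
    model.add(LSTM(128, return_sequences=True))  # rnn_layers: 6, rnn_size: 128
model.add(LSTM(128))  # the sixth recurrent layer returns a single vector
model.add(Dense(vocab_size, activation="softmax"))  # next-word probabilities

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```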

Here we see the initial keywords being generated while training the model

Rock & Roll, the fun part

After a few hours of training, the model was ready to generate our never-seen-before search intents with a simple Python script containing the following lines.
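The snippet is reconstructed below as a sketch from textgenrnn’s documented API (the file names are the defaults the library saves during training; yours may differ):

```python
from textgenrnn import textgenrnn

# Load the trained model and generate brand-new queries.
textgen = textgenrnn(
    weights_path="textgenrnn_weights.hdf5",
    vocab_path="textgenrnn_vocab.json",
    config_path="textgenrnn_config.json",
)
textgen.generate(20, temperature=0.5)
```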

Here are a few examples of generated queries:

where to find google drive downloads
where to find my bookmarks on google chrome
how to change your turn on google chrome
how to remove invalid server certificate error in google chrome
how to delete a google account from chrome
how to remove google chrome from windows 8 mode
how to completely remove google chrome from windows 7
how do i remove google chrome from my laptop

You can play with the temperature to increase the creativity of the results, or provide a prefix to indicate the first words of the keyword you have in mind and let the generator figure out the rest.
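For example, with the model loaded above:

```python
# A higher temperature means more creative (and riskier) suggestions;
# a prefix constrains the first words and lets the model complete the rest.
textgen.generate(5, temperature=1.0, prefix="how to remove")
```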

Takeaways and future work

The “Smart Reply” approach can be applied to keyword research, and it is worth assessing in a systematic way the quality of these suggestions in terms of:

  • validity – is this meaningful or not? Does it make sense for a human?
  • relevance – is this query really hitting on the target audience the website has? Or is it off-topic? and
  • impact – is this keyword well-balanced in terms of competitiveness and volume considering the website we are working for?

The initial results are promising: all of the initial 200+ generated queries were different from the ones in the training set and, by increasing the temperature, we could explore new angles on an existing topic (e.g. “where is area 51 on google earth?”) or even evaluate new topics (e.g. “how to watch android photos in Dropbox” or “advertising plugin for google chrome”).

It would be simply terrific to implement – with a Generative Adversarial Network (or using Reinforcement Learning) – a way to help the generator produce only valuable keywords (keywords that, given the website, are valid, relevant and impactful in terms of competitiveness and reach). Once again, it is crucial to define the right mix of keywords we need to train our model (can we source them from a graph as we did in this case? Shall we only use the top-ranking keywords from our best competitors? Should we mainly focus on long-tail, conversational queries and leave out the rest?).

One thing that emerged very clearly is that experiments like this one (combining LSTMs with data sourced from public knowledge graphs such as Wikidata) are a great way to shed some light on how Google might be working to improve the evaluation of search queries using neural nets. What is now called “Neural Matching” may well be just a sexy PR expression but, behind the recently announced capability of analyzing long documents and evaluating search queries, it is fair to expect that Google is using RNN architectures, contextual word embeddings and semantic similarity. As deep learning and AI in general become more accessible (frameworks are open source and there is healthy knowledge sharing in the ML/DL community), it becomes evident that Google leads the industry thanks to the amount of data it has access to and the computational resources it controls.

Credits

This experiment would not have been possible without textgenrnn by Max Woolf and TensorFlow. I am also deeply thankful to all of our VIP clients engaging in our SEO management services, our terrific VIP team – Laura, Doreid, Nevine and everyone else constantly “lifting” our startup – Theodora Petkova for challenging my robotic mind 😅 and my beautiful family for sustaining my work.

