With the rise of knowledge graphs (KGs), interlinking KGs has attracted a lot of attention. Finding similar entities among KGs plays an essential role in knowledge integration and KG connection. It can help end-users and search engines more effectively and easily access pertinent information across KGs.
In this blog post, we introduce a new research paper and the approach that we are experimenting with within the context of Long-tail SEO.
One of the goals that we have for WordLift NG is to create the technology required for helping editors go after long-tail search intents. Long-tail queries are search terms that tend to have lower search volume and competition rate, as well as a higher conversion rate. Let me give you an example: “ski touring” is a query that we can intercept with a page like this one (or with a similar page). Our goal is twofold:
helping the team at SalzburgerLand Tourismus (the use-case partner of our project) expand on their existing positioning on Google by supporting them in finding long-tail queries;
helping them enrich their existing website with content that matches that long-tail query and that can rank on Google.
In order to facilitate the creation of new content we proceed as follows:
analyze the entities behind the top results that Google proposes (in a given country and language) for a given query.
find a match with similar entities on the local KG of the client.
To achieve the first objective WordLift has created an API (called long-tail) that will analyze the top results and extract a short summary as well as the main entities behind each of the first results.
Now given a query entity in one KG (let’s say DBpedia), we intend to propose an approach to find the most similar entity in another KG (the graph created by WordLift on the client’s website) as illustrated in Figure 1.
The main idea is to leverage graph embedding, clustering, regression and sentence embedding as shown in Figure 2.
In our proposed approach, the RDF2Vec technique is used to generate vector representations of all entities of the second KG, and the vectors are then clustered by cosine similarity using the K-medoids algorithm. Next, an artificial neural network with a multilayer perceptron topology serves as a regression model that predicts the corresponding vector in the second knowledge graph for a given vector from the first knowledge graph. After determining the cluster of the predicted vector, the entities of that cluster are ranked with the Sentence-BERT method, and the entity with the highest rank is chosen as the most similar one. If you are interested in our work, we strongly recommend reading the published paper.
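The last two steps (picking a cluster for the predicted vector, then ranking the entities inside it) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation from the paper: the 2-d vectors stand in for RDF2Vec embeddings, the label vectors stand in for Sentence-BERT encodings, and all names and values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar_entity(predicted, medoids, clusters, label_vectors, query_label_vector):
    """Pick the cluster whose medoid is closest to the predicted vector,
    then rank that cluster's entities against the query label."""
    best_cluster = max(medoids, key=lambda c: cosine(predicted, medoids[c]))
    ranked = sorted(
        clusters[best_cluster],
        key=lambda e: cosine(query_label_vector, label_vectors[e]),
        reverse=True,
    )
    return best_cluster, ranked

# Toy data (hypothetical): medoids and members of two clusters in the local KG.
medoids = {"sports": [1.0, 0.1], "places": [0.1, 1.0]}
clusters = {"sports": ["ski_touring", "hiking"], "places": ["salzburg"]}
label_vectors = {"ski_touring": [0.9, 0.2], "hiking": [0.3, 0.8], "salzburg": [0.0, 1.0]}

predicted = [0.8, 0.3]           # assumed output of the regression model
query_label_vector = [1.0, 0.1]  # assumed encoding of the query entity's label

cluster, ranked = most_similar_entity(
    predicted, medoids, clusters, label_vectors, query_label_vector
)
print(cluster, ranked[0])
```

Restricting the ranking to a single cluster keeps the more expensive sentence comparison limited to a handful of candidates instead of the whole graph.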
Conclusions and future work
To sum up, the proposed approach for finding the entity in a local KG that is most similar to a given entity from another KG includes four steps: graph embedding, clustering, regression and ranking. To evaluate the approach, the DBpedia and SalzburgerLand KGs have been used, and the available entity pairs that have the same relation have served as training data for the regression models. The mean absolute error (MAE), R squared (R2) and root mean square error (RMSE) have been applied to measure the performance of the regression model. In the next step, we will show how the proposed approach leads to enriching the SalzburgerLand website when it takes the main entities from the long-tail API and finds the most similar entities in the SalzburgerLand KG.
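For readers less familiar with the three metrics, here is a minimal sketch of how they are computed. It assumes scalar targets for brevity (in the approach above the regression targets are vectors, so the metrics would be computed over their components), and the function name is ours.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n             # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean square error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(mae, rmse, r2)
```

Lower MAE and RMSE are better, while R2 approaches 1 as the predictions explain more of the variance in the targets.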
Reference to the paper with full details: Aghaei, S., Fensel, A. “Finding Similar Entities Across Knowledge Graphs”, in Proceedings of the 7th International Conference on Advances in Computer Science and Information Technology (ACSTY 2021), Volume Editors: David C. Wyld, Dhinaharan Nagamalai, March 20-21, 2021, Vienna, Austria.
If you are a developer, you probably have already worked with or heard about SEO (Search Engine Optimization). Nowadays, when optimizing websites for search engines, the focus is on annotating websites’ content so that search engines can easily extract and “understand” the content. Annotating, in this case, is the representation of information presented on a website in a machine-understandable way by using a specific predefined structure. Notably, the structure must be understood by the search engines. Therefore, in 2011 the four most prominent search engine providers Google, Microsoft, Yahoo!, and Yandex, founded Schema.org. Schema.org provides patterns for the information you might want to annotate on your websites, including some examples. Those examples give web developers an idea of how to make the information on their website understandable by search engines.
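To make the idea of an annotation more concrete, here is a small sketch that builds a Schema.org annotation as JSON-LD and wraps it in the script tag that search engines read. The choice of the "Place" type and the "containedInPlace" property for a hiking trail is our assumption for illustration; a real annotation would pick the most specific type that fits.

```python
import json

# A hiking trail described with Schema.org vocabulary (type choice is an assumption).
trail = {
    "@context": "https://schema.org",
    "@type": "Place",
    "name": "Auf dem Jakobsweg",
    "containedInPlace": {
        "@type": "Place",
        "name": "Salzburger Saalachtal",
    },
}

# Serialize to JSON-LD and embed it the way annotations are placed in a page.
json_ld = json.dumps(trail, indent=2, ensure_ascii=False)
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```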
Besides using the websites’ annotations to provide more precise results to the users, search engines use them to build so-called Knowledge Graphs. Knowledge Graphs are huge semantic nets describing “things” and their connections between each other.
Consider three “things”, i.e. three hiking trails “Auf dem Jakobsweg”, “Lofer – Auer Wiesen – Maybergklamm” and “Wandergolfrunde St. Martin” which are located in the region “Salzburger Saalachtal” (another “thing”). “Salzburger Saalachtal” is located in the state “Salzburg,” which is part of “Austria.” If we drew those connections on a sheet, we would end up with something that looks like the following.
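The same connections can also be written down as simple subject, predicate, object triples. The sketch below is only an illustration: the predicate names "locatedIn" and "partOf" are made up for this example and are not taken from a real vocabulary.

```python
# Each fact is a (subject, predicate, object) triple.
TRIPLES = [
    ("Auf dem Jakobsweg", "locatedIn", "Salzburger Saalachtal"),
    ("Lofer – Auer Wiesen – Maybergklamm", "locatedIn", "Salzburger Saalachtal"),
    ("Wandergolfrunde St. Martin", "locatedIn", "Salzburger Saalachtal"),
    ("Salzburger Saalachtal", "locatedIn", "Salzburg"),
    ("Salzburg", "partOf", "Austria"),
]

def containing_places(thing, triples=TRIPLES):
    """Follow locatedIn/partOf edges upward and list every enclosing place in order."""
    parents = {s: o for s, p, o in triples if p in ("locatedIn", "partOf")}
    chain = []
    seen = {thing}
    while thing in parents and parents[thing] not in seen:
        thing = parents[thing]
        seen.add(thing)
        chain.append(thing)
    return chain

print(containing_places("Auf dem Jakobsweg"))
```

Even though no triple states it directly, following the edges tells us that the trail lies in Austria, which is the kind of implicit knowledge a graph makes easy to derive.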
This is just a small extract of a Knowledge Graph, but it shows pretty well how things are connected with each other. Search engine providers collect data from a vast number of websites and connect the data with each other. Search engine providers are not the only ones doing so: more and more companies are building Knowledge Graphs. You can also build a Knowledge Graph based on your annotations, as they are a good starting point. Now you might think that the amount of data is not sufficient for a Knowledge Graph. It is essential to mention that you can connect your data with other data sources, i.e., link your data or import data from external sources. There exists a vast Linked Open Data Cloud providing linked data sets of different categories. Linked in this case means that the different data sets are connected via certain relationships. Open implies that everyone can use the data and import it into their own Knowledge Graph.
An excellent use case for including data from the Linked Open Data Cloud is to integrate geodata. For example, as mentioned earlier, the Knowledge Graph should be built based on the annotations of hiking trails. Still, you don’t have concrete data on the cities, regions, and countries. Then, you could integrate geodata from the Linked Open Data Cloud, providing detailed information on cities, regions, and countries.
Over time, your Knowledge Graph will grow and become quite huge and even more powerful due to all the connections between the different “things.”
Sounds great, but how can I use the data in the Knowledge Graph?
Unfortunately, this is where a huge problem arises. For querying the Knowledge Graph, it is necessary to write so-called SPARQL queries, a standard for querying Knowledge Graphs. SPARQL has a steep learning curve and is challenging to use if you are not familiar with the syntax, especially if you are not into the particular area of Semantic Web Technologies. In that case, you may not want to learn such a complex query language that is not used anywhere else in your daily developer life. However, SPARQL is necessary for publishing and accessing Linked Data on the Web. But there is hope. We would not write this blog post if we did not have a solution to overcome this gap. We want to give you the possibility, on the one hand, to use the strength of Knowledge Graphs for storing and linking your data, including the integration of external data, and on the other hand, to use a simple query language for accessing the “knowledge” stored. The “knowledge” can then be used to power different kinds of applications, e.g., intelligent personal assistants. Now you have been tortured long enough. We will describe a simple middleware that allows you to query Knowledge Graphs by using the simple syntax of GraphQL queries.
What is GraphQL?
GraphQL is an open standard published in 2015, initially invented by Facebook. Its primary purpose is to be a flexible and developer-friendly alternative to REST APIs. Before GraphQL, developers had to use API results as predefined by the API provider even if only one value was required by the user of the API. GraphQL allows specifying a query in a way that only the relevant data is fetched. Additionally, the JSON-like syntax of GraphQL makes it easy to use. Nearly every programming language has a JSON parser, and developers are familiar with representing data using JSON syntax. The simplicity and ease of use also attracted interest in the Semantic Web Community as an alternative for querying RDF data. Providers of graph databases (used to store Knowledge Graphs) like Ontotext (GraphDB) and Stardog introduced GraphQL as an alternative query language for their databases. Unfortunately, those databases cannot be exchanged easily due to the different kinds of GraphQL schemas they require. The GraphQL schema defines which information can be queried. Each of the database providers has its own way of providing this schema.
Additionally, the syntax of the GraphQL queries supported by the database providers differs due to special optimizations and extensions. Another problem is that there are still many services available on the Web that are only accessible via SPARQL. How can we overcome all this hassle and reach a simple solution applicable to arbitrary SPARQL endpoints?
All those problems led to the conceptualization and implementation of GraphSPARQL, a middleware that transforms GraphQL queries into SPARQL queries. As part of the R&D work that we are doing in the context of the EU co-funded project WordLift Next Generation, three students from the University of Innsbruck developed GraphSPARQL in the course of a Semantic Web Seminar.
Let us consider the example of a query that results in a list of persons’ names to illustrate the functionality of GraphSPARQL. First, the user needs to provide an Enriched GraphQL Schema, in principle defining the information that should be queryable by GraphSPARQL. This schema is essential for the mapping between the GraphQL query and the SPARQL query.
The following figure shows the process of an incoming query and transforming it to a SPARQL query. If you want to query for persons with their names, the GraphQL query shown on the left side of the figure will be used. This query is processed inside GraphSPARQL by a so-called Parser. The Parser uses the predefined schema to transform the GraphQL query into the SPARQL query. This SPARQL query is then processed by the Query Processor. It handles the connection to the Knowledge Graph. On the right side of the figure, you see the SPARQL query generated based on the GraphQL query. It is pretty confusing compared to the simple GraphQL query. Therefore, we want to hide those queries with our middleware.
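To give a feeling for what such a transformation involves, here is a deliberately tiny sketch. It is not GraphSPARQL's actual parser or schema format: the SCHEMA dictionary, the function name, and the choice of schema.org IRIs are all assumptions made for illustration, and the GraphQL query is passed in already broken down into a type name and a list of fields.

```python
# Hypothetical stand-in for an enriched schema: maps GraphQL names to IRIs.
SCHEMA = {
    "person": {
        "iri": "http://schema.org/Person",
        "fields": {"name": "http://schema.org/name"},
    }
}

def graphql_to_sparql(type_name, fields, schema=SCHEMA):
    """Build a SPARQL SELECT query for a selection like `{ person { name } }`."""
    mapping = schema[type_name]
    subject = f"?{type_name}"
    variables = " ".join([subject] + [f"?{field}" for field in fields])
    lines = [
        f"SELECT {variables} WHERE {{",
        f"  {subject} a <{mapping['iri']}> .",  # restrict to the requested type
    ]
    for field in fields:
        # One triple pattern per requested field, using the IRI from the schema.
        lines.append(f"  {subject} <{mapping['fields'][field]}> ?{field} .")
    lines.append("}")
    return "\n".join(lines)

# Corresponds to the GraphQL query: { person { name } }
sparql = graphql_to_sparql("person", ["name"])
print(sparql)
```

Even in this toy form, the schema does the heavy lifting: without the mapping from short field names to full IRIs, the short GraphQL query could not be turned into a valid SPARQL query.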
As a result of the SPARQL query, the Knowledge Graph responds with something that seems quite cryptic, if you are not familiar with the syntax. You can see an example SPARQL response on the following figure’s right side. This cryptic response is returned to the Parser by the Query Processor. The Parser then, again with the help of the schema, transforms the response into a nice-looking GraphQL response. The result is a JSON containing the result of the initial query.
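The way back can be sketched in the same spirit. The input below follows the standard SPARQL JSON results layout ("head" with variable names, "results" with "bindings"); the sample values, the function name, and the shape of the output are our assumptions and not GraphSPARQL's actual code.

```python
import json

# A response in the standard SPARQL JSON results format (sample values are made up).
sparql_response = {
    "head": {"vars": ["person", "name"]},
    "results": {"bindings": [
        {"person": {"type": "uri", "value": "http://example.org/alice"},
         "name": {"type": "literal", "value": "Alice"}},
        {"person": {"type": "uri", "value": "http://example.org/bob"},
         "name": {"type": "literal", "value": "Bob"}},
    ]},
}

def to_graphql_response(response, type_name, fields):
    """Keep only the requested fields and nest them under `data`, GraphQL style."""
    items = []
    for binding in response["results"]["bindings"]:
        items.append({f: binding[f]["value"] for f in fields if f in binding})
    return {"data": {type_name: items}}

graphql_response = to_graphql_response(sparql_response, "person", ["name"])
print(json.dumps(graphql_response, indent=2))
```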
GraphSPARQL provides you easy access to the information stored in a Knowledge Graph using the simple GraphQL query language.
You have a Knowledge Graph stored in a graph database that is accessible via SPARQL endpoint only? Then GraphSPARQL is the perfect solution for you. Before you can start, you need to follow two configuration steps:
Provide the so-called Enriched GraphQL Schema. This schema can either be created automatically from a given ontology (schema.org, for example, provides its ontology as a download) or be defined manually. An example of both cases can be found in the example folder on the GraphSPARQL GitHub page: automatic creation of a schema based on the DBpedia ontology, and a manually defined schema.
Define the SPARQL endpoint GraphSPARQL should connect to. This can be done in the configuration file (see “config.json” in the example folder).
Have you done both preparation steps? Perfect, now you are ready to use GraphSPARQL on the endpoint you defined. Check the end of the blog post if you are interested in a concrete example.
What are the benefits of GraphSPARQL?
Benefit from Knowledge Graphs by using a simple query language
Simple JSON syntax for defining queries
Parser support for the JSON syntax of GraphQL queries in nearly all programming languages
GraphQL query structure represents the structure of the expected result
Restrict data access via the provided GraphQL schema
GraphSPARQL as middleware allows querying SPARQL endpoints using GraphQL as a simple query language and is an important step to open Semantic Web Technologies to a broader audience.
Docker container to test GraphSPARQL:
Two options to start the docker container are supported so far:
Use predefined configuration for DBPedia: start the GraphSPARQL docker container
If you work in SEO, you have been reading about Google and Bing becoming semantic search engines. But what does Semantic Search really mean for users, and how do things work under the hood?
Semantic Search helps you surface the most relevant results for your users based on search intent and not just keywords.
Semantic (or Neural) Search uses state-of-the-art deep learning models to provide contextual and relevant results to user queries. When we use semantic search, we can better understand the intent behind our customers' queries and provide significantly improved search results that can drive deeper customer engagement. This can be essential in many different sectors but, here at WordLift, we are particularly interested in applying these technologies to travel brands, e-commerce and online publishers.
Information is often unstructured and scattered across different silos. With semantic search, our goal is to use machine learning techniques to make sense of content and to create a context. When moving from syntax (for example how often a term appears on a webpage) to semantics, we have to create a layer of metadata that can help machines grasp the concepts behind each word. Google defines this ability to connect words to concepts as “Neural Matching” or *super synonyms* that help better match user queries with web content. Technically speaking, this is achieved by using neural embeddings that transform words (or other types of content like images, video or audio clips) into fuzzier representations of the underlying concepts.
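The difference between matching on syntax and matching on concepts can be shown with a toy example. The three-dimensional vectors below are hand-made stand-ins for neural embeddings (a real model would produce hundreds of dimensions), so treat this as a sketch of the idea rather than of any real search engine.

```python
import math

def keyword_overlap(query, text):
    """Share of query words that literally appear in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hand-made embeddings (hypothetical values): similar concepts get similar vectors.
EMBEDDINGS = {
    "cheap flights": [0.9, 0.1, 0.0],
    "low-cost airfare": [0.85, 0.2, 0.05],
    "mountain hiking trails": [0.0, 0.1, 0.95],
}

query = "cheap flights"
documents = ["low-cost airfare", "mountain hiking trails"]

# Pick the document whose embedding is closest to the query embedding.
best = max(documents, key=lambda d: cosine(EMBEDDINGS[query], EMBEDDINGS[d]))
print(keyword_overlap(query, best), best)
```

The query and the best document share no words at all, yet their vectors point in almost the same direction, which is what lets a semantic search engine treat them as "super synonyms".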
As part of the R&D work that we’re doing in the context of the EU co-funded project WordLift Next Generation, I have built the prototype using an emerging open-source framework called Jina AI and the beautiful photographic material published by Salzburgerland Tourismus (also a partner in the Eurostars research project) and Österreich Werbung 🇦🇹 (Austrian National Tourist Office).
I have created this first prototype:
☝️ to understand how modern search engines work;
✌️ to re-use the same #SEO data that @wordliftit publishes as structured *linked* data for internal search.
With semantic search, these capabilities are combined to let users find exactly what they need naturally.
In Jina, Flows are high-level concepts that define a sequence of steps to accomplish a task. Indexing and querying are two separate Flows; inside each flow, we run parallel Pods to analyze the content. A Pod is a basic processing unit in a Flow that can run as a dockerized application.
This is strategic as it allows us to distribute the load efficiently. In this demo, Pods are programmed to create neural embeddings: one Pod to process text and one to process images. Pods can also run in parallel and the results (embeddings from the caption and embeddings from the image) are combined into one single document.
This ability to work with different content types is called multi-modality.
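The pattern of two encoders running side by side and feeding one combined document can be sketched without any framework. The code below does not use Jina's API: the two encoder functions are trivial stand-ins for neural models, and threads stand in for Pods, purely to illustrate the flow.

```python
from concurrent.futures import ThreadPoolExecutor

def encode_text(caption):
    """Stand-in text encoder: character count and word count as a tiny 'embedding'."""
    return [float(len(caption)), float(caption.count(" ") + 1)]

def encode_image(pixels):
    """Stand-in image encoder: mean and maximum brightness as a tiny 'embedding'."""
    return [sum(pixels) / len(pixels), float(max(pixels))]

def index_document(caption, pixels):
    """Run both encoders in parallel and merge their outputs into one document."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(encode_text, caption)
        image_future = pool.submit(encode_image, pixels)
        return {
            "caption": caption,
            "text_embedding": text_future.result(),
            "image_embedding": image_future.result(),
        }

doc = index_document("Lake in Salzburg", [10, 20, 30])
print(doc)
```

Because each document carries both embeddings, it can later be matched by a text query or by an image query.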
The user uses a text in the query to retrieve an image or vice-versa; the user uses an image, in the query, to retrieve its description.
See in the example below; I make a search using natural language at the beginning and right after, I send an image (from the results of the first search) as query to find its description 👇
Are you ready to innovate your content marketing strategy with AI? Let’s talk!
What is Jina AI?
Han Xiao, Jina AI’s CEO, calls Jina the “TensorFlow” for search 🤩. Besides the fact that I love this definition, Jina is completely open source and designed to help you build neural (or semantic) search on the cloud. Believe me, it is truly impressive. To learn more about Jina, watch Han’s latest video on YouTube, “Jina 101: Basic concepts in Jina”.
How can we optimize content for Semantic Search?
Here is what I learned from this experiment:
When creating content, we shall focus on concepts (also referred to as entities) and search intents rather than keywords. An entity is a broader concept that groups different queries. The search intent (or user intent) is the user’s goal when making the query to the search engine. This intent can be expressed using different queries. The search engines interpret and disambiguate the meaning behind these queries by using the metadata that we provide.
Information Architecture shall be designed once we understand the search intent. We are used to thinking in terms of 1 page = 1 keyword, but in reality, as we transition from keywords to entities (or concepts), we can cover the same topic across multiple documents. After crawling the pages, the search engine will work with a holistic representation of our content even when it has been written across various pages (or even different media types).
Adding structured data for text, images, and videos adds precious data points that will be taken into account by the search engine. The more we provide high-quality metadata, the more we help the semantic search engine improve the matching between content and user intent.
Becoming an entity in Google’s Knowledge Graph also greatly helps Google understand who we are and what we write about. It can have an immediate impact across multiple queries that refer to the entity. Read this post to learn more about how to create an entity in Google’s graph.
Working with Semantic Search Engines like Google and Bing requires an update of your content strategy and a deep understanding of the principles of Semantic SEO and machine learning.
In the last two decades, text summarization has played an essential role in search engine optimization (SEO). There are, indeed, a lot of different marketing techniques that require a summary of the content, and that can improve ranking in search engines. Meta descriptions are probably among the most notable examples (here is a video tutorial that Andrea did on generating meta descriptions).
These text snippets provide search engines with a brief description of the page content and are still an important ranking factor and one of the most common use cases for text summarization.
Thanks to the latest NLP technologies, SEO specialists can finally summarize the content of entire webpages using algorithms that craft easy-to-read summaries.
In this article, we will discuss the importance of using text summarization in the context of SEO and digital marketing.
Summaries help create and structure the metadata that describes web pages. Text summarization also comes in handy when we want to add descriptive text to category pages for magazines and eCommerce websites or when we need to prepare the copy for promoting our latest article to Facebook or Twitter. Much like search engines use meta descriptions, social networks rely on their meta descriptors like the Facebook Open Graph meta tag (a.k.a. OG tag) to distribute content to their users. Facebook for instance, uses the summary provided in OG tags to create the card that promotes a blog post on mobile and desktop devices.
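As a small illustration of how a summary ends up in these tags, the sketch below renders Open Graph meta tags from a title, a summary and a URL. The function name and the 200-character cut-off are our assumptions for the example, not a limit documented by Facebook.

```python
from html import escape

def og_tags(title, summary, url, max_len=200):
    """Render Open Graph meta tags, truncating long summaries (max_len is an assumption)."""
    if len(summary) <= max_len:
        description = summary
    else:
        description = summary[: max_len - 1].rstrip() + "…"
    props = {"og:title": title, "og:description": description, "og:url": url}
    # Escape values so quotes or angle brackets cannot break the HTML attribute.
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}" />'
        for name, value in props.items()
    )

tags = og_tags('A "quoted" title', "Short summary.", "https://example.org/post")
print(tags)
```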
Extractive vs Abstractive
There are many different text summarization approaches, and they vary depending on the number of input documents, purpose, and output. But, they all fall into one of two categories: extractive or abstractive.
Extractive summarization algorithms identify the essential sections of a text and reproduce them verbatim, so the summary is a subset of the sentences from the original input.
Extractive summaries are reliable because they will not change the meaning of any sentence. They are generally easier to program. The logic is simple: in the most straightforward implementations, the most common words in the source text are taken to represent the main topic. Using today’s pre-trained Transformer models with their ground-breaking performance, we can achieve excellent results with the extractive approach.
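The straightforward word-frequency variant mentioned above fits in a few lines. This is a minimal sketch of that classic baseline, not the BERT-based approach: the stopword list is a tiny illustrative sample and the sentence splitting is deliberately naive.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are",
             "it", "that", "for", "on", "with"}

def tokenize(text):
    """Lowercase words without stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, verbatim and in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(top))

sample = ("Summaries help SEO. Summaries describe pages for search engines. "
          "Cats sleep a lot.")
print(extractive_summary(sample, 1))
```

Every sentence in the output is copied from the input unchanged, which is exactly why extractive summaries cannot distort the meaning of a sentence but also cannot produce unique text.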
In WordLift, for instance, BERT is used to boost the performance of extractive summarization across different languages. Here is the summary that WordLift creates for me for this article that you are reading.
In the last two decades, text summarization has played an essential role in search engine optimization (SEO). While our existing BERT-based summarization API performs well in German, we wanted to create unique content instead of only shrinking the existing text.
It is quite useful in summarizing, using the most important sentences, the content that I am writing here, and it is formally correct (or at least as valid as my writing) but not unique.
Abstractive Text Summarization
Abstractive methodologies summarize texts differently, using deep neural networks to interpret, examine, and generate new content (summary), including essential concepts from the source.
Abstractive approaches are more complicated: you will need to train a neural network that understands the content and rewrites it.
In general, training a language model to build abstract summaries is challenging and tends to be more complicated than using the extractive approach. Abstractive summarization might fail to preserve the meaning of the original text and generalizes less than extractive summarization.
As humans, when we try to summarize a lengthy document, we first read it entirely and very carefully to develop a better understanding; then we write highlights for its main points. As computers lack most of our human knowledge, they still struggle to generate summaries of human-level quality.
Moreover, using abstractive approaches also poses the challenge of supporting multilingualism. The model needs to be trained for each language separately.
Training our new model for German using Google’s T5
As part of the WordLift NG project and, on behalf of one of our German-speaking clients, we ventured into creating a new pre-trained language model for automatic text summarization in German. While our existing BERT-based summarization API performs well in German, we wanted to create unique content instead of only shrinking the existing text.
This new language model is revolutionary as we can re-use the same model for different NLP tasks, including summarization. T5 is also language-agnostic, and we can use it with any language.
We successfully trained the new summarizer on a dataset of 100,000 texts together with reference summaries extracted from the German Wikipedia. Here is a result where we see the input summary that was provided along with the full text of the Wikipedia page on Leopold Wilhelm and the predicted summary generated by the model.
Conclusions and future work
We are very excited about this new line of work and we will continue experimenting with new ways to help editors, SEOs and website owners improve their publishing workflows with the help of NLP, knowledge graphs and deep learning!
WordLift provides websites with all the AI you need to grow your traffic — whether you want to increase leads, accelerate sales on your e-commerce or build a powerful website. Let’s talk with our expert to find out more!
Providing your website with a structured content model is not only the best solution to better organize your content, but also a powerful strategy to improve the SEO of your website and increase organic traffic.
In this article, we’ll explain how the WordLift entity-based model, coupled with the new feature WL Mappings, will allow you to add a more specific markup to your content and to obtain a Knowledge Graph capable of communicating to search engines in a more effective way.
Why is the content model becoming a key tool for SEO?
During his webinar on content modeling for SEO in the WordLift Academy, Cruce Saunders highlights some of the main features that make the content model an indispensable tool for managing and enhancing online content.
In fact, the content model:
Specifies how information is organized on your website
Makes content more visible to search engines
Allows you to reuse content through different channels
A structured content model, in short, not only allows you to better organize data and information but also to do it within a malleable structure capable of communicating:
to search engines through the use of structured data
to users through the enhancement of the user experience and the possibility of reusing the content by presenting it in different layouts, both on the site and in the SERP, to respond to specific search intents (the same article, for example, may appear in the form of a snippet, a blue link, a promotion, etc.)
Structuring your content model means creating a three-dimensional identity capable of highlighting your content and the relationships underlying it. This allows search engines to recognize you among hundreds of pieces of information, making you more visible to users who correspond to the search intent related to your business.
The more the content model is rich in structured data, the more chances you’ll have to meet exactly the users interested in you. That’s why we created WordLift Mappings, a new feature that allows you to select the information and connections that are truly relevant for your business and to create an increasingly specific and refined Knowledge Graph to highlight only the most relevant facets that make your identity more authoritative.
Through our entity-based model and WordLift Mappings, your content model becomes a powerful SEO weapon and a valuable resource to increase the value of your online data.
WordLift Mappings helps you create a custom Knowledge Graph and increase your online authority
WordLift creates a personalized and highly performing Knowledge Graph through the schema.org markup and the creation of a customized entity-based vocabulary containing the most relevant data to help Google better understand your online content.
Remember that Google uses entities to satisfy users’ search intent and allow them to find the best results. For this reason, an increasingly refined entity-based model such as WordLift is key to increasing the visibility of your content for search engines.
WordLift Mappings increases the accuracy of this process and allows you to take greater care of your content and your Knowledge Graph.
By connecting to ACF (Advanced Custom Fields), a WordPress plugin that allows you to create advanced fields to specify the attributes that characterize your content, WordLift Mappings allows you to structure your data starting from the fields that you have already configured with ACF or from new fields based on the schema.org taxonomy.
This means having more and more structured content, which can be used to add relevant details to your Knowledge Graph and shaped in different configurations to improve the user experience.
In this webinar, Andrea Volpini and Jason Barnard explain how they used WordLift Mappings to improve Jason’s content model and Knowledge Graph. Jason shows how he obtains a Knowledge Graph in which only the most relevant data are structured, creating an authoritative and coherent Brand SERP that stands out in the search results.
With over 100 podcasts made in collaboration with some of the greatest SEO experts on the planet, Kalicube – Jason Barnard‘s website – has an enviable wealth of content, relevant to the entire digital marketing sector. Thanks to WordLift Mappings, we helped Jason structure this content by following a model that focuses on content, events and people. Thus, each guest of Jason’s podcast has his own page connected with the podcasts in which he participated and with the events in which the podcasts were recorded.
In this way, the architecture of the Knowledge Graph is customized on the basis of the content model and the “network” of the links between the contents becomes the bearer of meanings and allows you to predict further connections. Below you can see the entity-based content model applied to Kalicube through WordLift Mappings.
What can you realistically expect in terms of traffic? How long does it take?
In recent weeks, our SEO team has implemented a new content model on the site of an American customer who deals with the dismantling and re-evaluation of used hardware on a large scale.
The results? After the first week, traffic increased by 14.6% and the growth curve does not seem to stop. To analyze the impact while isolating other factors that may influence the SEO of the site, we developed a predictive model based on Bayesian networks. By analyzing the traffic in the month preceding the introduction of WordLift Mappings, it allowed us to isolate the benefit net of other ranking factors (this is called causal inference analysis).
Here we see the real clicks in black and the traffic we would have had in blue (that is, the traffic predicted by the mathematical model), then in the following chart the difference between the real traffic and the estimated traffic, and finally the delta of increase. In this way, we can be confident that the increase we analyzed is statistically relevant and related to the introduction of the new content model.
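The logic of comparing real traffic with predicted traffic can be illustrated with a much simpler stand-in. Instead of the Bayesian model we used, the sketch below takes the average of the period before the change as the prediction for the period after it; the numbers are made up and this naive baseline ignores trends and seasonality, so it only shows how the delta is derived.

```python
def estimate_lift(pre_clicks, post_clicks):
    """Compare actual clicks after a change with a naive constant-baseline prediction."""
    baseline = sum(pre_clicks) / len(pre_clicks)   # predicted clicks per day without the change
    predicted_total = baseline * len(post_clicks)  # counterfactual traffic for the post period
    actual_total = sum(post_clicks)
    delta = actual_total - predicted_total         # clicks attributed to the change
    return {
        "predicted": predicted_total,
        "actual": actual_total,
        "delta": delta,
        "relative_lift": delta / predicted_total,
    }

# Made-up daily clicks before and after the change.
result = estimate_lift([100, 100, 100, 100], [110, 120])
print(result)
```

A proper causal analysis replaces the constant baseline with a model fitted on the pre-period and reports an uncertainty interval around the delta, which is what tells us whether the increase is more than noise.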
In summary, WordLift Mappings allows you to:
Build a Custom Knowledge Graph based on your content model
Improve the SEO of your website through structured data
Shape the structure of your content to improve the user experience
Reuse chunks of content through different configurations to respond to different search intents
Enhance any type of content composed of reusable elements (articles, courses, events, How-Tos etc.)
The implementation of a custom Knowledge Graph through WordLift Mappings has a positive and measurable impact on traffic.