DBpedia has served as a unified access platform for the data in Wikipedia for over a decade. During that time, DBpedia has established many of the best practices for publishing data on the web. In fact, the project hosted a knowledge graph even before Google popularized the term. For the past 10 years, the team has been “extracting and refining useful information from Wikipedia” and has become expert in that field. However, there was always a motivation to extend this with other data and give users unified access to it. The community, the board, and the DBpedia Association felt an urge to innovate the project. Over the past two years they have been re-envisioning DBpedia’s strategy in a lively discussion, which resulted in a new mission statement: “global and unified access to knowledge graphs”.
Last September, during the SEMANTiCS Conference in Vienna, Andrea Volpini and David Riccitelli had a very interesting meeting with Dr.-Ing. Sebastian Hellmann from the University of Leipzig, who sits on the board of DBpedia. The main topic of that meeting was the DBpedia Databus, since we at WordLift are participating as early adopters. It is a great opportunity to add links from DBpedia to our knowledge graph. On that occasion, Andrea asked Sebastian Hellmann for an interview, and he kindly accepted. These are the questions we asked him.
Sebastian Hellmann is head of the “Knowledge Integration and Language Technologies (KILT)” Competence Center at InfAI. He is also the executive director and a board member of the non-profit DBpedia Association. Additionally, he is a senior member of the “Agile Knowledge Engineering and Semantic Web” (AKSW) research center, which focuses on semantic technology research – often in combination with other areas such as machine learning, databases, and natural language processing. Sebastian is a contributor to various open-source projects and communities such as DBpedia, NLP2RDF, DL-Learner and OWLG, and has been involved in numerous EU research projects.
How are DBpedia and the Databus planning to transform linked data in a networked data economy?
We have published data regularly and have already achieved a high level of connectivity in the data network. Now we are planning a hub where everybody can upload data. In that hub, useful operations such as versioning, cleaning, transformation, mapping, linking, merging, and hosting are done automatically, and the results are then dispersed again through a decentralized network to consumers and applications. Our mission incorporates two major innovations that will have an impact on the data economy.
Providing global access
This part of the mission follows the community’s agreement to include its own data sources, as well as any other source, in the unified access. DBpedia has always accepted contributions in an ad-hoc manner, and now we have established a clear process for outside contributions.
Incorporating “knowledge graphs” into the unified access
That means we will reach out to create an access platform not only for Wikipedia (DBpedia Core) but also for Wikidata, and then for all other knowledge graphs and databases that are available.
The result will be a network of data sources that focuses on the discovery of data and also tackles the heterogeneity (or, in Big Data terms, Variety) of data.
What is DBpedia Databus?
The DBpedia Databus is part of a larger strategy following the mission to provide “Global and Unified Access to knowledge”. The DBpedia Databus is a decentralized data publication, integration, and subscription platform.
- Publication: Free tools enable you to create your own Databus-stop on your web space with standards-compliant metadata and clear provenance (private key signature).
- Integration: DBpedia will aggregate the metadata and index all entities and connect them to clusters.
- Subscription: Metadata about releases can be subscribed to via RSS and SPARQL. Entities are connected to Global DBpedia Identifiers and are discoverable via HTML, Linked Data, SPARQL, DBpedia releases and services.
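The publish-and-subscribe idea in the list above can be illustrated with a minimal Python sketch. It is not the Databus tooling or its metadata format: the `Release` fields, `describe_release`, and `new_releases` are hypothetical names chosen for illustration, and a SHA-256 content digest stands in for the private key signature, which would need a real signing library.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical release metadata record (not the real Databus schema)."""
    publisher: str
    dataset: str
    version: str
    sha256: str  # content digest, a stand-in for a private key signature


def describe_release(publisher: str, dataset: str, version: str, payload: bytes) -> Release:
    """Build metadata for a published file, including a digest of its content."""
    digest = hashlib.sha256(payload).hexdigest()
    return Release(publisher, dataset, version, digest)


def new_releases(feed, seen):
    """Return releases from a feed whose (dataset, version) pair is not in `seen`."""
    return [r for r in feed if (r.dataset, r.version) not in seen]
```

A subscriber would keep the set of `(dataset, version)` pairs it has already processed and call `new_releases` each time it polls the feed, whatever transport (RSS, SPARQL) delivers that feed.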
DBpedia is a giant graph and the result of an amazing community effort – how is the work being organized these days?
DBpedia’s community has two orthogonal, but synergetic motivations:
- Build a public information infrastructure for greater societal value and access to knowledge;
- Business development around this infrastructure to drive growth and quality of data and services in the network.
The main motivation is to finally be able to discover and use data easily. Therefore, we are switching to the Databus platform. In the future, the DBpedia Core releases (extractions from Wikidata and Wikipedia) will be just one of many datasets published via the Databus platform. One of the many innovations here is that DBpedia Core releases will be more frequent and more reliable. Any data provider can benefit from the experience we have gained over the last decade by publishing data the way DBpedia does and connecting better to users.
We’re planning to give our WordLift users the option to join the DBpedia Databus. What are the main benefits of doing so?
The new infrastructure allows third parties to publish data in the same way as DBpedia does. As a data provider, you can submit your data to DBpedia and DBpedia will build an entity index over your data. The main benefit of this index is that your data becomes discoverable. DBpedia acts as a transparent middle-layer. Users can query DBpedia and create a collection of entities they are interested in. For these sets, we will provide links to your data, so that users can access them at the source.
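As a rough illustration of the entity index described above, the following Python sketch maps shared identifiers to the provider URLs where the data lives, then resolves a user's collection of entities to those links. The function names, the identifier strings, and the `(global_id, source_url)` record shape are assumptions made for this example; the actual DBpedia index is not specified here.

```python
from collections import defaultdict


def build_entity_index(records):
    """Map each global identifier to the distinct source URLs that describe it.

    `records` is an iterable of (global_id, source_url) pairs, a hypothetical
    input shape chosen for this sketch.
    """
    index = defaultdict(list)
    for global_id, source_url in records:
        if source_url not in index[global_id]:
            index[global_id].append(source_url)
    return dict(index)


def links_for_collection(index, collection):
    """Return the source links for each entity in a user's collection."""
    return {global_id: index.get(global_id, []) for global_id in collection}
```

The index stores only links, not the provider's data, which matches the idea of DBpedia acting as a middle layer that sends users to the source.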
For data providers our new system has three clear-cut benefits:
- Their data is advertised and receives more attention and traffic redirects;
- Once indexed, DBpedia will be able to send linking updates to data providers, therefore aiding in data integration;
- The links to the data will disseminate in the data network and generate network-wide integration and backlinks.
Publishing data with us means connecting and comparing your data to the network. In the end, DBpedia is the only database you need to connect with in order to get global and unified access to knowledge graphs.
DBpedia and Wikidata both publish entities based on Wikipedia, and both use RDF and the semantic web stack. They fulfill quite different tasks, though. Can you tell us more about how DBpedia is different from Wikidata and how the two will co-evolve in the near future?
As a knowledge engineer, I have learned a lot by analyzing the data acquisition processes of Wikidata. In the beginning, the DBpedia community was quite enthusiastic about submitting DBpedia’s data back to Wikimedia via Wikidata. After trying for several years, we found that it is not easy to contribute data in bulk directly to Wikidata, because the processes are volunteer-driven and allow only small-scale edits or bots. Only a small percentage of Freebase’s data was ingested. They follow a collect-and-copy approach, which ultimately inspired the sync-and-compare approach of the Databus.
Data quality and curation follow the Law of Diminishing Returns in a very unforgiving curve. In my opinion, Wikidata will struggle with this in the future. Doubling the volunteer manpower will improve the quantity and quality of data only by dwindling, marginal percentages. My fellow DBpedians and I have always worked with other people’s data, and we have consulted for hundreds of organizations in small and large projects. The main conclusion here is that we are all in the same boat with the same problem. The Databus allows every organization to act as a node in the data network (Wikidata is also one such node). By improving the accessibility of data, we open the door to fighting the Law of Diminishing Returns. Commercial data providers can sell their data and increase quality with the income; public data curators can sync, reuse and compare data, collaborate on the same data across organizations, and effectively pool manpower.
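To make the idea of comparing data across organizations a little more concrete, here is a small Python sketch. It is only one possible reading of "compare", not the Databus implementation: the `compare_sources` function, the source names, and the `(entity, property)` keyed dictionaries are all assumptions for illustration.

```python
def compare_sources(sources, entity, prop):
    """Collect the value each source holds for one entity property.

    `sources` maps a source name to a dict keyed by (entity, property).
    Returns the per-source values and whether the sources conflict.
    """
    key = (entity, prop)
    values = {name: data[key] for name, data in sources.items() if key in data}
    distinct = set(values.values())
    return {"values": values, "conflict": len(distinct) > 1}
```

A curator could run such a check over the properties shared by several nodes and spend review time only where `conflict` is true, instead of copying one source wholesale.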