Named-entity recognition (NER), also known as entity identification or entity extraction, is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, places, expressions of time, quantities, monetary values, percentages and more.
Most research on NER systems starts with an unannotated block of text, such as “WordLift is a plugin for WordPress”, and extracts all relevant information from it:
- WordLift | schema-org:CreativeWork | http://data.redlink.io/91/be9/entity/wordlift
- Plugin | dbc:Software-add_ons | http://dbpedia.org/page/Plug-in_(computing)
- WordPress | dbc:Content_management_software | http://dbpedia.org/page/WordPress
Let’s go into more detail, as this is one of the key technologies of WordLift:
First and foremost, named-entity recognition (NER) uses a knowledge base (KB) that contains all the known concepts (named entities) that need to be extracted from a block of text.
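The role of the knowledge base can be illustrated with a minimal sketch. It assumes a toy in-memory KB keyed by lowercased surface form and a plain dictionary lookup per token; the `KB` dictionary and the `extract_entities` function are hypothetical names used only for illustration and do not describe how WordLift is implemented. The types and URIs are the ones from the example above.

```python
import re

# Hypothetical toy knowledge base: lowercased surface form -> (type, URI).
# The entries reuse the types and URIs listed in the example above.
KB = {
    "wordlift": ("schema-org:CreativeWork", "http://data.redlink.io/91/be9/entity/wordlift"),
    "plugin": ("dbc:Software-add_ons", "http://dbpedia.org/page/Plug-in_(computing)"),
    "wordpress": ("dbc:Content_management_software", "http://dbpedia.org/page/WordPress"),
}


def extract_entities(text, kb=KB):
    """Return (mention, type, uri) for every token found in the KB, in text order."""
    found = []
    for match in re.finditer(r"\w+", text):
        token = match.group(0)
        entry = kb.get(token.lower())  # case-insensitive lookup
        if entry is not None:
            found.append((token, entry[0], entry[1]))
    return found


for mention, entity_type, uri in extract_entities("WordLift is a plugin for WordPress"):
    print(mention, "|", entity_type, "|", uri)
```

A single-token lookup like this cannot match multi-word names or resolve ambiguous ones, which is why the further steps described in this article are needed.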
WordLift derives semantic information from the user’s content by leveraging freely available datasets such as DBpedia and the user’s local vocabulary.
As new concepts are added to the local vocabulary, WordLift learns the knowledge domain of the user and improves its understanding of the content.
WordLift uses a sophisticated ‘named-entity disambiguation’ (NED) mechanism to correctly link detected locations, companies and people to unique “instances” in the web of data.
During the extraction phase, low-level NLP functions take place, including POS (part-of-speech) tagging, tokenisation, sentence boundary detection, capitalization rules and in-document coreference.
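Three of these low-level functions can be sketched with nothing more than regular expressions. The code below is a deliberately naive illustration, not WordLift's pipeline: the function names are hypothetical, and POS tagging and coreference are left out because they need trained models.

```python
import re


def split_sentences(text):
    """Naive sentence boundary detection: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenisation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def capitalized_tokens(tokens):
    """Capitalization rule: keep tokens that start with an uppercase letter.

    A real system would treat the sentence-initial token with more care,
    since it is capitalized whether or not it is a name.
    """
    return [t for t in tokens if t[0].isupper()]


text = "WordLift is a plugin for WordPress. It uses DBpedia."
for sentence in split_sentences(text):
    tokens = tokenize(sentence)
    print(tokens, "->", capitalized_tokens(tokens))
```

Note that the capitalization rule also returns "It" for the second sentence, which shows why such rules only produce candidates that later steps have to confirm or reject.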
As a result of the extraction, WordLift proposes to the user a set of candidate KB entities for each mention.
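One simple way to picture how candidates for a mention could be ordered before they are shown to the user is to score each candidate by how many of its keywords appear in the surrounding text. This is a sketch under stated assumptions: the keyword-overlap score is a stand-in technique, not WordLift's actual disambiguation method, and the `CANDIDATES` table, its keyword sets and the `propose_candidates` function are hypothetical.

```python
import re

# Hypothetical candidate entities for the mention "Paris". The keyword sets are
# made up for illustration and stand in for whatever context a real KB stores.
CANDIDATES = {
    "paris": [
        {"uri": "http://dbpedia.org/resource/Paris",
         "keywords": {"france", "city", "capital"}},
        {"uri": "http://dbpedia.org/resource/Paris_Hilton",
         "keywords": {"hilton", "hotel", "celebrity"}},
    ],
}


def propose_candidates(mention, text, candidates=CANDIDATES):
    """Return the candidate URIs for a mention, best context match first."""
    context = set(re.findall(r"\w+", text.lower()))
    scored = [
        (len(entry["keywords"] & context), entry["uri"])
        for entry in candidates.get(mention.lower(), [])
    ]
    # Sort by descending overlap; the sort is stable, so ties keep KB order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [uri for _, uri in scored]


print(propose_candidates("Paris", "Paris is the capital of France"))
print(propose_candidates("Paris", "Paris Hilton opened a hotel"))
```

Returning the whole ranked list rather than only the top match mirrors the idea that the user makes the final choice among the proposed entities.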