Diana Maynard

Natural Language Processing for the Semantic Web



of the above, such as authors, music bands, football teams, TV programs, and so on. NERC is the starting point for many more complex applications and tasks such as ontology building, relation extraction, question answering, information extraction, information retrieval, machine translation, and semantic annotation. With the advent of open information extraction scenarios focusing on the whole of the web, analysis of social media where new entities emerge constantly, and named entity linking tasks, the range of entities extracted has widened dramatically, which has brought many new challenges (see for example Section 4.4, where the role of knowledge bases for Named Entity Linking is discussed). Furthermore, the standard kind of 5- or 7-class entity recognition problem is now often less useful, which in turn means that new paradigms are required. In some cases, such as the recognition of Twitter user names, the distinction between traditional classes, such as Organization and Location, has become blurred even for a human, and is no longer always useful (see Chapter 8).

      Defining what exactly should constitute each entity type is never easy, and guidelines differ according to the task. Traditionally, people have used the standard guidelines from the evaluations, such as MUC and CoNLL, since these allow methods and tools to be compared with each other easily. However, as tools have been used for practical purposes in real scenarios, and as the types of named entities have consequently changed and evolved, the ways in which entities are defined have also had to be adapted for the task. Of course, this now makes comparison and performance evaluation more difficult. The ACE evaluation [27], in particular, attempted to solve some of the problems caused by metonymy, where an entity which theoretically depicts one type (e.g., Organization) is used figuratively. Sports teams are an example of this, where we might use the location England or Liverpool to mean their football team (e.g., England won the World Cup in 1966). Similarly, locations such as The White House or 10 Downing Street can be used to refer to the organization housed there (e.g., The White House announced climate pledges from 81 countries). Other decisions involve determining, for example, whether the category Person should include characters such as God or Santa Claus and, if so, whether they should be included in all situations, such as when God and Jesus are used as part of profanities.

      As mentioned above, the first major evaluation series for NERC was MUC, which first addressed the named entity challenge in 1996. The aim of this was to recognize named entities in newswire text, and it led not only to system development but also to the first real production of gold standard NE-annotated corpora for training and testing. This was followed in 2003 by CoNLL [28], another major evaluation campaign, providing gold standard data for newswire not only in English but also in Spanish, Dutch, and German. The corpus produced for this evaluation effort is now one of the most popular gold standards for NERC, with NERC software releases typically quoting performance on it.

      Other evaluation campaigns later started to address NERC for genres other than newswire, specifically ACE [27] and OntoNotes [29], and introduced new kinds of named entities. Both of those corpora contain subcorpora with the genres newswire, broadcast news, broadcast conversation, weblogs, and conversational telephone speech. ACE additionally contains a subcorpus with usenet newsgroups, and addressed not only English but also Arabic and Chinese in later editions. Both ACE and OntoNotes also involved tasks such as coreference resolution, relation and event extraction, and word sense disambiguation, allowing researchers to study the interaction between these tasks. These tasks are addressed in Section 3.5 and in Chapters 4 and 5.

      NERC corpora mostly use the traditional entity types, such as Person, Organization, and Location. These types are not motivated by a concrete Semantic Web knowledge base (such as DBpedia, Freebase, or YAGO), but they are very general. This means that when developing NERC approaches on those corpora for Semantic Web purposes, it is relatively easy to build on top of them and to include links to a knowledge base later. For example, NERD [30] uses an OWL ontology containing the set of mappings of all entity categories (e.g., criminal is a sub-class of Person in the NERD ontology).

      One of the main challenges of NERC is to distinguish between named entities and entities. The difference between these two things is that named entities are instances of types (such as Person, Politician) and refer to real-life entities which have a single unique referent, whereas entities are often groups of NEs which do not refer to unique referents in the real world. For example, “Prime Minister” is an entity, but it is not a named entity because it refers to any one of a group of named entities (anyone who has been or currently is a prime minister). It is worth noting though that the distinction can be very difficult to make, even for humans, and annotation guidelines for tasks differ on this.

      Another challenge is to recognize NE boundaries correctly. In Example 3.1, it is important to recognize that Sir is part of the name Sir Robert Walpole. Note that tasks also differ in where they place the boundaries. MUC guidelines specify that a Person entity should include titles; however, other evaluations may define their tasks differently. A good discussion of the issues in designing NERC tasks, and the differences between them, can be found in [31]. The entity definitions and boundaries are thus often not consistent between different corpora. Sometimes, boundary recognition is considered a separate task from detecting the type (Person, Location, etc.) of the named entity. There are several annotation schemes commonly used to mark where NEs begin and end. One of the most popular is the BIO scheme, where B signifies the Beginning of an NE, I signifies that the word is Inside an NE, and O signifies that the word is just a regular word Outside of an NE. Another very popular scheme is BILOU [32], which has the additional labels L (Last word of an NE) and U (Unit, signifying that the word is an entire unit, i.e., a single-word NE).
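
      The two tagging schemes can be illustrated with a minimal sketch. The function name encode, the (start, end, type) span format, and the hyphenated tag style (e.g., B-Person) are assumptions made for this illustration rather than part of any particular corpus or toolkit. The example treats the title Sir as part of the Person entity, following the boundary convention described above.

```python
from typing import List, Tuple

# (start token index, end token index (exclusive), entity type) -- hypothetical format
Span = Tuple[int, int, str]

def encode(tokens: List[str], spans: List[Span], scheme: str = "BIO") -> List[str]:
    """Convert non-overlapping entity spans into per-token BIO or BILOU tags."""
    if scheme not in ("BIO", "BILOU"):
        raise ValueError("scheme must be 'BIO' or 'BILOU'")
    tags = ["O"] * len(tokens)  # every token starts Outside any entity
    for start, end, etype in spans:
        for i in range(start, end):
            if scheme == "BILOU" and end - start == 1:
                prefix = "U"  # single-token entity (Unit)
            elif i == start:
                prefix = "B"  # Beginning of the entity
            elif scheme == "BILOU" and i == end - 1:
                prefix = "L"  # Last token of a multi-token entity
            else:
                prefix = "I"  # Inside the entity
            tags[i] = f"{prefix}-{etype}"
    return tags

tokens = ["Sir", "Robert", "Walpole", "was", "a", "British", "statesman"]
spans = [(0, 3, "Person")]  # only the Person entity is annotated here

bio = encode(tokens, spans, "BIO")
bilou = encode(tokens, spans, "BILOU")
single = encode(["Walpole", "resigned"], [(0, 1, "Person")], "BILOU")

for token, b, u in zip(tokens, bio, bilou):
    print(f"{token:10} {b:10} {u}")
```

      Under BIO, the three name tokens receive B-Person, I-Person, I-Person, whereas BILOU marks the final token as L-Person; a one-word name such as Walpole on its own would be tagged U-Person under BILOU and B-Person under BIO.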

      Example 3.1 Sir Robert Walpole was a British statesman who is generally regarded as the first Prime Minister of Great Britain. Although the exact dates of his dominance are a matter of scholarly debate, 1721–1742 are often used.

      Politician: Government positions held (Officeholder, Office/position/title, From, To)

      Person: Gender

      Sir Robert Walpole: Politician, Person

      Government positions held (Sir Robert Walpole, Prime Minister of Great Britain, 1721, 1742)

      Gender (Sir Robert Walpole, male)
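
      One way to hold the information from Example 3.1 in a program is sketched below. This is only an illustration: the variable names, the dictionary layout, and the helper is_licensed are hypothetical and are not taken from any knowledge base API. The sketch records which relations each type allows, which types the entity has, and checks that each extracted fact uses a relation permitted by at least one of the entity's types.

```python
from typing import Dict, List, Set, Tuple

# Relations that each entity type may carry, as listed in Example 3.1
relations_by_type: Dict[str, Set[str]] = {
    "Politician": {"Government positions held"},
    "Person": {"Gender"},
}

# Types assigned to the recognized named entity
entity_types: Dict[str, Set[str]] = {
    "Sir Robert Walpole": {"Politician", "Person"},
}

# Extracted facts: (relation name, arguments); the first argument is the subject
facts: List[Tuple[str, Tuple[str, ...]]] = [
    ("Government positions held",
     ("Sir Robert Walpole", "Prime Minister of Great Britain", "1721", "1742")),
    ("Gender", ("Sir Robert Walpole", "male")),
]

def is_licensed(relation: str, subject: str) -> bool:
    """Return True if any type of the subject allows the given relation."""
    return any(
        relation in relations_by_type.get(entity_type, set())
        for entity_type in entity_types.get(subject, set())
    )

for relation, arguments in facts:
    print(relation, arguments, is_licensed(relation, arguments[0]))
```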

      Ambiguities are one of the biggest challenges for NERC systems. These can affect the recognition component, the classification component, or both simultaneously. For example, the word May can be a proper noun (a named entity) or an ordinary word that is not an entity (as in the modal verb use you may go). Even as a proper noun, it can fall into various categories: a month of the year, part of a person’s name (either a first name or a surname), or part of an organization name. Very frequent categorization problems occur with the distinction between Person and Organization, since many companies are named after people (e.g., the clothing company Austin Reed). Similarly, many things which may not be named entities, such as names of diseases and laws, are named after people too. While technically one could annotate the person’s name here, it is not usually desirable (we typically do not care about annotating Parkinson as a Person in the term Parkinson’s disease or Pythagoras in Pythagoras’ Theorem).

      Temporal normalization takes the recognition of temporal expressions (NEs classified as Date or Time) a step further, by mapping them onto a standard date and time format. Temporal normalization, and in particular that of relative dates and times, is critical for event recognition tasks. The task is quite easy if a text already refers to time in an absolute way, e.g., “8am.” It becomes more challenging, however, if a text refers to time in a relative way, e.g., “last week.” In this