would not be able to distinguish between these two cases. In this case, a learning-based model might do better than a rule-based approach.
GATE provides both NP and VP chunker implementations. The NP Chunker is a Java implementation of the Ramshaw and Marcus BaseNP chunker [22], which chunks text on the basis of POS tags using transformation-based learning. The output of this version is identical to that of the original C++/Perl version.
The GATE VP chunker is written in JAPE, GATE’s rule-writing language, and is based on grammar rules for English [23, 24]. It contains rules for the identification of non-recursive verb groups, covering finite (is investigating), non-finite (to investigate), participles (investigated), and special verb constructs (is going to investigate). All the forms may include adverbials and negatives. One advantage of this tool is that it explicitly marks negation in verbs (e.g., don’t), which is extremely useful for other tasks such as sentiment analysis. The rules make use of POS tags as well as some specific strings (e.g., the word might is used to identify modals).
OpenNLP’s chunker uses a pre-packaged English maximum entropy model. Unlike GATE, whose two chunkers are independent, it analyses the text one sentence at a time and produces both NP and VP chunks in a single pass, based on the tokens’ POS tags. The OpenNLP chunker is easily retrainable, so it can be adapted to new domains and text types if a suitable pre-annotated corpus is available.
NLTK and Stanford CoreNLP do not provide any chunkers, although they could be created using rules and/or machine learning from the other components (such as POS tags) in the relevant toolkit.
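To make the idea of a rule-based chunker concrete, the following is a minimal sketch (not the implementation of any of the toolkits above) of NP chunking over POS-tagged input. It encodes the tag sequence as a string and uses a regular expression to find spans matching the simple pattern "optional determiner, any adjectives, one or more nouns" — the kind of rule a JAPE grammar or NLTK's RegexpParser would express declaratively. The tag mapping and pattern are illustrative simplifications.

```python
import re

def np_chunk(tagged):
    """Greedy rule-based NP chunking over (token, POS) pairs.

    Applies the toy pattern DT? JJ* NN+ by encoding each POS tag as a
    single character and regex-matching over the resulting string.
    """
    tag_map = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "N"}
    # Encode the POS sequence as a string so a regex can find NP spans.
    tags = "".join(tag_map.get(pos, "O") for _, pos in tagged)
    chunks = []
    for m in re.finditer(r"D?J*N+", tags):
        chunks.append(" ".join(tok for tok, _ in tagged[m.start():m.end()]))
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN")]
print(np_chunk(tagged))  # ['The quick brown fox', 'the lazy dog']
```

As the chapter notes, such hand-written patterns are easy to inspect and modify, but a learning-based chunker can capture distinctions that simple tag patterns miss.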
2.10 SUMMARY
In this chapter we have introduced the idea of an NLP pipeline and described the main components, with reference to some of the widely used open-source toolkits. It is important to note that while performance in these low-level linguistic processing tasks is generally high, the tools do vary in performance, not just in accuracy, but also in the way in which they perform the tasks and their output, due to adhering to different linguistic theories. It is therefore critical when selecting pre-processing tools to understand what is required by other tools downstream in the application. While mixing and matching of some tools is possible (particularly in frameworks such as GATE, which are designed precisely with interoperability in mind), compatibility between different components may be an issue. This is one of the reasons why there are several different toolkits available offering similar but slightly different sets of tools.

On the performance side, it is also important to be aware of the effect of changing domain and text type, and whether the tools are easily modifiable or not if this is necessary. In particular, moving from tools trained on standard newswire to processing social media text can be problematic; this is discussed in detail in Chapter 8. Similarly, some tools can be adapted easily to new languages (in particular, the first components in the chain such as tokenizers), while more complex tools such as parsers may be more difficult to adapt. In the following chapter, we introduce the task of Named Entity Recognition and show how the linguistic processing tools described in this chapter can be built on to accomplish this.
1. http://opennlp.apache.org/index.html
2. http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
3. http://gate.ac.uk
4. A good explanation of Unicode can be found at http://www.unicode.org/standard/WhatIsUnicode.html.
5. http://nlp.stanford.edu/software/tokenizer.shtml
6. http://www.nltk.org/
7. http://www.cs.ualberta.ca/~lindek/minipar.htm
8. http://nlp.stanford.edu/software/srparser.shtml
CHAPTER 3
Named Entity Recognition and Classification
3.1 INTRODUCTION
As discussed in Chapter 1, information extraction is the process of extracting information from unstructured text and turning it into structured data. Central to this is the task of named entity recognition and classification (NERC), which involves the identification of proper names in texts (NER), and their classification into a set of predefined categories of interest (NEC). Unlike the pre-processing tools discussed in the previous chapter, which deal with syntactic analysis, NERC is about automatically deriving semantics from textual content. The traditional core set of named entities, developed for the shared NERC task at MUC-6 [25], comprises Person, Organization, Location, and Date and Time expressions, such as Barack Obama, Microsoft, New York, 4th July 2015, etc.
NERC is generally an annotation task, i.e., to annotate a text with named entities (NEs), but it can involve simply producing a list of NEs which may then be used for other purposes, including creating or extending gazetteers to assist with the NE annotation process in future. It can be subdivided into two tasks: the recognition task, involving identifying the boundaries of an NE (typically referred to as NER); and named entity classification (NEC), involving detecting the class or type of the NE. Slightly confusingly, NER is often used to mean the combination of the two tasks, especially in older work; here we stick to using NERC for the combined task and NER for only the recognition element. For more fine-grained NEC than the standard Person, Organization, and Location classification, classes are often taken from an ontology schema and are subclasses of these [26]. The main challenge for NEC is that NEs can be highly ambiguous (e.g., “May” can be a person’s name or a month of the year; “Mark” can be a person’s name or a common noun). Partly for this reason, the two tasks of NER and NEC are typically solved as a single task.
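A simple gazetteer-based annotator illustrates both sub-tasks and why ambiguity makes NEC hard. The sketch below (with an invented toy gazetteer, not any toolkit's implementation) finds entity boundaries by longest match over the token sequence (the NER part) and returns the candidate classes listed for each match (the NEC part); a real system would then use context to pick a single class for ambiguous entries such as "May".

```python
def annotate(tokens, gazetteer):
    """Longest-match gazetteer annotation over a token list.

    Scans left to right, preferring multi-token entries (e.g. 'Barack
    Obama') over shorter ones, and returns (mention, candidate classes)
    pairs. Class disambiguation is left to a downstream component.
    """
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first (up to 3 tokens here).
        for j in range(min(len(tokens), i + 3), i, -1):
            name = " ".join(tokens[i:j])
            if name in gazetteer:
                spans.append((name, sorted(gazetteer[name])))
                i = j
                break
        else:
            i += 1  # no entry starts at this token
    return spans

gaz = {"Barack Obama": {"Person"}, "New York": {"Location"},
       "May": {"Person", "Date"}}
tokens = "Barack Obama visited New York in May".split()
print(annotate(tokens, gaz))
# [('Barack Obama', ['Person']), ('New York', ['Location']),
#  ('May', ['Date', 'Person'])]
```

Note that "May" comes back with two candidate classes: this is exactly the class-level ambiguity that motivates solving NER and NEC jointly with contextual features rather than by lookup alone.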
A further task regarding named entities is named entity linking (NEL). The NEL task is to recognize if a named entity mention in a text corresponds to any NEs in a reference knowledge base. A named entity mention is an expression in the text referring to a named entity: this may be under different forms, e.g., “Mr. Smith” and “John Smith” are both mentions (textual representations) of the same real-world entity, expressed by slightly different linguistic realizations. The reference knowledge base used is typically Wikipedia. NEL is even more challenging than NEC because distinctions do not only have to be made on the class-level, but also within classes. For example, there are many persons with the name “John Smith.” The more popular the names are, the more difficult the NEL task becomes. A further problem, which all knowledge base–related tasks have, is that knowledge bases are incomplete; for example, they will only contain the most famous people named “John Smith.” This is particularly challenging when working on tasks involving recent events, since there is often a time lag between newly emerging entities appearing in the news or on social media and the updating of knowledge bases with their information. More details on named entity linking, along with relevant reference corpora, are given in Chapter 5.
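The core of NEL can be sketched as candidate lookup plus ranking. The fragment below is a deliberately minimal illustration: the knowledge-base entries and popularity scores are invented, and real linkers use rich context rather than a bare prior. It shows the three points made above: different mentions ("Mr. Smith", "John Smith") resolving to the same entity, within-class ambiguity resolved here by a popularity prior, and KB incompleteness handled by returning a NIL (None) link.

```python
# Toy knowledge base: mention string -> [(entity id, popularity prior)].
# All entries and scores are invented for illustration.
KB = {
    "john smith": [("John_Smith_(explorer)", 0.7),
                   ("John_Smith_(economist)", 0.3)],
    "mr. smith": [("John_Smith_(explorer)", 0.7)],
}

def link(mention):
    """Return the highest-prior KB entity for a mention, or None (NIL)."""
    candidates = KB.get(mention.lower(), [])
    if not candidates:
        return None  # entity missing from the (incomplete) KB
    return max(candidates, key=lambda c: c[1])[0]

print(link("John Smith"))  # John_Smith_(explorer)
print(link("Mr. Smith"))   # John_Smith_(explorer) -- same entity, new form
print(link("Jane Doe"))    # None
```

The popularity prior is also what makes common names hard: the more "John Smith"s the KB contains, the weaker the prior becomes as a disambiguation signal.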
3.2 TYPES OF NAMED ENTITIES
The reason that Person, Organization, Location, Date, and Time have become so popular as standard types of named entity is due largely to the Message Understanding Conference series (MUC) [25], which introduced the Named Entity Recognition and Classification task in 1995 and which drove the initial development of many systems which are still in existence today. Due to the expansion of NERC evaluation efforts (described in more detail in Section 3.3) and the need for using NERC tools in real-life applications, other kinds of proper nouns and expressions gradually also started to be considered as named entities, according to the task, such as newspapers, monetary amounts, and more fine-grained