learning approaches, which learn weights for features, based on their probability of appearing with negative vs. positive training examples for specific NE types. The general supervised learning approach consists of five stages:

      • linguistic pre-processing;

      • feature extraction;

      • training models on training data;

      • applying models to test data;

      • post-processing the results to tag the documents.

      Linguistic pre-processing at the minimal level includes tokenization and sentence splitting. Depending on the features used, it can also include morphological analysis, part-of-speech tagging, co-reference resolution, and parsing, as described in Chapter 2. Popular features, illustrated in the short sketch after this list, include:

      • Morphological features: capitalization, occurrence of special characters (e.g., $, %);

      • Part-of-speech features: tags of the occurrence;

      • Context features: words and POS of words in a window around the occurrence, usually of 1–3 words;

      • Gazetteer features: appearance in NE gazetteers;

      • Syntactic features: features based on a parse of the sentence;

      • Word representation features: features based on unsupervised training on unlabeled text, using, e.g., Brown clustering or word embeddings.
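      To make these features concrete, the following is a minimal sketch (not taken from the book) of a token-level feature extractor in Python, of the kind typically fed to a statistical NERC model; the feature names, the window size of two, and the toy gazetteer are illustrative assumptions.

PERSON_GAZETTEER = {"john", "mary", "smith"}  # toy gazetteer, for illustration only

def token_features(tokens, pos_tags, i):
    """Collect features for the token at position i in a sentence."""
    word = tokens[i]
    feats = {
        # morphological features
        "word.lower": word.lower(),
        "word.is_capitalized": word[0].isupper(),
        "word.has_special_char": any(c in "$%" for c in word),
        # part-of-speech feature
        "pos": pos_tags[i],
        # gazetteer feature
        "in_person_gazetteer": word.lower() in PERSON_GAZETTEER,
    }
    # context features: words and POS tags in a window of 2 tokens either side
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats["word[%d]" % offset] = tokens[j].lower()
            feats["pos[%d]" % offset] = pos_tags[j]
    return feats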

      Statistical NERC approaches use a variety of models, such as Hidden Markov Models (HMMs) [51], Maximum Entropy models [52], Support Vector Machines (SVMs) [53, 54, 55], Perceptrons [56, 57], Conditional Random Fields (CRFs) [58, 59], or neural networks [60]. The most successful NERC approaches include those based on CRFs and, more recently, multilayer neural networks. We refer readers interested in learning more about those machine learning algorithms to [61, 62].

      CRFs model NERC as a sequence labeling approach, i.e., the label for a token is modeled as dependent on the label of preceding and following tokens in a certain window. Examples of frameworks which are available for CRF-based NERC are Stanford NER3 and CRFSuite.4 Both are distributed with feature extractors and models trained on the CoNLL 2003 data [28].
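      As an illustration of this sequence labeling setup, the sketch below trains a CRF on a single hand-made sentence using the sklearn-crfsuite wrapper around CRFsuite; the toy sentence, the BIO labels, and the deliberately minimal feature function are assumptions made purely for the example.

import sklearn_crfsuite

def feats(tokens, i):
    # deliberately minimal features; a real system would use a richer set
    return {
        "word.lower": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<S>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
    }

tokens = ["John", "Smith", "works", "for", "Acme", "Corp", "."]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O"]  # BIO encoding

X_train = [[feats(tokens, i) for i in range(len(tokens))]]  # one sentence
y_train = [labels]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",            # L-BFGS training
    c1=0.1, c2=0.1,               # L1 / L2 regularization
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])    # predicted label sequence for the sentence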

      Multi-layer neural network approaches have two advantages. First, they learn latent features, meaning they do not require linguistic processing beyond sentence splitting and tokenization. This makes them more robust across domains than architectures based on explicit features, since they do not have to compensate for mistakes made during pre-processing. Second, they can easily incorporate unlabeled text, on which representation feature extraction methods can be trained. The state-of-the-art system for NERC, SENNA [60], uses such a multi-layer neural network architecture with unsupervised pre-training. It is available as a stand-alone distribution5 or as part of the DeepNL framework.6 Like the frameworks above, it is distributed with feature extractors and offers functionality for training models on new data.
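      The following PyTorch sketch illustrates the general idea of such an architecture in the style of a window-based tagger: word embeddings, which could be initialized from unsupervised pre-training, are concatenated over a small context window and passed through a multi-layer network that scores NE tags for the centre word. It is not the actual SENNA implementation, and all sizes are illustrative.

import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, window=2, hidden=300, n_tags=9):
        super().__init__()
        self.window = window
        self.embed = nn.Embedding(vocab_size, emb_dim)   # optionally pre-trained
        self.mlp = nn.Sequential(
            nn.Linear((2 * window + 1) * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_tags),                   # one score per NE tag
        )

    def forward(self, window_ids):
        # window_ids: (batch, 2*window+1) token ids around the centre word
        emb = self.embed(window_ids)                     # (batch, 2w+1, emb_dim)
        return self.mlp(emb.flatten(1))                  # (batch, n_tags)

# Toy usage: score one 5-token window drawn from a vocabulary of 10,000 words
model = WindowTagger(vocab_size=10_000)
scores = model(torch.randint(0, 10_000, (1, 5)))
print(scores.shape)   # torch.Size([1, 9])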

      There are advantages and disadvantages to a supervised learning approach for NERC compared with a knowledge engineering, rule-based approach. Both require manual effort: rule-based approaches require specialist language engineers to develop hand-coded rules, whereas supervised learning approaches require annotated training data, for which language engineers are not needed. Which kind of approach is better suited for an application scenario depends on the application and the domain. For popular domains, such as newswire, hand-labeled training data is already available, whereas for others, it might need to be created from scratch. If the linguistic variation in the text is very small and quick results are desired, hand-coding rules might be a better starting point.

      GATE’s general purpose named entity recognition and classification system, ANNIE, is a typical example of a rule-based system. It was designed for traditional NERC on news texts but, being easily adaptable, can also form the starting point for new NERC applications in other languages and for other domains. GATE also contains tools for machine learning, so it can be used to train models for NERC, based on the pre-processing components described in Chapter 2. Other well-known systems are UIMA,7 developed by IBM, which focuses more on architectural support and processing speed, and offers a number of resources similar to GATE’s; OpenCalais,8 which provides a web service for semantic annotation of text for traditional named entity types; and LingPipe,9 which provides a (limited) set of machine learning models for various tasks and domains. While these are very accurate, they are not easily adaptable to new applications. Components from all these tools are actually included in GATE, so that a user can mix and match various resources as needed, or compare different algorithms on the same corpus. However, the components provided are mainly in the form of pre-trained models, and do not typically offer the full functionality of the original tools.

      The Stanford NER package, included in the Stanford CoreNLP pipeline, is a Java implementation of a Named Entity Recognizer. It comes with well-engineered feature extractors for NERC, and has a number of options for defining these. In addition to the standard 3-class model (Person, Organization, Location), it also comes with models for other languages and models trained on different datasets. The methodology used is a general implementation of linear-chain Conditional Random Field (CRF) sequence models, and thus the user can easily retrain it on any labeled data they have. The Stanford NER package is also used in NLTK, which does not have its own NERC tool.
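      For instance, the 3-class model can be called from Python through NLTK's Stanford NER wrapper, as in the sketch below; the file paths are assumptions that depend on where the Stanford NER distribution has been unpacked, and a Java runtime must be installed.

from nltk.tag import StanfordNERTagger

st = StanfordNERTagger(
    "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",  # 3-class model
    "stanford-ner/stanford-ner.jar",                                   # NER jar
)

tokens = ["Barack", "Obama", "visited", "Sheffield", "in", "2015", "."]
print(st.tag(tokens))
# e.g., [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('visited', 'O'),
#        ('Sheffield', 'LOCATION'), ('in', 'O'), ('2015', 'O'), ('.', 'O')]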

      OpenNLP contains a NameFinder module for English NERC which has separate models for the standard 7-type MUC classification (Person, Organization, Location, Date, Time, Money, Percent), trained on standard freely available datasets. It also has models for Spanish and Dutch trained on CoNLL data. As with the Stanford NER tool, the user can easily retrain the NameFinder on any labeled dataset. Like the other learning-based tools mentioned above, these tools rely on supervised learning and therefore work well only when large amounts of annotated training data are available, so applying them to new domains and text types can be quite problematic if such data does not exist.

      An example of a system that performs fine-grained NERC is FIGER [63],10 which is trained on Wikipedia. The tag set for FIGER is made up of 112 types, which are derived from Freebase by selecting the most frequent types and merging fine-grained types. The goal is to perform multi-class, multi-label classification, i.e., each sequence of words is assigned one or more of the available types, or no type at all. Training data for FIGER is created by exploiting the anchor text of entity mentions annotated in Wikipedia, i.e., for each sequence of words in a sentence, the sequence is automatically mapped to a set of Freebase types and used as positive training data for those types. The system is trained using a two-step process: training a CRF model for named entity boundary recognition, then an adapted perceptron algorithm for named entity classification. Typically, a CRF model would be used for doing both at once (e.g., [64]), but this is avoided here due to the large set of NE types. As with the other NERC tools, it can easily be retrained on new data.
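      The distant-supervision idea behind this training data can be sketched as follows; the entity-to-type mapping, the type names, and the anchor format below are hypothetical placeholders rather than FIGER's actual resources.

# Hypothetical mapping from linked Wikipedia entities to fine-grained types
ENTITY_TYPES = {
    "Barack_Obama": {"/person", "/person/politician", "/person/author"},
    "Harvard_University": {"/organization", "/organization/educational_institution"},
}

def anchors_to_training_examples(tokens, anchors):
    """anchors: list of (start, end, linked_entity) spans over `tokens`."""
    examples = []
    for start, end, entity in anchors:
        types = ENTITY_TYPES.get(entity)
        if types:  # the span becomes a positive example for each of its types
            examples.append((tokens[start:end], sorted(types)))
    return examples

sentence = ["Barack", "Obama", "studied", "at", "Harvard", "University", "."]
anchors = [(0, 2, "Barack_Obama"), (4, 6, "Harvard_University")]
print(anchors_to_training_examples(sentence, anchors))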

      NERC in tweets is currently a very active research area, since many tasks rely on the analysis of social media, as we will discuss in Chapter 8. Social media is a particular challenge for NERC due to its noisy nature (incorrect spelling, punctuation, capitalization, novel use of words, etc.), which affects both the pre-processing components required (and thus has a knock-on effect on NERC performance) and the named entities themselves, which become harder to recognize. Due to the lack of annotated corpora, performing NERC on social media data using a learning approach is generally viewed as a domain adaptation problem from newswire text, often integrating the two kinds of data for training [65] and including a tweet normalization step [66]. One particular challenge is recency: the kinds of NEs that we want to recognize in social media are often newly emerging (recent news stories about people who were not previously famous, for