Diana Inkpen

Natural Language Processing for Social Media


Скачать книгу

Web. We presented semantic analysis in social media as a new opportunity for big data analytics and for intelligent applications. Social media monitoring and analyzing of the continuous flow of user-generated content can be used as an additional dimension which contains valuable information that would not have been available from traditional media and newspapers. In addition, we mentioned the challenges with social media data, which are due to their large size, and to their noisy, dynamic, and unstructured nature.

       1 http://people.eng.unimelb.edu.au/tbaldwin/pubs/starsem2014.pdf

       2 http://www.statista.com/

       3 http://www.cision.com/uk/files/2013/10/social-journalism-study-2013.pdf

       4 https://aclweb.org/anthology/W/W14/#1300

       5 https://aclweb.org/anthology/W/W13/#1100

       6 https://aclweb.org/anthology/W/W12/#2100

       7 https://aclweb.org/anthology/W/W12/#2100

       8 https://aclweb.org/anthology/W/W11/#0700

       9 http://www.scc.lancs.ac.uk/microposts2015/

       10 http://www.scc.lancs.ac.uk/microposts2014/

       11 http://oak.dcs.shef.ac.uk/msm2013/

       12 htpp://ceur-ws.org/Vol-838/

       13 htpp://ceur-ws.org/Vol-718/

       14 https://sites.google.com/site/socialnlp/2nd-socialnlp-workshop

       15 https://sites.google.com/site/socialnlp/1st-socialnlp-workshop

      CHAPTER 2

       Linguistic Pre-processing of Social Media Texts

      In this chapter, we discuss current Natural Language Processing (NLP) linguistic pre-processing methods and tools that were adapted for social media texts. We survey the methods used for adaptation to this kind of texts. We briefly define the evaluation measures used for each type of tool in order to be able to mention the state-of-the-art results.

      In general, evaluation in NLP can be done in several ways:

      • manually, by having humans judge the output of each tool;

      • automatically, on test data that humans have annotated with the expected solution ahead of time; and

      • task-based, by using the tools in a task and evaluating how much they contribute to the success in the task.

      We primarily focus on the second approach here. It is the most convenient since it allows the automatic evaluation of the tools repeatedly after changing/improving their methods, and it allows comparing different tools on the same test data. Care should be taken when human judges annotate data. There should be at least two annotators that are given proper instructions on what and how to annotate (in an annotation manual). There needs to be a reasonable agreement rate between the two or more annotators, to ensure the quality of the obtained data. When there are disagreements, the expected solution will be obtained by resolving the disagreements by taking a vote (if there are three annotators or more, an odd number), or by having the annotators discuss until they reach an agreement (if there are only two annotators, or an even number). When reporting the inter-annotator agreement for a dataset, the kappa statistic also needs to be reported, in order to compensate the obtained agreement for possible agreements due to chance [Artstein and Poesio, 2008, Carletta, 1996].

      NLP tools often use supervised machine learning, and the training data are usually annotated by human judges. In such cases, it is convenient to keep aside some of the annotated data for testing and to use the remaining data to train the models. Many of the methods discussed in this book use machine learning algorithms for automatic text classification. That is why we give a very brief introduction here. See, e.g., [Witten and Frank, 2005] for details of the algorithms and [Sebastiani, 2002] for how they can be applied to text data.

      A supervised text classification model predicts the label c of an input x, where x is a vector of feature values extracted from document d. The class c can take two or more possible values from a specified set (or even continuous numeric values, in which case the classifier is called a regression model). The training data contain document vectors for which the classes are provided. The classifier uses the training data to learn associations between features or combinations of features that are strongly associated with one of the classes but not with the other classes. In this way, the trained model can make predictions for unseen test data in the future. There are many classification algorithms. We name three classifiers most popular in NLP tasks.

      Decision trees take one feature at a time, compute its power of discriminating between the classes and build a tree with the most discriminative features in the upper part of the tree; decision trees are useful because the models can be easily understood by humans. Naïve Bayes is a classifier that learns the probabilities of association between features and classes; these models are used because they are known to work well with text data (see a more detailed description in Section 2.8.1). Support Vector Machines (SVM) compute a hyper plane that separates two classes and they can efficiently perform nonlinear classification using what is called a kernel to map the data into a high-dimensional feature space where it become linearly separable [Cortes and Vapnik, 1995]; SVMs are probably the most often used classifiers due to their high performance on many tasks.

      A sequence-tagging model can be seen as a classification model, but fundamentally differs from a conventional one, in the sense that instead of dealing with a single input x and a single label c each time, it predicts a sequence of labels c = (c1, c2,…, cn) based on a sequence of inputs x = (x1, x2,…, xn) and the predictions from the previous steps. It was applied with success in natural language processing (for sequential data such as sequences of part-of-speech tags, discussed in the previous chapter) and in bioinformatics (for DNA sequences). There exist a number of sequence-tagging models, including Hidden Markov Model (HMM) [Baum and Petrie, 1966], Conditional Random Field (CRF) [Lafferty et al., 2001], and Maximum Entropy Markov Model (MEMM) [Berger et al., 1996].

      The remainder of this chapter is structured as follows. Section 2.2 discusses generic methods of adapting NLP tools to social media texts. The next five sections discuss NLP tools of interest: tokenizers, part-of-speech taggers, chunkers, parsers, and named entity recognizers, as well as adaptation techniques for each. Section 2.7 enumerates the existing toolkits that were adapted to social media texts in English. Section 2.8 discusses multi-lingual aspects and language identification issues in social media. Section 2.9 summarizes this chapter.

      NLP tools are important because they need to be used before we can build any applications that aim to understand texts or extract useful information from texts. Many NLP tools are now available, with acceptable levels of accuracy on texts that are similar to the types of texts used for training the models embedded in these tools. Most of the tools are trained on carefully edited texts, usually newspaper texts, due to the wide availability of these kinds of texts. For example, the Penn TreeBank