is GATE. A new plugin called TwitIE21 is available [Derczynski et al., 2013a] for the tokenization of Twitter texts, as well as for POS tagging, named entity recognition, etc.
Two new toolkits were built especially for social media texts: the TweetNLP tools developed at CMU and the Twitter NLP tools developed at the University of Washington (UW).
TweetNLP is a Java-based tokenizer and part-of-speech tagger for Twitter text [Owoputi et al., 2013]. It includes training data of manually POS-annotated tweets (noted above), a Web-based annotation tool, and hierarchical word clusters derived from unlabeled tweets.22 It also includes the TweeboParser mentioned above.
The UW Twitter NLP Tools [Ritter et al., 2011] contain the POS tagger and the annotated Twitter data (mentioned above—see adaptation of POS taggers).23
A few other tools for English are in development, and a few tools for other languages have been adapted or can be adapted to social media text. The development of the latter is slower, due to the difficulty in producing annotated training data for many languages, but there is progress. For example, a treebank for French social media texts was developed by Seddah et al. [2012].
2.8 MULTI-LINGUALITY AND ADAPTATION TO SOCIAL MEDIA TEXTS
Social media messages are available in many languages. Some messages may be mixed, for example, partly in English and partly in another language; this is called “code switching.” If tools for multiple languages are available, a language identification tool needs to be run on the texts first, in order to select the right language-specific tools for the subsequent processing steps.
2.8.1 LANGUAGE IDENTIFICATION
Language identification can reach very high accuracy for long texts (98–99%), but it needs adaptation to social media texts, especially to short texts such as Twitter messages.
Derczynski et al. [2013b] showed that language identification accuracy decreases to around 90% on Twitter data, and that re-training can raise it to 95–97%. This increase is easily achievable for tools that classify into a small number of languages, while tools that classify into a large number of languages (close to 100) cannot be further improved on short informal texts. Lui and Baldwin [2014] tested six language identification tools and obtained the best results on Twitter data by majority voting over three of them, reaching an F-score of 0.89.
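The voting setup used by Lui and Baldwin [2014] can be sketched as a simple majority vote over the labels returned by several identifiers for the same message (the tie-breaking policy here is an illustrative assumption, not their exact scheme):

```python
from collections import Counter

def majority_vote(predictions, fallback=None):
    """Return the language label predicted by most identifiers.

    If every tool disagrees (all counts are 1), fall back to a
    designated tool's output when one is provided.
    """
    counts = Counter(predictions)
    label, count = counts.most_common(1)[0]
    if count == 1 and fallback is not None:
        return fallback
    return label

# e.g., outputs from three language identifiers for one tweet
print(majority_vote(["en", "en", "nl"]))  # -> en
```

A real pipeline would first map each tool's language codes onto a shared inventory, since off-the-shelf identifiers cover different language sets.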
Barman et al. [2014] presented a new dataset of Facebook posts and comments that exhibit code mixing between Bengali, English, and Hindi, and reported preliminary word-level language identification experiments on it. The methods surveyed included a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labeling with Conditional Random Fields. The preliminary results showed that supervised classification and sequence labeling outperform the dictionary-based approach, suggesting that contextual clues are necessary for accurate classifiers. The CRF model achieved the best result, an F-score of 0.95.
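The difference between word-level classification with and without contextual clues comes down to the feature set: besides features of the word itself, the classifier also sees the neighboring words. A minimal sketch of such a feature extractor (the concrete feature names are illustrative, not the exact set of Barman et al.):

```python
def word_features(tokens, i, context=1):
    """Features for token i: the word itself, simple character
    affixes, and the neighboring words as contextual clues."""
    word = tokens[i]
    feats = {
        "word": word.lower(),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "is_title": word.istitle(),
    }
    # contextual clues: surrounding words within a small window
    for off in range(1, context + 1):
        feats[f"prev{off}"] = tokens[i - off].lower() if i - off >= 0 else "<s>"
        feats[f"next{off}"] = tokens[i + off].lower() if i + off < len(tokens) else "</s>"
    return feats
```

These per-token feature dictionaries can feed either an independent word-level classifier or, as whole-sentence sequences, a CRF sequence labeler.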
There is a lot of work on language identification in social media. Twitter has been a favorite target, and a number of papers deal specifically with language identification of Twitter messages [Bergsma et al., 2012, Carter et al., 2013, Goldszmidt et al., 2013, Mayer, 2012, Tromp and Pechenizkiy, 2011]. Tromp and Pechenizkiy [2011] proposed a graph-based n-gram approach that works well on tweets. Lui and Baldwin [2014] looked specifically at the problem of adapting existing language identification tools to Twitter messages, including the challenges of obtaining data for evaluation and the effectiveness of proposed strategies. They tested several off-the-shelf tools on Twitter data (including a newly collected corpus for English, Japanese, and Chinese), before and after a simple cleaning of the messages, such as removing hashtags, mentions, emoticons, etc. The improvement after the cleaning was small. Bergsma et al. [2012] looked at less common languages, in order to collect language-specific corpora. The nine languages they focused on (Arabic, Farsi, Urdu, Hindi, Nepali, Marathi, Russian, Bulgarian, Ukrainian) use three different non-Latin scripts: Arabic, Devanagari, and Cyrillic. Their method for language identification was based on language models.
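The “simple cleaning” of tweets described above (removing hashtags, mentions, emoticons, and the like before running an identifier) might look as follows; the regular expressions are illustrative, and real emoticon inventories are considerably larger:

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# a small, illustrative emoticon pattern (eyes, optional nose, mouth)
EMOTICON = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\]")

def clean_tweet(text):
    """Strip Twitter-specific tokens before language identification."""
    for pattern in (URL, MENTION, HASHTAG, EMOTICON):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Check this :) #nlp @user http://t.co/x great results"))
# -> "Check this great results"
```

As the study found, such cleaning yields only a small improvement, since the remaining text is still short and informal.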
Most of the methods used only the text of the message, but Carter et al. [2013] also looked at the use of metadata, an approach unique to social media. They identified five microblog characteristics that can help in language identification: the language profile of the blogger, the content of an attached hyperlink, the language profile of other users mentioned in the post, the language profile of a tag, and the language of the original post, if the post is a reply. They then presented methods that combine the prior language class probabilities in a post-dependent and a post-independent way. Their test results on 1,000 posts in 5 languages (Dutch, English, French, German, and Spanish) showed a 5% improvement in accuracy over the baseline, and showed that post-dependent combinations of the priors achieved the best performance.
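One simple way to combine a content-based language distribution with metadata-derived priors, in the spirit of the approach just described, is linear interpolation; the weights and profiles below are purely hypothetical:

```python
def combine_priors(content_probs, prior_probs_list, weights):
    """Linearly interpolate a content-based language distribution
    with metadata-derived priors (e.g., the blogger's language
    profile, or the language of a linked page) and return the
    highest-scoring language."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    langs = set(content_probs) | {l for p in prior_probs_list for l in p}
    combined = {}
    for lang in langs:
        score = weights[0] * content_probs.get(lang, 0.0)
        for w, prior in zip(weights[1:], prior_probs_list):
            score += w * prior.get(lang, 0.0)
        combined[lang] = score
    return max(combined, key=combined.get)

# the blogger's profile tips an uncertain content-based decision
content = {"nl": 0.4, "en": 0.35, "de": 0.25}
blogger_profile = {"nl": 0.9, "en": 0.1}
print(combine_priors(content, [blogger_profile], [0.7, 0.3]))  # -> nl
```

A post-dependent combination would adjust the weights per post (for example, trusting the reply-chain prior only when the post actually is a reply), rather than using one fixed weighting.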
Taking a broader view of social media, Nguyen and Doğruöz [2013] looked at language identification in a mixed Dutch-Turkish Web forum. Mayer [2012] considered language identification of private messages between eBay users.
Here are some of the available tools for language identification.
• langid.py24 [Lui and Baldwin, 2012] works for 97 languages and uses a feature set selected from multiple sources, combined via a multinomial Naïve Bayes classifier.
• CLD2,25 the language identifier embedded in the Chrome Web browser,26 uses a Naïve Bayes classifier and script-specific tokenization strategies.
• LangDetect27 is a Naïve Bayes classifier, using a representation based on character n-grams without feature selection, with a set of normalization heuristics.
• whatlang [Brown, 2013] uses a vector-space model with per-feature weighting over character n-grams.
• YALI28 computes a per-language score using the relative frequency of a set of byte n-grams selected by term frequency.
• TextCat29 is an implementation of the method of Cavnar and Trenkle [1994]; it uses an ad hoc rank-order statistic over character n-grams.
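The Cavnar and Trenkle [1994] method behind TextCat builds a rank-ordered profile of the most frequent character n-grams per language, and classifies a document by the “out-of-place” distance between the document's profile and each language profile. A minimal sketch (profile size and penalty choices are simplified):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Rank-ordered list of the most frequent character n-grams."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    profile receive a maximum penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(rank - ranks.get(gram, penalty))
               for rank, gram in enumerate(doc_profile))

def identify(text, profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice the language profiles are built from much larger training corpora, which is precisely what makes the method brittle on very short, informal texts such as tweets.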
Only some of the available tools were trained directly on social media data.
• LDIG30 is an off-the-shelf Java language identification tool targeted specifically at Twitter messages. It has pre-trained models for 47 languages. It uses a document representation based on data structures named tries.31
• MSR-LID [Goldszmidt et al., 2013] is based on rank-order statistics over character n-grams, and Spearman’s coefficient to measure correlations. Twitter-specific training data was acquired through a bootstrapping approach.
Some datasets of social media texts annotated with language labels are available.
• The dataset of Tromp and Pechenizkiy [2011] contains 9,066 Twitter messages labeled with one of six languages: German, English, Spanish, French, Italian, and Dutch.32
• The Twituser language identification dataset33 of Lui and Baldwin [2014] for English, Japanese, and Chinese.
2.8.2 DIALECT IDENTIFICATION
Sometimes it is not enough that a language has been identified correctly. A case in point is Arabic. It is the official language in 22 countries, spoken by more than 350 million people worldwide.34 Modern Standard Arabic (MSA) is the written form of Arabic used in education; it is also the language of formal communication. Arabic dialects, or colloquial Arabic, are the varieties spoken daily by Arab people. There are more than 22 dialects; some countries share the same dialect, while several dialects may coexist with MSA within the same country. Arabic speakers prefer to use their own local dialect. Recently, more attention has been given to the Arabic dialects and to the written varieties of Arabic found on social networking sites such as chats, micro-blogs, blogs, and forums, which are the target of research on sentiment analysis and opinion extraction.