treat each message as a set of predictors, called features in machine learning, typically consisting of the words in a document, and sometimes longer phrases as well. Phrases of length n are called n-grams, while individual words are called unigrams. One can also use additional linguistic information as features. Natural language processing (NLP) is an area of computer science that involves processing human language, and a number of NLP tools exist to parse linguistic information from text. For example, Lamb et al. [2013] showed that classification performance can be improved by including linguistic features in addition to n-grams, like whether “flu” is used as a noun or adjective, or whether it is the subject or object of a verb.
We won’t get into the technical details of classification in this book, but many of the common toolkits for machine learning (a few of which are described at the end of this section) provide tutorials.
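To make the idea of n-gram features concrete, the sketch below builds a simple unigram-and-bigram classifier with scikit-learn; the handful of labeled example tweets are invented for illustration, not drawn from any real dataset.

```python
# A minimal sketch of n-gram feature extraction and classification with
# scikit-learn; the tiny labeled dataset below is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training tweets labeled 1 (reports flu infection) or 0 (does not)
tweets = [
    "home sick with the flu, fever all night",
    "getting my flu shot later today",
    "so sick of this traffic",
    "coughing and aching, pretty sure it's the flu",
]
labels = [1, 0, 0, 1]

# Unigram and bigram features (n-grams with n = 1 and n = 2)
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(tweets, labels)

print(model.predict(["fever and chills, think I caught the flu"]))
```

In practice, a real classifier would be trained on hundreds or thousands of labeled messages, and richer features (such as the linguistic features mentioned above) could be appended to the n-gram counts.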
Unsupervised Clustering and Topic Modeling
An alternative to classification is clustering. Clustering has the same goal as classification—organizing messages into categories—but the categories are not known in advance; rather, messages are grouped together automatically based on similarities. This is a type of unsupervised machine learning.
A popular method of clustering for text documents is topic modeling. In particular, probabilistic topic models are statistical models that treat text documents as if they are composed of underlying “topics,” where each topic is defined as a probability distribution over words and each document is associated with a distribution over topics. Topics can be interpreted as clusters of related words. In other words, topic models cluster together words into topics, which then allows documents with similar topics to be clustered. Probabilistic topic models have been applied to social media data for various scientific applications [Ramage et al., 2009], including for health [Brody and Elhadad, 2010, Chen et al., 2015b, Ghosh and Guha, 2013, Paul and Dredze, 2011, 2014, Prier et al., 2011, Wang et al., 2014].
The most commonly used topic model is Latent Dirichlet Allocation (LDA) [Blei et al., 2003], a Bayesian topic model. For the domain of health, Paul and Dredze developed the Ailment Topic Aspect Model (ATAM) [2011, 2014], an extension of LDA that explicitly identifies health concepts. ATAM creates two different types of topics: non-health topics, similar to LDA, as well as special “ailment” word distributions with words that are found in dictionaries of disease names, symptom terms, and treatments. Examples of ATAM ailments are shown in Figure 4.2.
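As an illustration of how a topic model is fit in practice, the sketch below uses the plain LDA implementation in scikit-learn (not ATAM, which is not part of standard toolkits); the example documents and parameter settings are illustrative only.

```python
# A minimal sketch of fitting plain LDA (not ATAM) with scikit-learn;
# the documents and parameter choices are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "fever cough sore throat staying home from work",
    "allergies acting up sneezing all morning",
    "dentist appointment for a toothache",
    "flu season again everyone at the office is sick",
]

# Bag-of-words counts, the standard input representation for LDA
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Learn 3 topics, each a probability distribution over the vocabulary
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic proportions

# Print the most probable words of each learned topic
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))
```

With a realistically large collection of tweets, the printed word lists would resemble the ailment clusters in Figure 4.2, though without ATAM's explicit separation of health and non-health topics.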
An advantage of topic models over simple phrase-based filtering is that they learn many words related to each concept: for example, words like “cough” and “fever” become associated with “flu.” When inferring the topic composition of a document, the entire context is taken into account, which can help disambiguate words with multiple meanings (e.g., “dance fever”). A disadvantage is that topic models are typically less accurate than supervised machine learning methods; the tradeoff is that they can learn without requiring annotated data. Another consideration is that topic models tend to discover broad, popular topics, so additional effort may be needed to uncover finer-grained issues [Prier et al., 2011].
Another use of topic models, or unsupervised methods in general, is for exploratory analysis. Unsupervised methods can be used to uncover the prominent themes or patterns in a large dataset of interest to a researcher. Once an unsupervised model has revealed the properties of a dataset, then one might use more precise methods such as supervised classification for specific topics of interest.
The technical details of probabilistic topic models are beyond the scope of this book. For an introduction, we recommend reading Blei and Lafferty [2009].
Which Approach to Use?
We have mentioned a variety of approaches to identifying social media content, including keyword filtering, classification, and topic modeling. These approaches have different uses and tradeoffs, so the choice of technique depends on the data and the task.
Most research using a large, general platform like Twitter will require keyword filtering as a first step, since relevant content makes up only a small portion of the overall data. The keywords may target a particular topic, like flu or vaccination, or health in general: for example, Paul and Dredze [2014] used a few hundred health-related keywords to collect a broad range of health tweets, which is still only a small sample of Twitter. Keyword filtering can be reasonably reliable for obtaining relevant content, although it may miss relevant data that uses terminology not in the keyword list, or capture irrelevant data that uses the terms in different ways (e.g., slang usage of “sick”). Classifiers can overcome these limitations, but they are time consuming to build, so they are generally a next step when keywords are insufficient. Topic models, on the other hand, are most often used for exploratory purposes (understanding what the content looks like at a high level) rather than for finding specific content.
Figure 4.2: Examples of ailment clusters discovered from tweets, learned with the Ailment Topic Aspect Model (ATAM) [Paul and Dredze, 2011]. The word clouds show the most probable words in each ailment, corresponding to (clockwise from top left) allergies, dental health, pain, and influenza-like illness.
These techniques are not mutually exclusive, and it is not unreasonable to combine all three. Let’s illustrate this with an example. Suppose you want to use social media to learn how people are responding to the recent outbreak of Zika, a virus that can cause birth defects and had been rare in recent years until a widespread outbreak in 2015 originating in Brazil. (In fact, several researchers have done just that [Dredze et al., 2016c, Ghenai et al., 2017, Juric et al., 2017, Miller et al., 2017, Muppalla et al., 2017, Stefanidis et al., 2017].)
You decide to study this on Twitter, which captures a large and broad population. The first step is to collect tweets about Zika. There aren’t many ways to refer to Zika without using its name (or perhaps its Portuguese spelling, Zica, or the abbreviation of the virus name, ZIKV). You might therefore start with a keyword filter for tweets containing “zika,” “zica,” or “zikv,” which would account for only a tiny fraction of Twitter but probably capture nearly all tweets that explicitly mention Zika.
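As a rough illustration, a keyword filter along these lines can be expressed in a few lines of Python; the sample tweets below are invented for the example.

```python
# A minimal sketch of the keyword filter described above; the sample
# tweets are illustrative only.
KEYWORDS = {"zika", "zica", "zikv"}

def mentions_zika(text):
    # Lowercase, split on whitespace, and strip common punctuation
    tokens = (tok.strip(".,!?#@") for tok in text.lower().split())
    return any(tok in KEYWORDS for tok in tokens)

tweets = [
    "New travel advisory issued over Zika concerns",
    "Stuck in traffic again this morning",
    "ZIKV case counts updated by the health ministry",
]
zika_tweets = [t for t in tweets if mentions_zika(t)]
print(zika_tweets)
```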
If you don’t already know what people discuss about Zika on Twitter (since it was not widely discussed until recently, after the outbreak), you might use a topic model as a starting point to identify the major themes of discussion in your dataset. After running and analyzing a topic model, you might find that in the context of Zika, people use Twitter to talk about the latest research, vaccine development, political and funding issues, pregnancy and birth issues, and travel bans and advisories.
Suppose you are interested in using social monitoring to learn how people are changing their behavior in response to the virus, so you decide to focus on topics related to pregnancy and travel. To narrow down to tweets on these topics, you could construct a list of additional keywords for filtering, maybe using the word associations learned by the topic model, or using your own ideas about relevant words, perhaps gained by manually reading a sample of tweets. Finally, if you need to identify tweets that can’t be captured with a simple keyword list (for example, you want to identify when someone mentions that they are personally changing travel plans, as opposed to more general discussion of travel advisories), then you should label some of the filtered tweets for relevance to your task and train a classifier to identify more such tweets.
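Putting that final step into code, the sketch below trains a relevance classifier on a hand-labeled sample of the keyword-filtered tweets using scikit-learn; the example texts and labels are hypothetical (1 = the author is personally changing travel plans, 0 = other discussion).

```python
# A minimal sketch of training a relevance classifier on hand-labeled,
# keyword-filtered tweets; the texts and labels below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

labeled_texts = [
    "canceling our trip to Brazil because of zika",
    "CDC issues new zika travel advisory for pregnant women",
    "we postponed the honeymoon over zika worries",
    "zika funding bill stalls in congress again",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(labeled_texts, labels)

# Apply the trained classifier to the remaining filtered tweets
print(clf.predict(["thinking about calling off our zika-area vacation"]))
```

A real study would use far more labeled examples and hold some out to estimate the classifier's accuracy before applying it to the full dataset.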
Tools and Resources
A number of free tools exist for the machine learning tasks described above, although most require some programming experience. For a guide aimed at a public health audience rather than computer scientists, see Yoon et al. [2013]. For computationally oriented researchers, we recommend the following machine learning tools.
• scikit-learn (http://scikit-learn.org) is a Python library for a variety of general-purpose machine learning tasks, including classification