4.1 QUANTITATIVE ANALYSIS
We begin by surveying common quantitative methods for analyzing social data. We first summarize methods for identifying and filtering relevant data, then for analyzing the data (for example, by extracting trends), and finally for validating the extracted information. This pipeline of quantitative methods is illustrated in Figure 4.1.
We will use one of the most common applications of social monitoring, influenza surveillance, as our running example (with other tasks mentioned as needed) to illustrate the quantitative methodologies, but these methods are applicable to other public health problems as well.
The goal of influenza surveillance (described later in Section 5.1.1) is to measure the prevalence of influenza (flu) infection in a population. Official monitoring by government health agencies is delayed by at least one to two weeks, so social media has been used as a real-time supplementary source of monitoring. If you are familiar with social monitoring of influenza, you may find it strange that we chose to use it as our running example: the most popular system, Google Flu Trends, has been widely criticized for being unreliable. However, keep in mind that Google Flu Trends was one of the earliest systems to do this, using methods that are limited by today’s standards. While the system resulted in substantial errors, they are errors that could have been avoided using more sophisticated techniques, including those implemented by Google Flu Trends itself in later iterations [Santillana, 2017]. The takeaway is not that social monitoring for flu doesn’t work, but that it must be done thoughtfully and validated extensively. We will point out potential pitfalls as we go along, discussing validation in Sections 4.1.4 and 4.2.1, with general limitations discussed extensively later in Chapter 6.
Figure 4.1: A standard pipeline of quantitative methods for inferring trends from social data. The various steps are described in the indicated sections.
4.1.1 CONTENT ANALYSIS AND FILTERING
The first step in any data-driven project is to ensure you have the data! In social monitoring, where the data come in the form of tweets or other messages on a wide variety of topics, it can be challenging to know whether the available data support your research aims. Before investing time in planning a project, or collecting and processing data, you should determine whether the data support your goals. We typically advise researchers to identify 10 messages (by hand or through keyword search) that exemplify the data needed for the project. For example, Twitter provides a web search interface that makes these types of explorations easy.1 This process can also help you decide the best method for filtering the data. If you can’t find enough data at this stage, it’s unlikely you’ll be able to automatically mine the needed data.
Once you know what you are looking for, you are ready to filter the data down to the subset relevant to the public health task at hand. For example, if the task is disease surveillance, then one must identify content that discusses the target disease (e.g., influenza). Approaches to filtering include searching for messages that match certain phrases, or using more sophisticated machine learning methods to automatically identify relevant content. We now describe these approaches in more detail.
Keyphrase Filtering or Rule-based Approaches
Arguably the simplest method for collecting relevant content is to filter for data (e.g., social media messages or search queries) containing certain keywords or phrases relevant to the task. For example, researchers have experimented with Twitter-based influenza surveillance by filtering for tweets containing words like “flu” or “fever” [Chew and Eysenbach, 2010, Culotta, 2010, 2013, Lampos and Cristianini, 2010]. For Twitter data, tweets matching certain terms can straightforwardly be collected using Twitter’s Search API, described in Section 3.5. We note that there exist clinically validated sets of keywords for measuring certain psychological properties, such as emotions [Pennebaker et al., 2001].
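To make this concrete, the following is a minimal Python sketch of keyphrase filtering over a handful of already-collected messages. The message list and keyword set are illustrative only; in practice the keywords should be chosen and validated for the surveillance task at hand.

```python
import re

# Hypothetical list of collected messages; in practice these would come from
# the Twitter Search API or another data source.
messages = [
    "Stuck at home with the flu, fever won't break",
    "New study on influenza vaccines published today",
    "Beautiful weather for a run this morning",
]

# Illustrative (not validated) keywords for influenza surveillance.
keywords = ["flu", "fever", "influenza"]

# Match any keyword as a whole word, case-insensitively.
pattern = re.compile(r"\b(" + "|".join(keywords) + r")\b", re.IGNORECASE)

relevant = [m for m in messages if pattern.search(m)]
print(relevant)  # keeps the first two messages, drops the third
```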
Keyword and phrase-based filtering is thought to be especially effective for search queries, which are typically very short and direct, compared to longer text, like social media messages [Carmel et al., 2014]. Search-driven systems like Google Flu Trends [Ginsberg et al., 2009] rely on the volume of various search phrases. Most research that uses search query volumes is in fact restricted to phrase-based filtering, as data available through services such as Google Trends (described in Section 3.5) come as aggregate statistics about certain search terms, rather than the raw text that is searched, which is private data.
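For readers who wish to experiment with such aggregate statistics, the sketch below uses the third-party pytrends library (an unofficial Python client for Google Trends, assumed to be installed). It retrieves normalized search volume over time for the term “flu,” not the raw queries themselves, consistent with the aggregate-only access described above.

```python
from pytrends.request import TrendReq

# Connect to Google Trends via the unofficial pytrends client.
pytrends = TrendReq(hl="en-US", tz=360)

# Request interest over time for the search term "flu" in the U.S.
pytrends.build_payload(kw_list=["flu"], timeframe="today 5-y", geo="US")

# Returns a pandas DataFrame of normalized (0-100) search volume by date.
trends = pytrends.interest_over_time()
print(trends.head())
```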
A special type of keyword is a hashtag. Hashtags are user-created labels (denoted with the # symbol) used to organize messages by topic, primarily in status updates (e.g., on Twitter) or photo captions (e.g., on Instagram). Because hashtags are widely shared across users as topical labels, they can serve as useful filters for health monitoring. For example, if one were interested in understanding physical activity in a population, one might search for hashtags such as #workout or #running. However, additional filtering may be needed to distinguish between messages by ordinary users and by advertisers or media outlets, e.g., “I had a great #workout today!” vs. “Top 10 #Workout Tips.” Rafail [2017] cautions that hashtag-based samples of tweets can be biased in unexpected ways.
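A minimal sketch of hashtag-based filtering follows, assuming posts have already been collected; the hashtag set and example posts are illustrative.

```python
import re

posts = [
    "I had a great #workout today!",
    "Top 10 #Workout Tips",
    "Lazy Sunday, no plans",
]

# Illustrative hashtags of interest for monitoring physical activity.
activity_tags = {"#workout", "#running"}

def hashtags(text):
    """Extract hashtags from a post, lowercased for matching."""
    return {tag.lower() for tag in re.findall(r"#\w+", text)}

matched = [p for p in posts if hashtags(p) & activity_tags]
print(matched)  # both #workout posts match; further filtering would be needed
                # to separate personal posts from promotional content
```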
Beyond searching for keywords or hashtags, other rules can be applied to filter for data. For example, one might choose to exclude tweets that contain URLs, which are less likely to be relevant for flu surveillance [Lamb et al., 2013]. By using machine learning, described in the next subsection, systems can learn which characteristics to favor or disfavor, rather than defining hard rules by hand.
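The sketch below illustrates one such hand-written rule, excluding messages that contain URLs; the URL pattern and example tweets are illustrative rather than a validated filter.

```python
import re

# Simple, approximate pattern for detecting URLs in a message.
url_pattern = re.compile(r"https?://\S+")

def passes_rules(message):
    """Return True if a message passes simple hand-written filtering rules."""
    # Rule: exclude messages containing URLs, which are less likely to be
    # relevant for flu surveillance (following the observation above).
    if url_pattern.search(message):
        return False
    return True

tweets = [
    "Ugh, pretty sure I'm coming down with the flu",
    "Flu season is here: read our top prevention tips https://example.com/flu",
]
print([t for t in tweets if passes_rules(t)])  # keeps only the first tweet
```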
Machine Learning Classification
Keyword-based filtering is limited because it does not distinguish between different contexts in which words or phrases appear. For example, not all tweets that mention “flu” indicate that the user is sick with the flu; a tweet might also discuss influenza in other contexts (for example, reporting on news of laboratory experiments on influenza) that are not relevant to surveillance.
A more sophisticated approach is to use machine learning to categorize data by relevance based on a larger set of characteristics than words alone. An algorithm that automatically assigns a label to a data instance (e.g., a social media message) is called a classifier. A classifier takes a message as input and outputs a discrete label, such as whether or not the message is relevant. For example, Aramaki et al. [2011] and Achrekar et al. [2012] constructed classifiers to identify tweets that are relevant to flu surveillance. Others have built classifiers to identify tweets that are relevant to health in general [Paul and Dredze, 2011, Prieto et al., 2014, Yin et al., 2015]. Lamb et al. [2013] combined multiple classifiers into a pipeline of filtering steps: first, a classifier identifies whether a message is relevant to health, and if so, a second classifier identifies whether it is relevant to flu.
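To show how such a filtering cascade fits together, here is an illustrative sketch that chains two classifiers in the spirit of this pipeline. It assumes two already-trained classifier objects and a shared feature extractor following the scikit-learn transform()/predict() convention; it is not the implementation used in the cited work.

```python
def filter_flu_tweets(messages, health_clf, flu_clf, vectorizer):
    """Two-stage filtering: keep only messages judged health-related, then
    flu-related, by two pre-trained classifiers (assumed to follow the
    scikit-learn predict() convention)."""
    features = vectorizer.transform(messages)

    # Stage 1: is the message about health at all?
    is_health = health_clf.predict(features)
    health_msgs = [m for m, keep in zip(messages, is_health) if keep]
    if not health_msgs:
        return []

    # Stage 2: among health-related messages, which discuss the flu?
    flu_features = vectorizer.transform(health_msgs)
    is_flu = flu_clf.predict(flu_features)
    return [m for m, keep in zip(health_msgs, is_flu) if keep]
```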
Classifiers learn to distinguish positive and negative instances by analyzing a set of labeled examples, and patterns learned from these “training” examples can then be used to make inferences about new instances in the future. Because training data is provided as examples, this approach is called supervised machine learning.
Common classification models include support vector machines (SVMs) and logistic regression, the latter sometimes called a maximum entropy (MaxEnt) classifier in machine learning [Berger et al., 1996]. Logistic regression is widely used in public health, though traditionally as a tool for data analysis (see the discussion of regression analysis in Section 4.1.3) rather than as a classifier that predicts labels for new data. Recent advances in neural networks (loosely, models that stack and combine simpler classifiers into more complex models) have made this type of model attractive for classification [Goldberg, 2017]. While more computationally intensive, neural networks can give state-of-the-art performance on many classification tasks.
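As a toy illustration of supervised classification, the sketch below trains a logistic regression classifier on a few hand-labeled example messages using scikit-learn; an SVM (e.g., LinearSVC) could be substituted with a one-line change. The training messages and labels are invented for illustration, and a real system would require far more labeled data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# from sklearn.svm import LinearSVC  # an SVM could be substituted here

# Toy labeled training data: 1 = reports flu infection, 0 = not relevant.
train_texts = [
    "home sick with the flu today",
    "fever and chills, definitely the flu",
    "interesting article about flu vaccine research",
    "flu shots available at the pharmacy this week",
]
train_labels = [1, 1, 0, 0]

# Represent each message as a bag-of-words feature vector.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Train a logistic regression (maximum entropy) classifier.
clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Apply the trained classifier to a new, unseen message.
new_texts = ["I think I caught the flu from my roommate"]
print(clf.predict(vectorizer.transform(new_texts)))  # predicted label(s)
```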