Brownstein, 2014, Tausczik et al., 2012]. A limitation of Wikipedia logs as a data source is that, unlike most of these data sources, they do not contain information about the locations of readers (Section 3.4.2). Instead, researchers have used the language of articles as a proxy for location [Generous et al., 2014], for example by resolving French-language articles to France. However, this approach is coarse and unreliable, as many languages are spoken across multiple countries.
3.3.4 CROWDS AND MARKETS
Crowdsourcing is a method of obtaining feedback and assistance from large numbers of people using online services. For example, Amazon’s Mechanical Turk service is a general-purpose platform where users can post tasks to be completed, and other users are paid to complete the tasks [Buhrmester et al., 2011, Callison-Burch and Dredze, 2010, Goodman et al., 2013, Paolacci et al., 2010, Shapiro et al., 2013]. Crowdsourcing platforms allow for large-scale recruitment of workers to participate in projects.
Domain-specific crowdsourcing systems exist for health. For example, Flu Near You [Baltrusaitis et al., 2017, Crawley et al., 2014, Smolinski et al., 2015] is an application where users are periodically asked to share their health status—whether they are experiencing the flu—and this data can be used to estimate flu prevalence.
Crowd-based systems are a form of active monitoring, as discussed in Section 3.2.1. That is, learning about a population through crowdsourcing requires active involvement of the community, in contrast to the other platforms described above, in which publicly accessible information can be passively monitored.
Prediction markets are another way of harnessing crowds. Prediction markets are markets where future outcomes are traded—essentially, participants bet on what they think will happen—and prices can be used to measure the likelihood of different outcomes, according to the beliefs of the crowd. A few studies have shown prediction markets to be effective for forecasting diseases [Li et al., 2016, Polgreen et al., 2007, Tung et al., 2015].
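To make the link between prices and likelihoods concrete, the following is a minimal sketch rather than a description of any particular market. It assumes that each contract pays one unit if its outcome occurs, that the outcomes are mutually exclusive, and that normalizing prices so they sum to one is an acceptable simplification. The outcome names and prices are hypothetical.

```python
def implied_probabilities(prices):
    """Convert contract prices into normalized outcome probabilities.

    Assumes each contract pays 1 unit if its outcome occurs and that
    the outcomes are mutually exclusive (an illustrative simplification).
    """
    total = sum(prices.values())
    if total <= 0:
        raise ValueError("prices must sum to a positive value")
    # Dividing by the total rescales the prices so they sum to 1.
    return {outcome: price / total for outcome, price in prices.items()}


# Hypothetical prices for contracts on the severity of a flu season.
prices = {"low": 0.15, "moderate": 0.55, "high": 0.35}
probs = implied_probabilities(prices)

for outcome, p in probs.items():
    print(f"{outcome}: {p:.2f}")
```

In this sketch the raw prices sum to more than one, which can happen in real markets; normalization is one simple way to read them as the crowd's beliefs about the relative likelihood of each outcome.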
3.3.5 COMPARISON OF PLATFORMS
The choice of data source in this diverse landscape is motivated by the type of application. General-purpose social media is a good source for identifying common, real-time trends. Topics such as influenza and vaccines are often discussed in the population at large, and so are well-represented in general-purpose social media. Furthermore, the nature of this type of platform provides real-time data, making it a good resource for studying current trends. Moreover, general-purpose platforms include discussion on a variety of topics outside of health, which allows one to study how people’s habits and behaviors across a variety of domains interact with their health.
There are many general-purpose social media platforms, each with its own characteristics, features, and user population. See Osborne and Dredze [2014] for a comparison of some of these platforms.
In contrast, domain-specific social media is best suited for an in-depth study of a specific health condition, especially one that is not common in the general population. The communities surrounding specific diseases and health topics provide rich detail about the thoughts and behaviors of people engaged with the particular topic. Furthermore, many of these forums go back years, allowing for analysis of trends over a long period of time.
Search activity provides both real-time and historical capabilities. For example, Google Trends provides historical data back to 2004, as well as daily updates of search activity (and in some cases, hourly). Additionally, search queries cover a wide range of subjects and so can provide information on low-prevalence health conditions. However, search activity often misses the “why” of health behaviors. While we can sometimes ascertain the reason behind a query based on the keywords in a search, it is often impossible to know the user's intention. In short, search traffic can answer “what,” but not always “why.” Additionally, because search activity in the form publicly available to researchers is aggregated across users, we cannot undertake the type of user analysis, or the linking of multiple queries to a single user, that may be needed for fully understanding the data.
We note that not only are different platforms used in different ways, but they are also used for different topics of health discussion. De Choudhury et al. [2014b] compared the prevalence of mentions of health issues in tweets vs. search query logs, finding that more serious and stigmatizing conditions (e.g., sexually transmitted disease) are more prevalent in search logs than in tweets, while certain benign conditions (e.g., jet lag) are more prevalent on Twitter. The authors thus suggest caution when using Twitter to study high-stigma conditions, due to the apparent self-censorship being applied in public social media. However, a study of privacy settings on Facebook did not find large differences in content posted by public accounts vs. private accounts, which suggests that public social media data may not be as biased as previously believed [Fiesler et al., 2017].
Finally, the users of different platforms have different demographic characteristics; see Duggan et al. [2015] for a summary.
3.4 TYPES OF DATA
We will now discuss the various forms of data available from social media, such as text (e.g., from tweets or search queries), locations (e.g., precise coordinates or geographic entities), and social network information (e.g., friends and followers).
3.4.1 CONTENT
The bulk of web content is in the form of text. Text can often be analyzed by searching for messages containing particular words or phrases of interest. More sophisticated analyses of text require natural language processing, described in Section 4.1.1, which is a computational approach to automating the linguistic analysis of text. Most social monitoring uses text, and this book will focus on text.
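As an illustration of the simplest form of text analysis mentioned above, the following sketch filters messages by keywords. It is a minimal example under stated assumptions, not a recommended pipeline: the phrase list and the sample messages are hypothetical, and matching is done on whole words, ignoring case.

```python
import re


def build_matcher(phrases):
    """Compile a case-insensitive pattern matching any phrase as a whole word."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def filter_messages(messages, phrases):
    """Return the messages that contain at least one phrase of interest."""
    matcher = build_matcher(phrases)
    return [m for m in messages if matcher.search(m)]


# Hypothetical messages; real data would come from a platform's API or export.
messages = [
    "Home sick with the flu today",
    "Got my flu shot this morning",
    "The chimney flue needs cleaning",
    "Influenza season is here",
]
matched = filter_messages(messages, ["flu", "influenza"])

for m in matched:
    print(m)
```

The word boundaries exclude "flue", but the filter still returns both the message about being sick and the message about vaccination. Keyword matching alone cannot tell these apart, which is one reason more sophisticated analyses rely on natural language processing.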
Other content may come in the form of images (such as through Instagram) and video (such as through YouTube), which are often also accompanied by text in the form of captions, descriptions, and user comments. Images and video can be automatically analyzed and categorized using computer vision, a computational approach to analyzing imagery. For example, Garimella et al. [2016] found that automatically extracted tags of Instagram images can be useful for some health applications, like detecting excessive drinking. However, these types of tools are limited, so most research using this type of media has relied on manual analysis by people.
3.4.2 METADATA
Metadata, such as the time and location of messages, are crucial for social media analysis, in order to understand variation in populations.
Time
Almost all data on the Web is timestamped, and this information is typically trivial to collect. Often individual messages will come with timestamps, typically at the granularity of seconds. For some types of data, individual messages are unavailable, and only aggregate information over an interval of time, such as a day or month, is available. This is the case with services like Google Trends, which do not share individual search queries, but will provide the number of queries issued within various time intervals.
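The relationship between individual timestamps and aggregate counts can be sketched as follows. This is a minimal illustration, assuming timestamps arrive as ISO 8601 strings at the granularity of seconds; the sample values are hypothetical, and real platforms may use other formats and time zones.

```python
from collections import Counter
from datetime import datetime


def daily_counts(timestamps):
    """Aggregate second-level ISO 8601 timestamps into per-day message counts."""
    days = [datetime.fromisoformat(ts).date() for ts in timestamps]
    # Sort by date so the result reads as a time series.
    return dict(sorted(Counter(days).items()))


# Hypothetical message timestamps.
timestamps = [
    "2015-01-05T08:15:30",
    "2015-01-05T21:02:11",
    "2015-01-06T09:45:00",
]
counts = daily_counts(timestamps)

for day, n in counts.items():
    print(day.isoformat(), n)
```

When only aggregate data is available, the researcher receives something like the output of this function directly and cannot recover the individual messages behind each count.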
Location
The location of a message (that is, the location of the author who wrote it) is often more difficult to obtain than time information, yet it is often critical for health applications [Burton et al., 2012b]. Sometimes location information is provided by the social media platform. For example, Twitter allows users to provide detailed location information in the form of latitude and longitude coordinates, which are sometimes available when users post from a GPS-enabled device. Additionally, users can tag a location in their tweet, such as a city, neighborhood, or specific point of interest. Unfortunately, this type of location data is rare; only a small percentage of tweets contain coordinates. For example, roughly 1–3%