clinical visit, parents come to believe in the dangers of vaccines outside of doctors’ offices, and the indicators that may suggest suicide are likely not being recorded by a health professional. Where can we find big data to answer these and many other public health questions? What digital records can be analyzed to support research on these topics?
Perhaps surprisingly, we already have a large source of patient information outside of the doctor’s office: user-generated content from the Web. This type of data includes, but is not limited to, blogs and microblogs, forum discussions, online reviews of products and services, and queries issued to search engines. But how does social media tell us anything about health? How can any of these online activities be used to answer important public health questions?
That is the topic of this book: how can large quantities of (often freely and publicly accessible) social media data inform public health? Public health—the area of medicine focused on the health of a population as a whole—depends on people’s behaviors: what people do in their everyday lives. Public health topics are often more about what happens outside than inside of a doctor’s office. Social media chronicles the lives of a population, recording their beliefs, attitudes, and behaviors on a wide variety of topics. Since health is an important part of people’s lives, social media reflects these health topics. By analyzing social media we can gain new insights into public health.
Who is this Book for?
Analyzing social media for public health requires two broad areas of expertise: computer science and public health. We hope that academics, researchers, and practitioners from both areas will find value in this book. Maybe you’re a data scientist who knows machine learning or natural language processing and wants to learn how to apply it to public health, or a health informaticist who wants to learn more about harnessing social media as an alternative data source, or a public health researcher who wants to learn about how new technologies offer new research possibilities. If so, you’re the intended audience for this book.
For computer scientists, we expect that Chapter 2 will provide a summary of the core principles of public health, and Chapter 5 will survey the areas of public health most suited for work in social monitoring. For public health experts, we hope that Chapters 3 and 4 will summarize the major types of social media data and relevant analytics. All readers should benefit from Chapter 6, which describes limitations and concerns of this type of research. Of course, we encourage you to read the entire book and share in our amazement over what has been achieved so far, and what new research may yield.
We expect that you’re coming into this field with one set of training and expertise, either on the computational side or public health side, and want to start learning more about the other area. This book is aimed at people in this stage, who want to know a little bit about the other side and how it can intersect with their own background. What this book will not do is make you an expert in a new area—this field is too broad and diverse to cover everything comprehensively in one book. For instance, this book won’t teach you enough to go off and build a machine learning system if you don’t already have that expertise—but it will introduce you to the common types of tools that are available and how they are used in social monitoring, which in turn will inform you about solutions available for your problems. And while this book can’t possibly do justice to decades of public health research in so many areas, it will at least make you aware of the major areas of public health, why they are important, and how social media can help. The goal is to equip you with enough knowledge to start thinking and having conversations about how you can benefit from, or contribute to, this rapidly growing field.
Why a Book? Why Now?
The field of social monitoring for public health is quite new, with the earliest foundational papers barely ten years old. In fact, many of the data sources we discuss in this book haven’t even been around for that long. So why write a book now? While research in this area is fast paced, with new avenues of research yet unexplored, clear patterns have emerged to form a recognizable research landscape. We have some idea of what works and what doesn’t: which characteristics of public health questions are best suited for social media analysis, and which computational tools are most suited for answering these questions. Our goal is to provide a firm footing on which new researchers, as well as experienced experts, can base new research projects that build on what we’ve learned so far. We cannot possibly foresee all of the exciting new advances in this field, but we hope this book provides a basis on which these advances can start.
Another goal of this book is to promote rigor when working with social data. Methods for careful study design and validation that are common in traditional public health research have sometimes been ignored in research using social media, especially in earlier work, in part due to disciplinary differences in methodologies and a lack of community norms and expectations about how this kind of research should be done. The entire field came under scrutiny after a widely publicized study reported that Google Flu Trends, a popular digital flu monitoring system that we discuss throughout this book, had started performing inaccurately and had severely misfired in a recent year [Lazer et al., 2014b]. Researchers have made a lot of progress in addressing the limitations of social data, but unresolved concerns remain about the reliability, validation, and ethics of this kind of research. We raise these issues in this book, particularly in Chapter 6, and we hope our discussion will encourage more thoughtful work in this area.
The Scope of this Book
This book focuses on public health surveillance applications: tasks in which we can learn about public health topics by passively analyzing existing social media data. We term this social monitoring, a term that is inclusive of a wide range of online data sources, from new social media platforms, to more traditional web forums, and to search engine queries.
There is a growing and promising area of research that examines how social media and electronic interventions can change health behaviors and improve health outcomes. However, while interventions are related in spirit, their tools, topics, and approaches differ significantly from those of public health surveillance and social monitoring. This book focuses on the latter to allow a more comprehensive presentation.
1 http://www.cdc.gov/mmwr/publications/index.html
2 http://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
3 http://www.cdc.gov/nchs/fastats/physician-visits.htm
4 http://www.cdc.gov/nchs/fastats/electronic-medical-records.htm
CHAPTER 2
Public Health: A Primer
Sandra is a college student. One morning during the fall semester, Sandra wakes up with a fever, cough, and headache. She feels sick enough that she decides to go to her campus’s student health services. At the clinic, a doctor diagnoses Sandra with influenza—the seasonal flu. For a young healthy person with no complications, the treatment is easy enough: drink plenty of fluids, stay in bed, and take ibuprofen or acetaminophen to help with the fever. After a few days, she will hopefully feel better and return to class.
Was it inevitable that Sandra would contract the flu, or could she have done something to prevent it? For many people, flu is a preventable disease. The seasonal flu shot offers remarkable protection against contracting influenza. The exact rates of immunity vary year to year, but on the whole, the flu shot is the single most effective step one can take to prevent a disease that infects tens of millions of Americans, hospitalizes hundreds of thousands, and kills thousands each year.1 Many universities organize