Appendix A provides a partial survey of online sources of textual data, which is the raw material of your research project. Appendices B through G provide, as it were, a survey of the practical tools that are available for house construction, from hand tools to heavy-duty machinery. While setting the foundation, designing the house, and choosing a construction method, it is a good idea to be aware of the types of practical tools that are available and within budget so that your project can reach a successful conclusion. Appendices H and I, as well as the Glossary, provide handy summaries of web resources, statistical tools, and key terms.
Additional resources for instructors using An Introduction to Text Mining are also provided. Editable, chapter-specific Microsoft® PowerPoint® slides, as well as assignments and activities created by the authors, are available for download at http://study.sagepub.com/introtextmining.
Note to the Reader
An Introduction to Text Mining grew out of our earlier SAGE methods guidebook Text Mining, which is a shorter volume intended to serve as a practical guidebook for graduate students and professional researchers. The two books share both a core mission and structure. Their mission is to enable readers to make better informed decisions about research projects that use text mining and text analysis methodologies. And they both survey text mining tools developed in multiple disciplines within the social sciences, humanities, and computer science.
Where Text Mining was intended for advanced students and researchers, the current volume is a dedicated undergraduate or first-year graduate textbook intended for use in social science and data science courses. This book is thus longer than Text Mining, as it includes new material related to ethical and epistemological considerations in text-based research. There is a new chapter on how to write text-based social science research papers. And there are appendices that list and review data sources and software for preparing, cleaning, organizing, analyzing, and visualizing patterns in texts. Although these appendices were intended for students in undergraduate courses, we suspect that they will prove valuable for experienced researchers as well.
GI and RM
About the Authors
Gabe Ignatow is an associate professor of sociology at the University of North Texas (UNT), where he has taught since 2007. His research interests are in the areas of sociological theory, text mining and analysis methods, new media, and information policy. Gabe’s current research involves working with computer scientists and statisticians to adapt text mining and topic modeling techniques for social science applications. Gabe has been working with mixed methods of text analysis since the 1990s and has published this work in the following journals: Social Forces, Sociological Forum, Poetics, the Journal for the Theory of Social Behaviour, and the Journal of Computer-Mediated Communication. He is the author of over 30 peer-reviewed articles and book chapters and serves on the editorial boards of the journals Sociological Forum, the Journal for the Theory of Social Behaviour, and Studies in Media and Communication. He has served as the UNT Department of Sociology’s graduate program codirector and undergraduate program director and has been selected as a faculty fellow at the Center for Cultural Sociology at Yale University. He is also a cofounder and the CEO of GradTrek, a graduate degree search engine company.
Rada Mihalcea is a professor of computer science and engineering at the University of Michigan. Her research interests are in computational linguistics, with a focus on lexical semantics, multilingual natural language processing, and computational social sciences. She serves or has served on the editorial boards of the following journals: Computational Linguistics, Language Resources and Evaluation, Natural Language Engineering, Research on Language and Computation, IEEE Transactions on Affective Computing, and Transactions of the Association for Computational Linguistics. She was a general chair for the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL, 2015) and a program cochair for the Conference of the Association for Computational Linguistics (2011) and the Conference on Empirical Methods in Natural Language Processing (2009). She is the recipient of a National Science Foundation CAREER award (2008) and a Presidential Early Career Award for Scientists and Engineers (2009). In 2013, she was made an honorary citizen of her hometown of Cluj-Napoca, Romania.
1 Text Mining and Text Analysis
Learning Objectives
The goals of Chapter 1 are to help you to do the following:
1 Familiarize yourself with a variety of research projects accomplished using text mining tools.
2 Address different research questions using text mining tools.
3 Differentiate between text mining and text analysis methodologies.
4 Compare major theoretical and methodological approaches to both text mining and text analysis.
Introduction
Text mining is an exciting field that encompasses new research methods and software tools that are being used across academia as well as by companies and government agencies. Researchers today are using text mining tools in ambitious projects to attempt to predict everything from the direction of stock markets (Bollen, Mao, & Zeng, 2011) to the occurrence of political protests (Kallus, 2014). Text mining is also commonly used in marketing research and many other business applications as well as in government and defense work.
Over the past few years, text mining has started to catch on in the social sciences, in academic disciplines as diverse as anthropology (Acerbi, Lampos, Garnett, & Bentley, 2013; Marwick, 2013), communications (Lazard, Scheinfeld, Bernhardt, Wilcox, & Suran, 2015), economics (Levenberg, Pulman, Moilanen, Simpson, & Roberts, 2014), education (Evison, 2013), political science (Eshbaugh-Soha, 2010; Grimmer & Stewart, 2013), psychology (Colley & Neal, 2012; Schmitt, 2005), and sociology (Bail, 2012; Heritage & Raymond, 2005; Mische, 2014). Before social scientists began to adapt text mining tools to use in their research, they spent decades studying transcribed interviews, newspaper articles, speeches, and other forms of textual data, and they developed sophisticated text analysis methods that we review in the chapters in Part IV. So while text mining is a relatively new interdisciplinary field based in computer science, text analysis methods have a long history in the social sciences (see Roberts, 1997).
Text mining processes typically include information retrieval (methods for acquiring texts) and applications of advanced statistical methods and natural language processing (NLP) such as part-of-speech tagging and syntactic parsing. Text mining also often involves named entity recognition (NER), which is the use of statistical techniques to identify named text features such as people, organizations, and place names; disambiguation, which is the use of contextual clues to decide which of their multiple meanings words refer to; and sentiment analysis, which involves discerning subjective material and extracting attitudinal information such as sentiment, opinion, mood, and emotion. These techniques are covered in Parts III and V of this book. Text mining also involves more basic techniques for acquiring and processing data. These techniques include tools for web scraping and web crawling, for making use of dictionaries and other lexical resources, and for processing texts and relating words to texts. These techniques are covered in Parts II and III.
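To make two of the more basic techniques concrete, the following minimal Python sketch shows one way of relating words to a text (counting word frequencies) and of using a dictionary for a very crude form of sentiment analysis. It is an illustration only, not the method of any particular tool discussed in this book. The lexicon TOY_LEXICON, the function names, and the sample sentence are all invented for this example; real projects would rely on a published lexical resource and a more careful tokenizer.

```python
import re
from collections import Counter

# Hypothetical toy lexicon invented for this illustration:
# +1 for positive words, -1 for negative words.
TOY_LEXICON = {
    "good": 1, "great": 1, "happy": 1,
    "bad": -1, "terrible": -1, "sad": -1,
}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(text):
    """Relate words to a text by counting how often each token occurs."""
    return Counter(tokenize(text))


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon values of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


# Invented sample sentence, used only to exercise the functions.
sample = "The service was great, but the food was terrible and the room was bad."
print(word_counts(sample).most_common(3))
print(sentiment_score(sample))
```

A word-counting approach of this kind ignores negation, irony, and context, which is one reason the statistical and NLP techniques described above are needed for serious sentiment analysis.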
Research