in anything from social science to biology, who works with large amounts of data and must grapple with the computational problems posed by the structure, size, messiness, complexity, and nature of the data, while simultaneously solving a real-world problem.
Social scientists should ideally play an important role in data science, as many of the problems that data science works with – friending, connections, linking, sharing, talking – are ‘social science-y problems’ (Schutt and O’Neil, 2013, p. 9). As put by new media theorist Lev Manovich (2012, p. 461):
The emergence of social media in the middle of the 2000s created opportunities to study social and cultural processes and dynamics in new ways. For the first time, we can follow imaginations, opinions, ideas, and feelings of hundreds of millions of people. We can see the images and the videos they create and comment on, monitor the conversations they are engaged in, read their blog posts and tweets, navigate their maps, listen to their track lists, and follow their trajectories in physical space. And we don’t need to ask their permission to do this, since they themselves encourage us to do so by making all of this data public.
But even if we may sometimes have actual, real-life, well-motivated questions to pose to the data, data science notoriously runs the risk of becoming too data-driven. Indeed, data science is sometimes referred to as ‘data-driven science’, since its main aim is to extract knowledge from data. It is mostly not about testing hypotheses or theories in the traditional scholarly way. Instead, the work that is done with the data is driven by the data itself – in terms of the possibilities for gathering it, and the available tools for probing it.
A related concept is data mining. As the word ‘mining’ hints, this approach is about working to discover interesting patterns in large amounts of data, for example from the internet and social media. This approach marks a break with the established view of the research process – at least within the more objectivist types of science – where a problem or research question is formulated beforehand. This problem, formulated following a particular need for a certain type of knowledge about a specific issue, then guides the researcher in sampling data, devising the research methods, and choosing the theoretical perspectives – or even in formulating strict hypotheses to verify or falsify. Such a process is by no means a given when it comes to data science, which makes no secret of often being highly exploratory and going fishing with a very wide net. In many cases a so-called data piñata approach is employed. As defined by the online resource Urban Dictionary:
data piñata: Big Data method that consists of whacking data with a stick and hopefully some insights will come out. [Example:] The Big Data Scientist made a Twitter data piñata and found that Saturdays are the weekdays with the most tweets linking to kitty pictures.
(Urban Dictionary, 2018)
Such strategies may be seen by some as unscientific, as they do not rely on actual questions about real problems, but on patterns that one stumbles across more or less randomly. Indeed, in the type of research that deals with solicited data, intentionally collected for certain research purposes, a data piñata approach would be odd. Why should we collect some random data, just to beat it with a stick to see what pops out? And, what type of data should that be? What methods or informants should be engaged, and how? In the case of register-based or database research, a piñata strategy might be closer at hand. And this is most definitely true in the case of the types of data that are enabled by people’s use of the internet and social media.
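Purely as an illustration of what such piñata-whacking can look like in practice, the short Python sketch below counts tweets per weekday from a batch of timestamps. The file name tweets.csv and its created_at column are hypothetical stand-ins, not a reference to any actual dataset or tool used in the research discussed here.

# A minimal sketch of a 'data piñata' exploration: take a pile of tweet
# timestamps and simply count how they fall across the days of the week.
import pandas as pd

# Hypothetical input: one row per tweet, with a timestamp column 'created_at'.
tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])

# No hypothesis in advance: just tally the tweets per weekday and look.
per_weekday = (
    tweets["created_at"]
    .dt.day_name()
    .value_counts()
    .reindex(["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"], fill_value=0)
)

print(per_weekday)

The point of the sketch is not the particular count but the posture: nothing is hypothesised in advance, and whatever distribution appears becomes the prompt for further questions.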
Census and survey researcher Kingsley Purdam and his data scientist colleague Mark Elliot aptly point out that data today is, to a lesser and lesser degree, ‘something we have’; rather, ‘the reality and scale of the data transformation is that data is now something we are becoming immersed and embedded in’ (Purdam and Elliot, 2015, p. 26). Their notion of a data environment underlines that people today are at once generators of, and generated by, this new environment. ‘Instead of people being researched’, Purdam and Elliot (2015, p. 26) write, ‘they are the research’. Their point is that new data types have emerged – and are constantly emerging – that demand new, flexible approaches. Doing digital social research, therefore, often entails discovering and experimenting with the challenges and possibilities of ever-new types and combinations of information. Among these are not only social media data, but also data traces that are left, often unknowingly, through digital encounters. Manovich gives an explanation that is so to the point that it is worth citing at length:
In the twentieth century, the study of the social and the cultural relied on two types of data: ‘surface data’ about lots of people and ‘deep data’ about the few individuals or small groups. The first approach was used in all disciplines that adapted quantitative methods. The relevant fields include quantitative schools of sociology, economics, political science, communication studies, and marketing research. The second approach was used in humanities fields such as literary studies, art history, film studies, and history. It was also used in qualitative schools in psychology, sociology, anthropology, and ethnography. […] In between these two methodologies of surface data and deep data were statistics and the concept of sampling. By carefully choosing her sample, a researcher could expand certain types of data about the few into the knowledge about the many. […] The rise of social media, along with new computational tools that can process massive amounts of data, makes possible a fundamentally new approach to the study of human beings and society. We no longer have to choose between data size and data depth.
(Manovich, 2012, pp. 461–3)
Going back to 1978 and Glaser’s book Theoretical Sensitivity, we can find some useful pointers on how to see the research process – beyond ‘quantitative’ and ‘qualitative’. The first step, for Glaser (1978, p. 3), is ‘to enter the research setting with as few predetermined ideas as possible’, to ‘remain open to what is actually happening’. The goal is then to alternate between having an open mind – working inductively, allowing an understanding of the research object to emerge gradually – and testing the emerging ideas as one goes along – working deductively, trying to verify or falsify the developing interpretations. So we can, quite mindlessly, beat the piñata for a little while to see what jumps out, then try to make sense of what emerged, and then beat some more to see what the new material adds to, or removes from, our present analysis.
Using Glaser’s approach, then, means being truly data-driven. He argues that the overarching question that must continually be posed in any research is: ‘What is this data a study of?’ (Glaser, 1978, p. 57). Most of the time, research projects start off with a clear idea of what to study. It would not make sense to be completely oblivious to the aims of one’s work. But still, Glaser argues, constantly repeating and renewing the question of what the data is actually about allows other ideas or findings either to take their place alongside the initially intended ones, or even to replace them completely. The point of the question is that it ‘continually reminds the researcher that his original intents on what he thought he was going to study just might not be; and in our experience it usually is not’ (Glaser, 1978, p. 57). The other important question for empirical research is: ‘What is actually happening in the data?’ The flexibility and inductiveness of the approach aim to get at ‘what is actually going on’ in the area that is studied (Glaser and Strauss, 1967, p. 239).
My point here is that being data-driven, as is often the case when working with big data, is not (only) a new ill caused by the datafication of society and the fascination with huge datasets. Used in the right way, a data-driven approach – a data piñata – can be truly useful for getting to know more about what goes on, and what social and cultural processes may be at work, in contexts and behaviours that are still largely unknown to us. From that perspective, not really knowing what we are looking for, and why, can be a means to tread new ground, veering off the well-trodden paths, getting lost in order to find our way. If we don’t even know what is going on, maybe beating that piñata with a stick isn’t such a bad idea? The new data science opportunities and tools, in combination with social theory, have huge potential to help decode the deeper meanings of society and sociality today.
Breaking things to move forward
Finding