Simon Lindgren

Data Theory



is a consequence of what can be called the datafication of social life. This is what happens when ‘we have massive amounts of data about many aspects of our lives, and, simultaneously, an abundance of inexpensive computing power’ (Schutt and O’Neil, 2013, p. 4). Beyond the internet and social media, too, data have had a growing influence on most industries and sectors. There has been huge interest in, and many efforts towards, extracting new forms of insight and generating new kinds of value in a variety of settings. As explained on Wikipedia (2018), lately ‘the term “big data” tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set’. As underlined by internet researchers Kate Crawford and danah boyd, ‘big data’ is in fact a poorly chosen term. This is because its alleged power lies not mainly in its size, but in its capacity to compare, connect, aggregate, and cross-reference many different types of datasets (that also often happen to be big). They define big data as:

      (boyd and Crawford, 2012, p. 664)

      From a critical sociological perspective, Lupton (2014, p. 101) argues that the hype that surrounds the new technological possibilities afforded by big data analytics contributes to the belief that such data are ‘raw materials’ for information – that they contain the untarnished truth about society and sociality. In reality, each step in the generation of big data relies on a number of human decisions relating to selection, judgement, interpretation, and action. Therefore, the data that we have at hand are always configured via beliefs, values, and choices that ‘“cook” the data from the very beginning so that they are never in a “raw” state’. So, there is no such thing as raw data, even though the orderliness of neatly harvested and stored big datasets can create an illusion to the contrary.

      Sociologist David Beer (2016, p. 149) argues that we now live in ‘a culture that is shaped and populated with numbers’, where trust and interest in anything that cannot be quantified diminishes. Furthermore, in the age of big data, there is an obsession with causation. As boyd and Crawford (2012, p. 665) argue, the mirage and mythology of big data demand that a number of critical questions be raised with regard to ‘what all this data means, who gets access to what data, how data analysis is employed, and to what ends’. There is a risk that the lure of big data will sideline other forms of analysis, and that alternative methods with which to analyse the beliefs, choices, expressions, and strategies of people are pushed aside by the sheer volume of numbers. ‘Bigger data are not always better data’, they write, and analysing them will not necessarily lead to insights about society that are more true than what can be achieved through other data and methods.

      We are no doubt in the midst of an ongoing data explosion, and along with it the development of ‘data science’. Data science is an interdisciplinary specialisation at the intersection of statistics and computer science, focusing on machine learning and other forms of algorithmic processing of large datasets to ‘liberate and create meaning from raw data’ rather than on hypothesis testing (Efron and Hastie, 2016, p. 451). Data science is a successor to the form of ‘data analysis’ proposed by the statistician John W. Tukey, whose analytical framework focused on ‘looking at data to see what it seems to say’, making partial descriptions and trying ‘to look beneath them for new insights’. In his exploratory vein, Tukey (1977, p. v) also emphasised that this type of analysis was concerned ‘with appearance, not with confirmation’. This focus on mathematical structure and algorithmic thinking, rather than on inferential statistical justification, is a precursor to the flourishing of data science in the wake of datafication.

      All the things that people do online in the context of social media generate vast volumes of sociologically interesting data. Such data have been approached in highly data-driven ways within the field of data science, where the aim is often to get a general picture of some particular social pattern or process. Being data-driven is not a bad thing, but there must always be a balance between data and theory – between information and its interpretation. This is where sociology and social theory come into the picture, as they offer a wide range of conceptual frameworks and theories that can aid in the analysis and understanding of the large amounts and many forms of social data that proliferate in today’s world.

      It is my argument that social research that relies heavily on the computational amassing and processing of data must also be theoretically sensitive. While purely computational methods are extremely helpful when wrangling the units of information, the meanings behind the messy social data generated in this age of datafication can be better untangled if we also make use of the rich interpretive toolkit provided by sociological theories and theorising. The data do not speak for themselves, even though some big data evangelists have claimed that to be the case (Anderson, 2008).

      Big data and data science are partly technological phenomena, concerned with using computing power and algorithms to collect and analyse comparatively large datasets of often unstructured information. But they are also, and most prominently, cultural and political phenomena that come along with the idea that huge unstructured datasets, often based on social media interactions and other digital traces left by people, can offer a higher form of truth when paired with methods like machine learning and natural language processing: a truth that can be computationally distilled rather than interpretively achieved.

      Pure data science tends to focus very strongly on what is simply researchable. It goes for the issues for which there are data, regardless of whether those issues have any real-life urgency. The last decade