Group of authors

Innovations in Digital Research Methods


that data is now something we are becoming immersed and embedded in. We are generators of, but are also generated in, the data environment. Our behaviour is increasingly documented and collated. Instead of people being researched, they are the research. Hence, we use the term the age of data to capture the historical phase that large parts of society have now entered, and we use the term data environment (see Elliot et al. 2008 and 2010 for discussion of the term) to capture the reality of the new relationship between people and what is known about them. This can include a focus not only on explaining why something might have happened, but also on what is currently happening and is going to happen.

      If they are going to be used effectively for research, the new data types and large-scale datasets require new approaches to analysis and new skills for social scientists. After all, social science should be capable of producing testable hypotheses using robust research designs and data quality assurance measures even where new types of data are being used. Such data also has its limitations and is not always accessible for social science research use. Moreover, big data does not mean we all have access to the data or that we know everything. There is still a need for purpose-specific data and for approaches based on testing theories.

      In this chapter we consider some examples of the new types of social data, including their formats, content, meanings, and the changing relationship between people’s digital and non-digital identities. We use real-world examples to explore how social science might utilize new types of data to understand social phenomena in new ways and from new perspectives. As well as the data itself, we consider access modalities and processes. It is clear that what is happening in the data environment will change not just how we do social science research but who does it, where it is done and, indeed, what research means. However, as a recent consultation (Elliot et al., 2013: 4) on the use of digital data by social scientists highlighted, some concerns have been raised:

      There is more data for social research but can people use it, under what conditions and do they know how to? (Social scientist, stakeholder interview, 2012)

      There is a growth of under-theorised empiricism in social science … uncritical use of data with limitations in coverage or definitions and the steering of research to things that happened to be measured. (Social scientist, survey respondent, 2012)

      2.1.2 What is Data?

      Data is information or knowledge about an individual, object or event. Data can comprise numerical values, quantities of text, sounds or images, memories or perceptions. Often the concept of data suggests information that has a structure and which has been through some kind of processing.

      Many examples of new types of data have very different and sometimes unstructured formats, for example, tweets or documents released under a Freedom of Information (FOI) request. In order to develop our understanding of the changing data environment, we outline below a typology of different data types. This typology is based on the idea of data as knowledge, and on the idea that each data item carries with it implicit or explicit metadata, that is, data about the data item, such as its origin, ownership, terms of use and coverage. There are a variety of ways to consider the nature of data, but here we combine the key issues into a single framework. We draw on work by Elliot et al. (2010) on behalf of the Office for National Statistics (ONS) in the UK, which examined the nature of public data, comparing information that is formally in the public domain, such as public administrative records (e.g., the Electoral Register, shareholdings and professional occupation lists), with data that is informally in the public domain, such as that posted on the Internet (e.g., via Facebook and blogs). For a related discussion, see Mayer-Schönberger and Cukier (2013: 73) on what they term datafication, which refers to the process of recording and quantifying behaviour and events for analysis.

      We develop our approach here to focus on what can be termed the ‘metadata of origin’, rather than the actual type of data or whether the data is qualitative or quantitative. The issue of origin is interdependent with issues of data ownership, quality, access and use. A key aspect of this is the law and codes of practice around the recognition of what is ‘personal’ data. Under the UK Statistics and Registration Service Act (2007) (SRSA), personal information is defined as ‘information which relates to and identifies a particular person (including a body corporate)’. Information identifies a particular person if the identity of that person – ‘(a) is specified in the information, (b) can be deduced from the information, or (c) can be deduced from the information taken together with any other published information’.5 The disclosure of personal information by public bodies, such as the ONS, is a criminal offence. For further information see the UK Anonymization Network6 and also a recent report by the Information Commissioner (ICO, 2012).

      In terms of the metadata of origin approach, we propose an eight-point typology based on the type of generation process involved. Given the complexity and changing nature of the data environment, it can be argued that mapping the data generation process is the only stable way of understanding the variety of data and of developing good practice around the use of different data types.

      2.1.3 Data Origin Typology

      1 Orthodox intentional data: Data collected and used with the respondent’s explicit agreement. All so-called orthodox social science data (e.g. survey, focus group or interview data and also data collected via observation) would come into this category. New orthodox methods continue to be developed.

      2 Participative intentional data: In this category data are collected through some interactive process. This includes some new data forms such as crowdsourced data (e.g. the Everyday Sexism project; see http://everydaysexism.com) and is a potential growth area.

      3 Consequential data: Information that is collected as a necessary part of a transaction and is secondary to some (other) interaction (e.g. administrative records, electronic health records, commercial transaction data and data from online game playing).

      4 Self-published data: Data deliberately self-recorded and published that can potentially be used for social science research either with or without explicit permission, given the information has been made public (e.g. long-form blogs, CVs and profiles).

      5 Social media data: Data generated through some public, social process that can potentially be used for social science research either with or without permission (e.g. social networking and micro-blogging platforms such as Facebook and Twitter, and, perhaps, online game data).

      6 Data traces: Data that is ‘left’ (possibly unknowingly) through digital encounters, such as online search histories and purchasing, which can be used for social science research either by default use agreements or with explicit permission.

      7 Found data: Data that is available in the public domain, such as observations of public spaces; its collection can include covert research methods.

      8 Synthetic data: Where data has been simulated, imputed or synthesized. This can be derived from, or combined with, other data types.
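      For readers who want to tag datasets by origin in practice, the typology can be sketched as a small data structure. This is only an illustrative sketch, not part of the original framework: the names DataOrigin, OriginMetadata and group_by_origin are hypothetical, the metadata fields simply mirror the examples of metadata mentioned earlier (origin, ownership, terms of use and coverage), and the sample records are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DataOrigin(Enum):
    """The eight origin types of the typology (identifier names are illustrative)."""
    ORTHODOX_INTENTIONAL = 1
    PARTICIPATIVE_INTENTIONAL = 2
    CONSEQUENTIAL = 3
    SELF_PUBLISHED = 4
    SOCIAL_MEDIA = 5
    DATA_TRACES = 6
    FOUND = 7
    SYNTHETIC = 8


@dataclass(frozen=True)
class OriginMetadata:
    """Metadata of origin for one data item (hypothetical field names)."""
    origin: DataOrigin
    ownership: str
    terms_of_use: str
    coverage: str


def group_by_origin(items):
    """Group metadata records by their origin type."""
    groups = {}
    for item in items:
        groups.setdefault(item.origin, []).append(item)
    return groups


# Invented example records, for illustration only.
records = [
    OriginMetadata(DataOrigin.ORTHODOX_INTENTIONAL, "research team",
                   "informed consent", "survey sample"),
    OriginMetadata(DataOrigin.SOCIAL_MEDIA, "platform and users",
                   "platform terms of service", "public posts"),
    OriginMetadata(DataOrigin.SOCIAL_MEDIA, "platform and users",
                   "platform terms of service", "public replies"),
]
grouped = group_by_origin(records)
```

      Keeping the origin as an explicit field, rather than inferring it from the data format, reflects the argument that the generation process is the more stable basis for classification. A single dataset may overlap several origin types, which this simple one-origin-per-record sketch does not capture.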

      We utilize this typology further in our discussions below, including the possible overlaps between the data origin types, and how the different types may be used, but we first focus in more detail on the changing nature of the data environment and social science research.

      2.2 The Social Science Data Present

      2.2.1 The Data Landscape

      It has been clear since the 1980s that the half century either side of the millennium would be characterized by an information revolution (Purdam et al., 2004; Sweeney, 2001). One key aspect of this is the massive increase not just in the amount of data but also in the types of data sources available and in the range of organizations and individuals collecting, storing and using data. For example, it is estimated that in 2014 there were 1.3 billion active Facebook accounts, 0.6 billion active Twitter accounts and 58 million tweets per day (Datablog, 2014).

      The