form from the outset. The digital universe – the data we create and copy annually – is estimated to be doubling in size every two years and projected to reach 44 trillion gigabytes by 2020 (where a trillion is a million million, or 10¹²) (IDC, 2014). For social scientists, the prediction that more data will be generated in the next five years than in the entire history of human endeavour is both an opportunity and a challenge.
Today, vast amounts of data are generated as people go about their daily activities, both data that is deliberately produced and that which is generated by embedded systems. For example, use of public services is captured in administrative records; in the private sector, patterns of consumption of goods and services are captured in credit and debit card records; patterns of personal communications are captured in telephone records; patterns of movement are logged by sensors, such as traffic cameras, satellites and mobile phones; the movement of goods is increasingly tracked by devices such as radio-frequency identification (RFID) tags; and the advent of the ‘Social Web’ has led to an explosion of citizen-generated content in blogs and on social networking sites.
Currently, these data sources are barely exploited for social research purposes. The potential benefits to researchers are enormous, offering opportunities to mount multidisciplinary investigations into major social and scientific issues on a hitherto unrealizable scale by marshalling artificially produced and naturally occurring ‘big data’ of multiple kinds from multiple sources. However, exploiting these digital data sources to their full research potential requires new mechanisms for ensuring secure and confidential access to sensitive data, and new analysis tools for mining, integrating, structuring and visualizing data from multiple sources.
1.1.2 e-Infrastructure
Since the beginning of the new millennium, a world-wide effort has been underway to create the research infrastructure and to develop the research methods that will be needed if the ‘data deluge’ is to be harnessed effectively for research. A new generation of distributed digital technologies is leading to the development of interoperable, scalable computational tools and services that increasingly make it possible for researchers to locate, access, share, aggregate, manipulate and visualize digital data seamlessly across the Internet on a scale that was unthinkable only a decade or so ago.
e-Infrastructure comprises the information and communication technologies (ICTs) – the networked computing hardware and software – and the digital data that are deployed to support research. A very broad definition has been adopted by Research Councils UK (2014), which spells out more fully the components that are brought together:
e-Infrastructure refers to a combination and interworking of digitally-based technology (hardware and software), resources (data, services, digital libraries), communications (protocols, access rights and networks), and the people and organisational structures needed to support modern, internationally leading collaborative research be it in the arts and humanities or the sciences.
This definition highlights the complexity of e-Infrastructure and, correspondingly, the scale of the socio-technical effort required to integrate distributed computers, data, people and organizations efficiently in order to deliver tools and services that scientists can readily adopt to their advantage in pursuing their research. (In the US, the term cyberinfrastructure is more commonly used than e-Infrastructure.)
e-Research is the generic term that has been coined for the innovations in research methods that are emerging to take advantage of this new and vastly more powerful e-Infrastructure. Similarly, e-Social Science is the research facilitated by the e-Infrastructure. The ‘e’ in all these terms is short for ‘electronic’, although it is sometimes rendered as ‘enhanced’.
The scope of the book is the application of e-Research methods across the social sciences, including both quantitative and qualitative data collection and analysis. The aim is to introduce the reader to the application of innovative digital research methods throughout the research lifecycle, from resource discovery, through the collection, manipulation and analysis of data, to the presentation and publication of results.
1.2 Background
1.2.1 e-Science
Over the period 2001 to 2006, the UK Government invested £213m in an e-Science programme (Hey and Trefethen, 2004). The overall aim of the programme was to invent and apply computer-enabled methods to ‘facilitate distributed global collaborations over the Internet, and the sharing of very large data collections, terascale computing resources and high performance visualizations’.¹ The funding was divided between a ‘core programme’, focused on developing the generic technologies needed to integrate different resources seamlessly across computer networks, and individual Research Council programmes specific to the disciplines they support. The Economic and Social Research Council (ESRC) allocation was £13.6m over the five years, with the major part of this investment devoted to setting up the National Centre for e-Social Science (NCeSS). The Centre had a distributed structure, with a coordinating Hub responsible for designing and managing the programme and eleven large three-year projects devoted to developing innovative tools and services and applying them in substantive fields of inquiry.
The ambition of the overall e-Science programme was to promote the adoption of innovations in digital infrastructure to facilitate bigger and faster science, with collaborators worldwide addressing major research questions in new ways. The initial technical focus was grid computing, driven by a set of ‘middleware’ standards. These are the shared protocols required for the development of sophisticated software to enable large numbers of distributed and heterogeneous computer systems to be linked and inter-operate, thereby providing researchers with seamless, on-demand access to scalable processing power to handle very large-scale datasets, regardless of the location of the researchers or the data. This model of e-Infrastructure was particularly appropriate to particle physics and such challenges as weather prediction and earthquake modelling. Advances in these areas are dependent on collecting and marshalling data on a vast scale and having huge computing resources to analyse it, accessible by large networks of research teams distributed across the world.
However, the grid computing blueprint for e-Infrastructure proved slow to mature and sometimes difficult to deploy in practice, and it did not always offer the most appropriate solutions to scientists’ requirements. Meanwhile, other technologies emerged and alternative solutions to the demand for scalable computing and data storage, such as cloud computing, became available. Alongside this was the flowering of the lightweight systems that are loosely collected together under the title of Web 2.0 (O’Reilly, 2005). While these are technically less powerful than grid-based systems, their relative simplicity – both in terms of implementation effort and ease of use – made them attractive to researchers who did not need sophisticated tools and services, and who were deterred from using grid services by their complexity and the perceived barriers to access. Moreover, many of these Web 2.0 tools and services are freely available on the Internet, and users can find help in adopting them in numerous online forums and support groups. They have been widely taken up because of their ability to deliver easy-to-use services via simple protocols and familiar Web-based user interfaces, and they provide flexible solutions to at least some researchers’ needs for advanced computing tools and services. Accordingly, across the sciences the notion of grid computing being at the core of e-Science gradually gave way to a wider understanding of e-Infrastructure, embracing a broad range of computing software and services that support the everyday work of scientists.
1.2.2 e-Social Science
From the start of the e-Science programme, the ambitions of grid computing were less well matched to those disciplines subsequently encouraged to join the e-Science bandwagon, including the social sciences, where a mixture of numerous quantitative and qualitative methods is used to pursue relatively small-scale issues. These disciplines have very few generic problems requiring complex middleware to coordinate huge distributed computing and data resources. What requirements they do have were already – before the e-Science programme was initiated – well-served by established commercial and open-source packages to,