case, however, data replacement would concern only some basic data; surveys will in any case remain useful for collecting data about specific topics and behaviors. For example, think of all the surveys with a focus on consumption. If big data assets provide insights on consumption via passive observation, primary research via surveys no longer has to collect this type of information, and it finally becomes possible to deliver on the vision of shorter surveys, rather than surveys merely providing data complementary to the desired information. Surveys can be short and focused on the variables they are ideally suited for, resulting in better data quality.
In editing and imputation (a minimal imputation sketch is given after this list).
In estimation (e.g., as auxiliary information in calibration estimation, benchmarking, or calendarization).
In comparing survey estimates with estimates from a related administrative program, as well as in other forms of survey evaluation.
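As a minimal illustration of the editing and imputation use mentioned above, the following Python sketch replaces missing or implausible survey values with the corresponding values from an administrative register. The data frames, column names (`unit_id`, `turnover`), and the edit rule are all hypothetical; they only show the basic mechanism, not a production procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical survey microdata: 'turnover' is partly missing or implausible.
survey = pd.DataFrame({
    "unit_id": [1, 2, 3, 4],
    "turnover": [120.0, np.nan, -5.0, 300.0],  # -5.0 fails the edit rule below
})

# Hypothetical administrative register with the same variable for all units.
register = pd.DataFrame({
    "unit_id": [1, 2, 3, 4],
    "turnover_admin": [118.0, 95.0, 40.0, 310.0],
})

merged = survey.merge(register, on="unit_id", how="left")

# Edit rule: a reported turnover must be non-negative and non-missing;
# otherwise impute the administrative value (a cold-deck style imputation).
needs_imputation = merged["turnover"].isna() | (merged["turnover"] < 0)
merged["turnover_edited"] = merged["turnover"].where(
    ~needs_imputation, merged["turnover_admin"]
)
print(merged[["unit_id", "turnover", "turnover_edited"]])
```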
When using a multisource approach in web surveys, several aspects should be considered.
The first is the heterogeneous nature of the sources with respect to the following basic characteristics: the aggregation level, the units, the variables, the coverage, the time dimension, the population, and the data type:
The aggregation level, i.e., some data sources consist of only microdata, others consist of a mix of microdata and aggregated data, whereas in other cases data sources consist of only aggregated data. In some cases, aggregated data are available besides microdata. There may still be overlap between the sources, from which arises the need to reconcile the statistics at some aggregated level. The reconciliation can be achieved by means of calibration, a standard approach in survey sampling. Of particular interest is the case in which the aggregated data are themselves estimates.
As regards the units, it has to be considered that sometimes there are no overlapping units in the data sources, while in other cases only some units overlap. Similarly, as regards the variables, there may be no overlapping variables in the data sources, or only some variables may overlap.
Coverage also has to be considered: a data source may or may not suffer from under-coverage of the target population. Some data sources are cross-sectional whereas others are longitudinal; thus, the researcher should pay attention to the type of data being integrated. The set of population units may be known from a population register, or the population list may be unknown; this affects the possibility of generating a probability-based sample. In some cases, a data source contains a complete enumeration of its target population; in others, it is selected by means of probability sampling from its target population, or by non-probability sampling from its population. This situation may be further split into two subcases depending on whether or not one of the data sources consists of sample data (where the sampling aspects play an important role in the estimation process). In the former case, specific methods should be used in the estimation process, for instance, taking the sampling weights into account and considering that sample data may include specific information that is not reported in the register.
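To make the role of the sampling weights concrete, the short Python sketch below contrasts a naive (unweighted) mean from a probability sample with a design-based, Horvitz-Thompson type estimate that uses the weights. The values and weights are invented for illustration only.

```python
import numpy as np

# Hypothetical probability sample: observed values and design weights,
# each weight being the inverse of the unit's inclusion probability.
y = np.array([10.0, 25.0, 40.0, 15.0])
weights = np.array([50.0, 50.0, 200.0, 200.0])  # units represent different numbers of population units

# A naive estimate ignores the design and treats every unit alike.
naive_mean = y.mean()

# Design-based (Horvitz-Thompson) estimation takes the weights into account.
ht_total = np.sum(weights * y)            # estimated population total
weighted_mean = ht_total / weights.sum()  # estimated population mean

print(f"naive mean: {naive_mean:.2f}, weighted mean: {weighted_mean:.2f}")
```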
Another aspect is the configuration of the sources to be integrated. There are a few basic configurations that are most commonly encountered; in practice, however, a given situation may well involve several of these basic configurations at the same time.
The first and most basic configuration of the integration process of different sources is multiple cross‐sectional data that together provide a complete data set with full coverage of the target population. Provided they are in an ideal error‐free state, the different data sets, or data sources, are complementary to each other and can be simply “added” to each other in order to produce output statistics.
A second type of configuration is when there exists overlap between the different data sources. The overlap can concern the units, the measured variables, or both.
A third situation is when, in addition, the combined data entail under-coverage of the target population, even when the data are in an ideal error-free state.
A further configuration is when both microdata and aggregated data are available. There is overlap between the sources, and the statistics need to be reconciled at some aggregated level. The reconciliation can be achieved by means of calibration, which is a standard approach in survey sampling. Of particular interest is the case in which the aggregated data are themselves estimates.
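As a rough illustration of reconciling microdata with known aggregated figures through calibration, the sketch below adjusts base sampling weights by post-stratification (the simplest form of calibration) so that the weighted group counts reproduce published totals. The groups, weights, and totals are invented; real applications would typically use dedicated calibration routines and more complex constraints.

```python
import pandas as pd

# Hypothetical microdata with base sampling weights and a grouping variable.
micro = pd.DataFrame({
    "group":  ["A", "A", "B", "B", "B"],
    "weight": [100.0, 120.0, 80.0, 90.0, 110.0],
})

# Hypothetical aggregated source: known (or estimated) population counts per group.
known_totals = {"A": 250.0, "B": 300.0}

# Post-stratification: scale the weights within each group so that the
# calibrated weights reproduce the known aggregated totals.
group_sums = micro.groupby("group")["weight"].transform("sum")
micro["calibrated_weight"] = (
    micro["weight"] * micro["group"].map(known_totals) / group_sums
)

print(micro.groupby("group")["calibrated_weight"].sum())  # matches known_totals
```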
Finally, it is possible that the multisource approach involves longitudinal data. More questions then arise; the most important is that of reconciling time series of different frequencies and qualities, for example, when one source has monthly data and the other has quarterly data.
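A very simple way to reconcile a monthly indicator with quarterly figures from another source is pro-rata benchmarking: the monthly values are scaled so that they sum to the quarterly total. More refined methods (e.g., Denton-type benchmarking) preserve the month-to-month movement better; the sketch below, with invented numbers, only shows the basic idea.

```python
import numpy as np

# Hypothetical monthly indicator for one quarter and the quarterly total
# reported by a second (e.g., administrative) source.
monthly = np.array([95.0, 100.0, 115.0])
quarterly_total = 330.0

# Pro-rata benchmarking: distribute the quarterly total proportionally
# to the monthly pattern of the indicator.
benchmarked = monthly * quarterly_total / monthly.sum()

print(benchmarked, benchmarked.sum())  # the benchmarked months sum to 330.0
```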
Integration may occur between different types of sources: surveys (mainly web surveys), administrative data, other passively collected data, social network data, and other unstructured data.
Integration of different configurations as well as of different types of data sources implies different methodological problems. For instance, integrating survey and administrative data through unit record linkage requires improving coherence across data collections, using standard classifications and questions, rationalizing content between surveys, and processes for combining separate sample surveys into one survey vehicle (Bycroft, 2010).
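Unit record linkage between a survey file and an administrative file often starts from a deterministic match on standardized identifiers. The sketch below, with hypothetical identifiers and column names, shows such an exact-key linkage in Python; probabilistic linkage methods would be needed when no reliable common key exists.

```python
import pandas as pd

# Hypothetical survey and administrative files sharing a business identifier.
survey = pd.DataFrame({
    "biz_id": [" 001", "002", "004"],
    "reported_employees": [12, 40, 7],
})
admin = pd.DataFrame({
    "biz_id": ["001", "002", "003"],
    "registered_employees": [11, 42, 25],
})

# Standardize the key before linking (trim whitespace, pad to a fixed width).
for df in (survey, admin):
    df["biz_id"] = df["biz_id"].str.strip().str.zfill(3)

# Deterministic linkage on the standardized key; unmatched survey units are
# kept so that coverage differences between the sources remain visible.
linked = survey.merge(admin, on="biz_id", how="left", indicator=True)
print(linked)
```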
Example 1.7 discusses an integration of web scraped information, administrative data, and surveys, whereas Example 1.8 shows an application of integration between survey data and social network unstructured information.
As a result of the multisource integration, statistical output is based on complex combinations of sources. Its quality depends on the quality of the primary sources and the ways they are combined. Some studies are investigating the appropriateness of the current set of quality measures for multiple source statistics; they explain the need for improvement and outline directions for further work.
EXAMPLE 1.7 Web scraping, administrative data, and surveys
Since 2015, Istat has been experimenting with web scraping, text mining, and machine learning techniques in order to obtain a subset of the estimates currently produced by the sampling "Survey on ICT Usage and e-Commerce in Enterprises," carried out yearly on the web. Studies by Barcaroli et al. (2015, 2016) and Righi, Barcaroli, and Golini (2017) have focused on implementing the experiment and on evaluating data quality.
By trying to make optimal use of all available information, from administrative sources to web-scraped information, the estimates produced by the web survey could tentatively be improved. The aim of the experiment is also to evaluate the possibility of using the sample of surveyed data as a training set in order to fit models to be applied to website information.
Recent and ongoing steps aim at further improving the performance of the models by adding explanatory variables consisting not only of single terms but also of sequences of terms relevant for each characteristic of interest. Once a certain degree of quality of the resulting predictive models can be guaranteed, they will be applied to the whole population of enterprises owning a website. A crucial task will also be the retrieval of the URLs of the websites for the whole population of enterprises. Finally, once the values of the target variables have been predicted for all reachable units in the population, the quality of the resulting estimates will be analyzed and compared with the current sampling estimates obtained by the survey. In a simulation study, Righi, Barcaroli, and Golini (2017) found that the use of an auxiliary variable from the Internet DB source that is highly correlated with the target variable does not guarantee an enhancement of the quality of the estimates if selectivity affects the source. Bias may occur due to the absence of some subgroups. Thus, an analysis of the DB variable and a study of the relationship between the populations covered and not covered by the DB source are fundamental steps in deciding how to use the source and which framework to implement in order to assure high-quality output.
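The modeling step described above can be thought of as a supervised text-classification problem: the scraped website text of the surveyed enterprises, labeled with the survey answers (e.g., whether the enterprise offers e-commerce), is used to train a classifier that is then applied to the scraped text of all enterprises with an identified website. The sketch below uses TF-IDF features and logistic regression as one plausible choice; it is not the actual Istat implementation, and all texts and labels shown are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training set: scraped website text of surveyed enterprises,
# labeled with the corresponding survey answer (1 = offers e-commerce).
train_texts = [
    "online shop checkout basket secure payment",
    "company history mission contact address",
    "add to cart shipping returns credit card",
    "our services consulting team careers",
]
train_labels = [1, 0, 1, 0]

# TF-IDF features plus logistic regression: one plausible, simple model choice.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Scraped text of enterprises outside the sample, to be predicted.
new_texts = ["buy now secure checkout", "press releases and annual report"]
predicted = model.predict(new_texts)
print(predicted)  # predicted e-commerce indicator for the non-sampled enterprises
```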
In conclusion, the approach that uses web scraping and administrative data together with the web survey looks