WGMApril05/EN
WG Methodology
5 APRIL 2017
Item 4.2 of the agenda
Selectivity in Big Data Sources
Eurostat1WGM April2017/EN
1. Background
The Scheveningen Memorandum[1] specifically acknowledges that the use of big data in the context of official statistics requires new developments in methodology and that the ESS should make a special effort to support these developments.
Eurostat has launched several initiatives to explore the potential of big data and to identify its challenges. In particular, one general challenge found while running the pilots was the selectivity of the big data sources used.
To specifically address the selectivity of the big data sources used in its own pilots and, at the same time, to serve as guidance for Eurostat in planning future development activities (internally and at ESS level), Eurostat launched a study to provide an overview of methods for adjusting big data sources for selectivity.
Whereas the methodology activities of the ESSnet Big Data are quite general and overarching, this study was much more specific, addressing only selectivity, which allowed some insight to be gained much more quickly.
2. The study on methods for treating selectivity in big data sources
The main objective of the study was to identify existing methods which could be used to address the selectivity in big data sources, in order to be able to make unbiased inference for populations of interest in official statistics (e.g. resident population between 15 and 65 years old).
The selectivity issues, and the options for addressing them, are considered for big data sources in general, with the sources described in the report taken as a starting point. In particular, the study addresses the data sources being explored by Eurostat: mobile phone network data and social media.
The study lists the methods as comprehensively as possible, including their advantages and disadvantages and the level of maturity of each (i.e. whether a method is readily applicable or needs further development, either methodologically or in terms of software tools).
To achieve these objectives, the study undertook the following activities:
- Activity 1 - Analysis of big data sources. Based on the description of particular big data sources and on a literature review of selectivity in big data sources, this activity identified the types of selectivity found in big data sources. The characteristics of the sources were noted, since they may affect the applicability of methods to address selectivity (e.g. whether or not background characteristics are available). The big data sources covered are mobile network geolocation data, one case of social network data (Twitter) and two cases of web activity data (Google Trends and Wikipedia page views).
- Activity 2 - Summary of literature on selectivity treatment methods. This activity was composed of two sub-activities. The first was a wide literature search on methods to deal with selectivity. This search went beyond the methods most commonly used in statistical offices and looked for potential methods in other statistical domains. The second sub-activity was a short summary of the relevant literature. The result of this activity will be made available online (in the Big Data section[2] of the CROS portal).
- Activity 3 - Analysis and evaluation of methods. Based on the description of the particular big data sources, including the specific causes of selectivity identified in Activity 1, and on the literature summary produced in Activity 2, this activity consisted of a selection and more in-depth review of the relevant methods. The reviewed literature was classified into unit-level methods and domain-level methods. The activity also assessed what could actually be done, in order to establish whether any method is ready to be deployed or whether further developments are needed (e.g. software tools, further research).
3. Planned outcomes
Following the study, a report focusing on the results of Activities 1 and 3 will be published during 2017. The results of Activity 2 will be made available online on CROS. The report will define big data from a statistical point of view, then look in more detail at a selected group of big data sources: mobile network geolocation data, Twitter, Google Trends and Wikipedia page views. For each of these sources the report provides a detailed description: how the data is generated, which population uses the platform that originates the data, and what kind of self-selection (informative or non-informative) might be observed in the data.
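The distinction between informative and non-informative self-selection can be stated formally. The following is only a sketch using a common textbook formulation; the notation is an assumption and is not taken from the study. Let \delta_i indicate whether unit i of the target population appears in the big data source, y_i the variable of interest and x_i the available background characteristics. Self-selection is non-informative (given x_i) when

```latex
\Pr(\delta_i = 1 \mid y_i, x_i) = \Pr(\delta_i = 1 \mid x_i),
```

and informative when the probability of appearing in the source still depends on y_i after conditioning on x_i. In the non-informative case, adjustment based on x_i alone can in principle remove the selection bias; in the informative case it cannot.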
Methods at unit level and at domain level, following both pseudo-design-based and model-based approaches, that are thought to be useful for selectivity adjustment are briefly described. Their suitability for the selected data sources is discussed.
Some relevant issues connected with the nature of the sources, such as unit error, the uncertainty connected with the transformation of objects into units, and measurement error in auxiliary variables, are raised but not addressed in this study.
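As an illustration of what a unit-level, pseudo-design-based adjustment can look like, the sketch below applies simple post-stratification to a toy data set. It is not a method taken from the study: the function names, the age strata and all figures are hypothetical, and the approach assumes that a background characteristic (here an age group) is known for every unit in the source, that population counts per stratum are available from an external reference, and that selection is non-informative within each stratum.

```python
from collections import Counter


def poststratification_weights(sample, population_counts):
    """Return one pseudo-design weight N_h / n_h per sampled unit.

    sample: list of (stratum, value) tuples observed in the data source
    population_counts: dict mapping stratum -> known population size N_h
    """
    # n_h: number of observed units in each stratum
    n_h = Counter(stratum for stratum, _ in sample)
    return [population_counts[stratum] / n_h[stratum] for stratum, _ in sample]


def weighted_mean(sample, weights):
    """Weighted mean of the observed values."""
    total = sum(w * y for (_, y), w in zip(sample, weights))
    return total / sum(weights)


# Hypothetical toy data: the younger group is over-represented in the source
# (8 of 10 observed units) although both groups are equally large in the
# target population.
sample = [("15-39", 1.0)] * 8 + [("40-65", 0.0)] * 2
population_counts = {"15-39": 500, "40-65": 500}

weights = poststratification_weights(sample, population_counts)
naive = sum(y for _, y in sample) / len(sample)  # ignores selectivity
adjusted = weighted_mean(sample, weights)        # reweighted to the population

print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

In this toy case the unweighted mean reflects the over-represented group, while the reweighted mean matches the population composition. If selection within a stratum depends on the variable of interest, reweighting of this kind does not remove the bias.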
Further actions are under consideration, such as concrete methodological developments, development of statistical tools, as well as actions at European level.
[1] Scheveningen Memorandum – Big Data and Official Statistics
[2]