Proposals for linking Big Data and statistical registers
Daniela Fusco [1], Antony Rizzi, Tiziana Tuoto
Keywords: data linking, geo-referenced data linkage, Big Data
1. Introduction
The use of Big Data for statistical purposes is an opportunity for National Statistical Institutes (NSIs). The growing demand for timely and up-to-date data can be met thanks to the three key features of Big Data: Volume, Velocity and Variety. In some cases the connection between Big Data objects and statistical units is simple, and it is possible to take advantage of Internet-scraped data to enrich, update and confirm the information of traditional statistics. This happens when a unique linkage key is available, which in real life does not occur very often. Internet-scraped data are often full of errors or lack information in the record identifiers; moreover, statistical data may contain other kinds of errors, so sophisticated linkage techniques are required. In this study, starting from a case study, we outline novelties and challenges in integrating Internet-scraped data with non-traditional statistical datasets: as a case study we propose the integration of Internet-scraped data on agritourism with the data reported in the Farm Register (SFR) built by Istat, which is obtained by integrating several administrative and statistical sources. We propose to explore new techniques not yet introduced in the official statistics production system: a hybrid linkage strategy that combines traditional demographic information (i.e. address, name) with geographical data.
2. Methods
Record linkage is the set of techniques and methods for recognizing the same real-world entity even when it is represented differently in separate data sources. Record linkage techniques, in particular probabilistic record linkage, are widespread in official statistics because they are useful when a unique unit identifier is not available and the identifying variables are affected by errors, including missing values.
In general, a record linkage problem can be seen as a classification problem in which all the pairs generated by the Cartesian product of the records of the two (or more) files to be integrated must be classified into two disjoint sets, the set of Matches (M) and the set of Unmatches (U). According to the most popular reference for record linkage, the Fellegi-Sunter (1969) theory [1], the classification of the pairs is based on K common identifiers (matching variables). For each pair (a,b), a comparison function is applied to the matching variables in order to obtain a comparison vector. The pair is classified on the basis of the ratio between the probabilities of observing that comparison vector given that the pair (a,b) belongs to the subset M or to the subset U, denoted m and u respectively. In practice, once the probabilities m and u are estimated, for instance by means of the EM algorithm, all the pairs can be ranked according to their ratio r = m/u, and the pairs to be matched are detected by means of a classification criterion based on two thresholds Tm and Tu (Tm > Tu), chosen so as to minimize the false match rate and the false non-match rate, respectively.
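As a minimal sketch of the classification rule described above, the following Python fragment computes the ratio r = m/u on a log scale for binary comparison vectors and applies the two thresholds. It assumes conditional independence among the K matching variables; the values of m and u, the thresholds and the record pairs are made up for illustration and are not estimates from the SFR data. Pairs falling between the two thresholds are labelled as possible matches, which is the usual treatment in the Fellegi-Sunter framework.

```python
import math


def match_weight(gamma, m, u):
    """Log2 of the ratio r = m/u for a binary comparison vector gamma.

    m[k] and u[k] are the probabilities of agreement on variable k among
    matches and non-matches; conditional independence is assumed.
    """
    weight = 0.0
    for agree, m_k, u_k in zip(gamma, m, u):
        if agree:
            weight += math.log2(m_k / u_k)
        else:
            weight += math.log2((1 - m_k) / (1 - u_k))
    return weight


def classify(weight, t_m, t_u):
    """Two-threshold rule with t_m > t_u."""
    if weight >= t_m:
        return "match"
    if weight <= t_u:
        return "non-match"
    return "possible match"


# Illustrative (made-up) parameters for K = 3 matching variables,
# e.g. name, address, municipality.
m = [0.9, 0.8, 0.95]
u = [0.05, 0.1, 0.2]

# Hypothetical comparison vectors for three candidate pairs.
pairs = {
    ("a1", "b1"): (1, 1, 1),
    ("a1", "b2"): (0, 0, 1),
    ("a2", "b2"): (1, 0, 1),
}

labels = {
    pair: classify(match_weight(gamma, m, u), t_m=5.0, t_u=0.0)
    for pair, gamma in pairs.items()
}
```

Working on the log scale does not change the ranking of the pairs, since the logarithm is monotone; it only turns the product over the matching variables into a sum.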
2.1. Georeferenced approach
The combined use of data coming from different sources is a common problem for NSIs. Linkage methods typically compare different variables of the units using a set of distance measures. The choice of variables is a major issue, although it is often constrained by the limited number of common variables available in the input data. In addition, in our specific case, the Internet-scraped data are likely not to be standardized or coded. A georeferenced approach could support data integration when the number of suitable variables is insufficient or when the variables are of low quality [2].
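One way a georeferenced comparison could enter the linkage is sketched below: the great-circle (haversine) distance between the coordinates of two records is computed and turned into an agreement indicator. The function names, the 1 km tolerance and the coordinates are illustrative assumptions, not choices made in this study.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_agreement(point_a, point_b, max_km=1.0):
    """True when two (lat, lon) points lie within max_km of each other.

    max_km is a hypothetical tolerance that would need tuning.
    """
    return haversine_km(*point_a, *point_b) <= max_km


# Made-up coordinates: a register record and a scraped record close to it.
register_point = (41.9028, 12.4964)
scraped_point = (41.9078, 12.4964)
same_place = geo_agreement(register_point, scraped_point)
```

The distance itself, rather than the binary indicator, could also be binned into classes and used as a multi-level matching variable.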
2.2. Combined methods approach
The aim of this work is to test the feasibility of different approaches to linking data on Agritourism Farms (AFs) scraped from the web with the Italian SFR. In particular, the methodologies that we consider can be applied selectively and specifically to the different phases of record linkage, from the pre-processing step to the estimation of the parameters of the classification rule. In this paper we introduce the following contributions. Firstly, for a proper identification of the links between each AF in the Italian SFR and in the hub websites, we test a method for parsing the character strings that make up the data scraped from the web. In fact, the information that can be used, i.e. the addresses and the names of the AFs, is often not correctly or uniquely indicated. Moreover, we apply the SimHash algorithm [3] as a comparison function: this algorithm, originally used to recognise duplicated but not identical web documents, leads to very good results when used to reduce the search space of pairs in traditional record linkage problems, e.g. linking survey and administrative data [4].
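To make the idea of SimHash as a comparison function more concrete, a minimal sketch follows. It builds a 64-bit fingerprint from character 3-grams and compares two fingerprints by Hamming distance. The tokenisation into 3-grams, the use of MD5 as the underlying hash, the normalisation step and the example names are all assumptions made for illustration; they are not necessarily the settings used in [3] or [4].

```python
import hashlib


def _features(text, n=3):
    """Character n-grams of the lower-cased, whitespace-normalised text."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        return [s]
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def simhash(text, bits=64):
    """SimHash fingerprint: each feature votes +1 or -1 on every bit position."""
    counts = [0] * bits
    for feature in _features(text):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical farm names: one from the register, one scraped from a hub.
name_register = "Agriturismo La Quercia"
name_scraped = "agriturismo  la quercia di Rossi"
distance = hamming(simhash(name_register), simhash(name_scraped))
```

A small Hamming distance suggests similar strings, so candidate pairs could be restricted to those whose fingerprints differ in at most a few bits; the cut-off would have to be chosen empirically.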
Additionally, we propose an application of the Multinomial EM Algorithm [5], an extension of the traditional EM algorithm: instead of two categories (agreement and disagreement) for each matching variable, we define k categories, where each category represents a class based on an interval of the string comparator values. If, for a pair j, variable q falls in class k of the string comparator, the corresponding element of the comparison vector equals 1; otherwise it is 0. The EM algorithm under the multinomial distribution is then used to estimate the match parameters for each variable q in each class k. Generating m and u probabilities for each similarity score level (for each key variable) has the following advantage: although alternative similarity scores might be more or less suitable for certain types of variable, once the score is chosen we need only consider the appropriateness of the categorization. The similarity scores are incorporated into the fitting process, so the parameter estimates are easily interpretable.
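The estimation step described above can be sketched as an EM algorithm for a two-class mixture (matches and non-matches) in which each matching variable takes one of several similarity classes. This is only an illustrative sketch under the assumption of conditional independence among variables; the starting values, the small smoothing constant and the toy data are our own choices and may differ from the formulation in [5]. Each pair is stored as a tuple of class indices, which is equivalent to the 0/1 indicator coding of the comparison vector.

```python
def multinomial_em(data, n_cat, n_iter=100, p=0.1):
    """EM estimates for a two-class mixture with categorical comparisons.

    data[j][q] in range(n_cat) is the similarity class of pair j on
    variable q (class n_cat - 1 is the highest similarity).
    Returns p (share of matches), m[q][k], u[q][k] and the posterior
    match probability g[j] of each pair.
    """
    n_var = len(data[0])
    eps = 1e-9  # smoothing, keeps every probability strictly positive
    # Starting values: matches favour high classes, non-matches low ones.
    raw = [k + 1 for k in range(n_cat)]
    total = sum(raw)
    m = [[r / total for r in raw] for _ in range(n_var)]
    u = [[r / total for r in reversed(raw)] for _ in range(n_var)]
    g = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for row in data:
            like_m, like_u = p, 1 - p
            for q, k in enumerate(row):
                like_m *= m[q][k]
                like_u *= u[q][k]
            g.append(like_m / (like_m + like_u))
        # M-step: update the mixing proportion and the class probabilities.
        sum_g = sum(g)
        p = sum_g / len(data)
        for q in range(n_var):
            for k in range(n_cat):
                num_m = sum(g_j for g_j, row in zip(g, data) if row[q] == k)
                num_u = sum(1 - g_j for g_j, row in zip(g, data) if row[q] == k)
                m[q][k] = (num_m + eps) / (sum_g + n_cat * eps)
                u[q][k] = (num_u + eps) / (len(data) - sum_g + n_cat * eps)
    return p, m, u, g


# Toy data: 2 variables, 3 similarity classes, two clearly separated groups.
pairs = [(2, 2)] * 20 + [(0, 0)] * 80
p, m, u, g = multinomial_em(pairs, n_cat=3)
```

With real data the groups overlap, so the estimates depend on the starting values and on how the similarity scores are cut into classes.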
Finally, we give an overview of the implementation of the Agritourism Farms data in the Silk Workbench, which is part of the Silk Link Discovery Framework. In [6] it is shown how this framework supports learning linkage rules using the active learning approach: at each iteration, the user has to examine the 5 most uncertain links shown by the software. After the user has confirmed or declined this set of links, the Workbench evolves the current population of linkage rules: the user can experiment and see how changes to the linkage rule affect the accuracy of the generated links. The main expected advantage is the reduction of the manual effort of interlinking data sources, obtained by automating the generation of linkage rules. The user is only required to perform the much simpler task of confirming or declining a set of link candidates, which are actively selected by the learning algorithm so as to include candidates that yield a high information gain.
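The selection of the most uncertain links can be illustrated with a generic query-by-committee sketch: each candidate pair is scored by how evenly a set of linkage rules splits on it, and the most contested candidates are shown to the user. This is not the Silk API nor its actual selection criterion; the rules, field names and candidate values below are hypothetical.

```python
def most_uncertain(candidates, rules, n=5):
    """Return the n candidates on which the committee of rules disagrees most."""
    def disagreement(candidate):
        votes = sum(1 for rule in rules if rule(candidate))
        share = votes / len(rules)
        # 1.0 when the rules are evenly split, 0.0 when they are unanimous.
        return 1 - abs(2 * share - 1)
    # sorted() is stable, so ties keep the original candidate order.
    return sorted(candidates, key=disagreement, reverse=True)[:n]


# Hypothetical population of linkage rules (different thresholds).
rules = [
    lambda c: c["name_sim"] >= 0.7,
    lambda c: c["name_sim"] >= 0.9,
    lambda c: c["dist_km"] <= 1.0,
    lambda c: c["dist_km"] <= 5.0,
]

# Made-up link candidates with a name similarity and a distance in km.
candidates = [
    {"id": "A", "name_sim": 0.95, "dist_km": 0.2},
    {"id": "B", "name_sim": 0.80, "dist_km": 3.0},
    {"id": "C", "name_sim": 0.30, "dist_km": 20.0},
    {"id": "D", "name_sim": 0.75, "dist_km": 0.5},
    {"id": "E", "name_sim": 0.92, "dist_km": 8.0},
]

to_review = [c["id"] for c in most_uncertain(candidates, rules, n=3)]
```

The labels given by the user on the returned candidates would then feed the next round of rule learning.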
3. Results
The SFR frame for agritourism contains about 13,000 units, out of a total of 20,000 existing according to the Ministry of Agriculture (year 2013). We propose an integration of Internet-scraped data regarding agritourist farms (AFs) with data reported in the Farm Register built by Istat. The initial and most important target of web scraping is represented by the different websites acting as “hubs” (hosting and describing a plurality of agritourism farms), in general maintained by private companies or by business associations. Obviously, some AFs could be present in more than one source. Table 1 reports the number of possible AFs extracted from the main sectoral websites:
Table 1. Number of AFs by main sectorial web sites
Main URL of the hub web site / Number of AFs
/ 3,520
/ 2,292
/ 7,575
/ 4,389
/ 2,636
/ 1,514
/ 618
The information coming from the websites is often hard to combine because of errors or missing information in the record identifiers. The final aim of this integration is the use of Internet information for statistical purposes, in particular to update and integrate some of the data collected in the SFR for the AFs. Table 2 lists some variables available on the Internet that are useful for this purpose.
Table 2. Variables scraped from the Internet, by topic
Topic / Variable / Updating / Additional information
Farm localization and contacts / Address / X
Telephone number / X / X
e-mail / X / X
Web site / X / X
Geo-localization / X
Structural information / Number of rooms / X / X
Prices / X
Number of restaurant seats / X / X
Additional information / Direct sales / X / X
Product typology / X / X
E-commerce / X
Specifically, at the end of the integration process, it will be possible to:
- Validate the addresses in the SFR and identify them if they are missing;
- Estimate the variables available on the web (e-commerce, prices, etc.) in order to add new information to the SFR;
- Check and integrate the information in the SFR (telephone number, e-mail, web site, etc.).
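The three uses listed above could be sketched, for a single linked pair, as follows. The field names, the flag labels and the records are hypothetical; the rule applied here (fill when missing, flag when the two sources differ, add variables that exist only on the web) is just one possible choice and not the procedure adopted for the SFR.

```python
def integrate(register_row, web_row, update_fields, extra_fields):
    """Combine a register record with its linked web record.

    Returns the integrated record and a flag for each field touched.
    """
    out = dict(register_row)
    flags = {}
    for field in update_fields:
        web_value = web_row.get(field)
        if not web_value:
            continue  # nothing scraped for this field
        if not out.get(field):
            out[field] = web_value      # missing in the register: fill it
            flags[field] = "filled"
        elif out[field] != web_value:
            flags[field] = "to check"   # sources disagree: keep register value
        else:
            flags[field] = "confirmed"
    for field in extra_fields:
        if web_row.get(field) is not None:
            out[field] = web_row[field]  # variable available only on the web
            flags[field] = "added"
    return out, flags


# Made-up linked pair.
register_row = {"id": "F001", "address": "", "telephone": "0612345",
                "email": "info@example.org"}
web_row = {"address": "Via Roma 1", "telephone": "0699999",
           "email": "info@example.org", "e_commerce": True, "prices": None}

updated, flags = integrate(
    register_row, web_row,
    update_fields=["address", "telephone", "email"],
    extra_fields=["e_commerce", "prices"],
)
```

Keeping the flags separate from the values lets the linkage uncertainty be taken into account before any register value is overwritten.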
4. Conclusions
In this paper, new methodologies for integrating data scraped from the web with administrative sources in official statistics are analysed. The proposals are examined in the context of agricultural statistics, with a view to updating and improving the farm register and to replacing some information currently coming from sample surveys. The agricultural field is one of the most challenging areas for evaluating the performance of new linkage methodologies, due to the well-known difficulties in recognising the statistical units of this field, as well as rural addresses. Dealing with new sources of data requires new methodologies for linking data; however, due attention should be devoted to evaluating the quality of the output, in order to better understand the benefits and risks of the integration and to allow analysts to take potential integration errors into account in subsequent analyses. In this paper, we experiment with and compare the use of GPS coordinates both as matching variables and for spatial linkage. Moreover, we introduce some machine learning algorithms in order to test their effectiveness in dealing with unstructured data, and their advantages with respect to traditional standardization and parsing activities on linkage variables. In addition, these features are compared with some innovations in the traditional approach to probabilistic record linkage. Finally, the statistical validation of the linkage results and the measurement of output quality remain to be assessed.
References
[1] Fellegi, I.P., Sunter, A.B. (1969), A Theory for Record Linkage, Journal of the American Statistical Association, 64, pp. 1183-
[2] MacEachren, A.M. et al. (2005), Visualising Geospatial Information Uncertainty: what we know and what we need to know, Cartography and Geographic Information Science, Vol. 32, No. 3, pp. 139-160.
[3] Charikar, M.S. (2002), Similarity estimation techniques from rounding algorithms, Proceedings ACM STOC ’02, May 19-21 2002, Montreal, Quebec, Canada, pp. 380-388.
[4] Mancini, L., Valentino, L., Borrelli, F., Marcone, L. (2012), Record linkage between large datasets: Evidence from the 15th Italian Population Census, in Special Issue International Conference on "Methods and Models for Latent Variables" MMLV2012, Naples, 17-19 May 2012, Quaderni di Statistica, vol. 14, pp. 149-152, Napoli, Italy, Liguori Ed.
[5] Smith, D., Shlomo, N. (2014), Privacy Preserving Record Linkage, Data Without Boundaries Deliverable D11.2, Report 2014-01, CMIST Working Paper.
[6] Isele, R., Bizer, C. (2013), Active Learning of Expressive Linkage Rules using Genetic Programming.
[1] Istat - Italian National Institute of Statistics