ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data

Report of WP1. State of the art on statistical methodologies for integration of surveys and administrative data

LIST OF CONTENTS

Preface (Mauro Scanu – ISTAT)

  1. Literature review on probabilistic record linkage
    1.1. Statement of the record linkage problem (Marco Fortini – ISTAT)
    1.2. The probabilistic record linkage workflow (Nicoletta Cibella, Mauro Scanu, Tiziana Tuoto – ISTAT)
    1.3. Notation and difficulties for probabilistic record linkage (Marco Fortini – ISTAT)
    1.4. Decision rules and procedures (Miguel Guigo – INE)
    1.5. Estimation of the distribution of matches and nonmatches (Mauro Scanu – ISTAT)
    1.6. Blocking procedures (Gervasio Fernandez – INE)
    1.7. Quality assessments (Nicoletta Cibella, Tiziana Tuoto – ISTAT)
    1.8. Analysis of files obtained by record linkage (Miguel Guigo – INE)
  2. Literature review on statistical matching
    2.1. Statement of the problem of statistical matching (Marcello D’Orazio – ISTAT)
    2.2. The statistical matching workflow (Marcello D’Orazio, Marco Di Zio, Mauro Scanu – ISTAT)
    2.3. Statistical matching methods (Marco Di Zio – ISTAT)
    2.4. Uncertainty in statistical matching (Mauro Scanu – ISTAT)
    2.5. Evaluation of the accuracy of statistical matching (Marcello D’Orazio – ISTAT)
  3. Literature review on micro integration processing
    3.1. Micro integration processing (Eric Schulte Nordholt – CBS)
    3.2. Combining data sources: micro linkage and micro integration (Eric Schulte Nordholt, Frank Linder – CBS)
    3.3. Key reference on micro integration (Miguel Guigo – INE; Paul Knottnerus, Eric Schulte Nordholt – CBS)
    3.4. Other references
  4. Practical experiences
    4.1. Record linkage of administrative and survey data for the EU-SILC survey: the Italian experience (Paolo Consolini – ISTAT)
    4.2. Record linkage applied for the production of business demography data (Caterina Viviano – ISTAT)
    4.3. Combination of administrative, register and survey data for Structural Business Statistics (SBS) – the Austrian concept (Gerlinde Dinges – STAT)
    4.4. Record linkage applied for the computer assisted maintenance of the Business Register: the Austrian experience (Alois Haslinger – STAT)
    4.5. The use of data from population registers in the 2001 Population and Housing Census: the Spanish experience (INE Spain)
    4.6. Administrative data source (DBP) for population statistics based on the ISEO register in the Czech Republic (Jaroslav Kraus – CZSO)
    4.7. An experiment of statistical matching between the Labour Force Survey (RFL) and the Time Use Survey (TUS) (Gianni Corsetti – Isfol)
  5. Results of the survey on the use and/or development of integration methodologies in the different ESS countries
    5.1. Introduction (Mauro Scanu – ISTAT)
    5.2. Objective of the integration process and characteristics of the file to integrate (Luis Esteban Barbado – INE Spain)
    5.3. Privacy issues (Eric Schulte Nordholt – CBS)
    5.4. Problems of the integration process and methods (Nicoletta Cibella, Tiziana Tuoto – ISTAT)
    5.5. Software issues (Ondrej Vozár – CZSO)
    5.6. Documentation on the integration process (Nicoletta Cibella, Tiziana Tuoto – ISTAT)
    5.7. Possible changes (Alois Haslinger – STAT)
    5.8. Possibility to establish links between experts (Alois Haslinger – STAT)

Annex. Survey on the use and/or development of integration methodologies in the different ESS countries

Preface

(Mauro Scanu - ISTAT)

This document is the deliverable of the first work package of the Centre of Excellence on Statistical Methodology, Area Integration of Surveys and Administrative Data (CENEX-ISAD, consisting of the NSIs of Austria, the Czech Republic, Italy, the Netherlands and Spain). Its objective is to provide a complete and updated overview of the state of the art of the methodologies for integrating different data sources. The NSIs within the ESS can refer to this single document whenever they need to:

1) define a problem of integration of different sources according to the characteristics of the data sets to integrate;

2) discover the different solutions available in the statistical literature;

3) understand which problems still need to be tackled, and motivate the research on these issues;

4) look at the characteristics of many different projects that needed the integration of different data sources.

This document consists of five chapters that can be broadly clustered in two groups.

The first three chapters are mainly methodological. They describe the state of the art for i) probabilistic record linkage, ii) statistical matching, and iii) micro integration processing, respectively. Each chapter is essentially a collection of references: this part of the document is intended as a tool for orienting oneself in the large body of papers on the different integration methodologies. This aspect should not be considered a secondary issue in the production of official statistics. The main problem is that methodologies for the integration of different sources are, most of the time, still in their infancy, while the current information needs of official statistics require an increasingly sophisticated use of multiple sources for the production of statistics. Whoever is in charge of a project on the integration of different sources must be aware of all the available alternatives and should be able to justify the chosen method.

The last two chapters give an overview of integration experiences in the ESS. Chapter 4 collects detailed information on many different projects requiring the joint use of two or more sources in the NSIs participating in this CENEX. Chapter 5 illustrates the results of a survey on the use and/or development of integration methodologies in the ESS countries. These chapters illustrate the many information needs that cannot be met by a single source of information, as well as the specific problems that must be addressed in each integration process.


1. Literature review on probabilistic record linkage

1.1. Statement of the problem of record linkage

Marco Fortini (ISTAT)

Record linkage consists of identifying pairs of records, coming from the same data file or from different files, that belong to the same entity, on the basis of the agreement between common identifiers.

The previous figure, taken from Fortini et al. (2006), shows the record linkage of two data sets A and B. The links aim at connecting records belonging to the same unit by comparing some identifiers (name, address, telephone). The agreement may be imperfect (as for the telephone number of the first record of the left data set and the third record of the right data set) even though the records belong to the same unit.

A classical use of linked data in the statistical research context is the study of the relationships between variables collected on the same individuals but coming from different sources. Other important applications include the removal of duplicates from data sets and the development and management of registers. Record linkage is also a pervasive technique in business contexts, where it concerns information systems for customer relationship management and marketing. Recently, public institutions have also shown an increasing interest in e-government applications.

Whatever the purpose of record linkage, the same logic applies in the extreme cases: a pair of records that is in complete disagreement on the key variables is almost certainly composed of two different entities, while a perfect agreement indicates an almost certain match. All the intermediate cases, whether a partial agreement between two different units has arisen by chance or a partial disagreement between two records referring to the same entity has been caused by errors in the comparison variables, have to be resolved according to the particular approach adopted.

A distinction between deterministic and probabilistic approaches is often made in the literature, where the former is associated with the use of formal decision rules while the latter makes explicit use of probabilities for deciding when a given pair of records is actually a match. This distinction is made more subtle by the existence of a large number of different approaches, mainly developed in computer science, that rely on similarity metrics, data mining, machine learning, etc., without explicitly defining any substantive probabilistic model. The present review discusses only the strictly probabilistic approaches, since they naturally accommodate the essential task of evaluating matching errors; Gu et al. (2003) is referenced as a first attempt at an integrated view of recent developments in all the major approaches.

Bibliography

Fortini, M., Scannapieco, M., Tosco, L., and Tuoto, T., 2006. Towards an Open Source Toolkit for Building Record Linkage Workflows. Proceedings of the SIGMOD 2006 Workshop on Information Quality in Information Systems (IQIS’06), Chicago, USA, 2006.

Gu, L., Baxter, R., Vickers, D., and Rainsford, C., 2003. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003.

1.2. The probabilistic record linkage workflow

Nicoletta Cibella, Mauro Scanu, Tiziana Tuoto (ISTAT)

Probabilistic record linkage is a complex procedure composed of different steps. A workflow, adapted from the one described in the record linkage manual of Statistics New Zealand, is the following.

This document reviews in detail the papers on the different steps of the workflow.

  1. Start from the (already harmonized) data sets A and B (see WP2, Section 1, for more details).
  2. Usually the overall set of pairs of records from the data sets A and B is too large, and this causes computational and statistical problems. There are different procedures to deal with this problem, which are listed in Section 1.6.
  3. A necessary step for probabilistic record linkage is to assess how stable the variables used for matching remain from one data set to the other. This information is seldom available in advance, but it can be estimated from the data sets at hand (Section 1.5).
  4. Consider all the pairs of records in the search space created by the procedure in step 2. Apply a decision rule for each pair of records (results are: link, possible link, no link). This is described in Section 1.4.
  5. Link the two data sets according to the results of the previous step.
  6. Evaluate the quality of the results (Section 1.7).
  7. Analyse the resulting integrated data set, bearing in mind that this file can contain matching errors (Section 1.8).

In the following, all the previous steps are analysed, starting from the core of the probabilistic record linkage problem: the definition of the model that generates the observed data (Section 1.3), the optimal decision procedure according to the Fellegi and Sunter theory (Section 1.4), and the estimation of the parameters needed to apply the decision procedure (Section 1.5). After reviewing these aspects, the procedures for reducing the search space of the pairs of records are illustrated (Section 1.6), appropriate methods for evaluating the quality of probabilistic record linkage are outlined (Section 1.7), and, finally, the problem of analysing data sets obtained by means of a record linkage procedure is addressed (Section 1.8).
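As an illustration of how these steps fit together, the following minimal sketch (in Python) runs the whole workflow on two invented records per file. Everything in it is an assumption made for illustration only: the toy records, the blocking key (the first two digits of the postal code), the m and u probabilities (which in a real application would be estimated, as discussed in Section 1.5), and the decision thresholds.

    from math import log

    # Step 1: two tiny, already harmonized files A and B with common key variables.
    FILE_A = [
        {"id": "a1", "name": "ROSSI", "year": 1970, "zip": "00100"},
        {"id": "a2", "name": "BIANCHI", "year": 1985, "zip": "20100"},
    ]
    FILE_B = [
        {"id": "b1", "name": "ROSSI", "year": 1970, "zip": "00187"},
        {"id": "b2", "name": "VERDI", "year": 1962, "zip": "20100"},
    ]
    KEY_VARS = ["name", "year", "zip"]

    # Illustrative m and u probabilities for each key variable (Section 1.5
    # discusses how they can actually be estimated from the data at hand).
    M_PROB = {"name": 0.95, "year": 0.90, "zip": 0.85}
    U_PROB = {"name": 0.01, "year": 0.05, "zip": 0.10}

    UPPER, LOWER = 5.0, 0.0  # illustrative decision thresholds

    def blocked_pairs(file_a, file_b):
        """Step 2: reduce the search space by blocking on the first two digits of 'zip'."""
        return [(a, b) for a in file_a for b in file_b if a["zip"][:2] == b["zip"][:2]]

    def comparison_vector(a, b):
        """Binary agreement indicators on the key variables (the vector gamma)."""
        return {k: int(a[k] == b[k]) for k in KEY_VARS}

    def weight(gamma):
        """Composite log-likelihood-ratio weight, assuming conditional independence."""
        w = 0.0
        for k, agree in gamma.items():
            w += log(M_PROB[k] / U_PROB[k]) if agree else log((1 - M_PROB[k]) / (1 - U_PROB[k]))
        return w

    def decide(w):
        """Step 4: three-way decision rule (link / possible link / non-link)."""
        if w >= UPPER:
            return "link"
        if w <= LOWER:
            return "non-link"
        return "possible link"

    # Steps 4-6: classify each candidate pair and print the outcome.
    for a, b in blocked_pairs(FILE_A, FILE_B):
        g = comparison_vector(a, b)
        print(a["id"], b["id"], g, round(weight(g), 2), decide(weight(g)))

With these illustrative values, the pair (a1, b1) agrees on name and year and is declared a link, while (a2, b2) agrees only on the postal code and is declared a non-link.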

1.3. Notation and technicalities for probabilistic record linkage

Marco Fortini (ISTAT)

The early contribution to modern record linkage dates back to Newcombe et al. (1959) in the field of health studies, followed by Fellegi and Sunter (1969), where a more general and formal definition of the problem is given. Following the latter approach, let A and B be two partially overlapping files consisting of the same type of entities (individuals, households, firms, etc.), of size nA and nB respectively. Let Ω be the set of all possible pairs of records coming from A and B, i.e. Ω = {(a,b): a∈A, b∈B}. Suppose also that the two files consist of vectors of variables (XA, YA) and (XB, ZB), either quantitative or qualitative, and that XA and XB are sub-vectors of k common identifiers, called key variables in what follows, so that any single unit is univocally identified by an observation x. Moreover, let γab designate the vector of indicator variables regarding the pair (a,b), with γjab = 1 in the j-th position if a and b agree on the j-th key variable and γjab = 0 otherwise, j = 1,…,k. The indicators γjab will be called comparison variables.

Given the definitions above, record linkage can be formally represented as the problem of assigning each pair (a,b) to one of the two subsets M or U, which identify the sets of matched and unmatched pairs respectively, given the state of the vector γab.

Probabilistic methods of record linkage generally assume that the observations are independently and identically distributed according to appropriate probability distributions. Following Fellegi and Sunter (1969), a first, binary random variable assigns each pair of records (a,b) either to the matched records (set M) or to the unmatched ones (set U). This variable is latent (unobserved) and is actually the target of the record linkage process. Secondly, the comparison variables γab follow distinct distributions according to the pair status: let m(γab) be the distribution of the comparison variables given that the pair (a,b) is a matched pair, i.e. (a,b)∈M, and u(γab) the distribution given that the pair (a,b) is an unmatched pair, i.e. (a,b)∈U. These distributions are crucial for deciding the status of the record pairs.
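A short sketch may help fix the notation of this section: Ω is the Cartesian product of the two files and γab is a vector of k binary agreement indicators. The file contents and key variables below are invented for illustration; the tabulated frequencies of the comparison patterns are the raw material on which the estimation methods of Section 1.5 operate.

    from itertools import product
    from collections import Counter

    # Two illustrative files; the k = 2 key variables are "name" and "year".
    A = [{"name": "ROSSI", "year": 1970}, {"name": "VERDI", "year": 1962}]
    B = [{"name": "ROSSI", "year": 1971}, {"name": "VERDI", "year": 1962}]
    KEYS = ["name", "year"]

    # Omega: the set of all nA x nB pairs of records from A and B.
    omega = list(product(A, B))

    def gamma(a, b):
        """Comparison vector: 1 in position j if a and b agree on the j-th key variable."""
        return tuple(int(a[k] == b[k]) for k in KEYS)

    # Observed distribution of the comparison vectors over Omega. The (latent) match
    # status of each pair determines whether its gamma is generated by m(.) or u(.).
    print(Counter(gamma(a, b) for a, b in omega))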

Bibliography

Fellegi, I. P., and Sunter, A. B., 1969. A theory for record linkage. Journal of the American Statistical Association, Volume 64, pp. 1183-1210.

Newcombe, H., Kennedy, J., Axford, S., and James, A., 1959. Automatic Linkage of Vital Records. Science, Volume 130, pp. 954-959.

1.4. Decision rules and procedures

Miguel Guigo (INE)

1.4.1. General statement of the statistical problem

The key task of the whole record linkage process is to determine whether or not a pair of records belongs to the same entity; hence, the quality of the overall result achieved by the linkage procedure relies on the quality of the tool applied to make this choice, that is, the decision rule.

From a statistical point of view, following De Groot (1970), a decision problem consists of an experiment whose actual outcome is unknown and whose consequences depend both on that outcome and on a decision taken by the statistician. Specifically, let D be the space of all possible decisions d which might be made, let Ω be the space of all possible outcomes ω of the experiment, and let R be the space of all possible rewards or results r = r(ω, d) of the statistician's decision d and the outcome ω of the experiment. In most cases, r is actually a loss function.

We also assume that there exists a probability distribution P on the space Ω of outcomes, whose value is specified for each event in Ω. The statistician must then choose an optimal, possibly non-deterministic, behaviour in an incompletely known situation. One way to do so is to minimize the expectation of the total loss; the decision rule is then optimal (Wald, 1950). The statistician, however, also faces a problem with respect to the probability distribution P, which is known to belong to a family of probability distributions but has some unknown parameters; by observing the phenomenon and processing the data, the statistician has to make a decision on P. A statistical decision rule is therefore a transition probability distribution from a space of outcomes Ω into a space of decisions D[1].
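In symbols, and as a hedged paraphrase of the setting just described rather than a quotation from De Groot (1970), an optimal decision minimizes the expected loss with respect to P:

    % Bayes-type decision rule: choose the decision d in D that minimizes
    % the expected loss r(omega, d) over the outcome space Omega under P.
    d^{*} \;=\; \arg\min_{d \in D}\; \mathbb{E}_{P}\bigl[r(\omega, d)\bigr]
          \;=\; \arg\min_{d \in D} \int_{\Omega} r(\omega, d)\,\mathrm{d}P(\omega)

When P is only partially known, the same criterion is applied after the unknown features of P have been estimated from the observed data, which is, roughly, the situation described above.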

In the case of a record linkage procedure, the space of actual outcomes consists of a real match or a real non-match for every pair of records belonging to Ω = {(a,b): a∈A, b∈B}, and the space D of all possible decisions consists of assigning or not assigning the pair as a link.

In this context, the decision rule can be seen as a two-step process, where the first step is to organize the set of agreements between common identifiers for each pair of records (a,b) into an array γab. This amounts to a mapping from Ω onto Γ, where Γ is known as the comparison space. A function that returns a numerical comparison value for γjab, multiplied by a weight wj, gives a basic score of the level of agreement on the j-th key variable, and thus sets the contribution of each common identifier. Procedures for measuring the agreement between the records (a,b) then result in a composite weight describing their closeness. The comparison patterns can be more or less arbitrary, based on distance, similarity, or linear regression models, amongst others. For a more complete list of comparators, see Yancey (2004b).

Newcombe et al. (1959) and Fellegi and Sunter (1969) take into account the different amount of information provided by each key variable by using a log-likelihood ratio based on the agreement probabilities. This is considered the standard procedure, as shown below. Given m(γab) and u(γab) as defined in the previous section, each pair is assigned the weight: wab = log(m(γab) / u(γab)).
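Under the conditional independence assumption commonly made in the Fellegi and Sunter setting (an assumption added here for illustration, not something stated in the paragraph above), the composite weight decomposes into a sum of per-variable weights:

    % Composite weight as a sum of per-variable contributions, assuming the k
    % comparison variables are conditionally independent given the match status.
    w_{ab} \;=\; \log\frac{m(\gamma_{ab})}{u(\gamma_{ab})}
           \;=\; \sum_{j=1}^{k}\Bigl[\gamma^{j}_{ab}\,\log\frac{m_j}{u_j}
                 \;+\;\bigl(1-\gamma^{j}_{ab}\bigr)\log\frac{1-m_j}{1-u_j}\Bigr],
    \qquad m_j = P\bigl(\gamma^{j}_{ab}=1 \mid M\bigr),\quad
           u_j = P\bigl(\gamma^{j}_{ab}=1 \mid U\bigr)

For instance, with the purely illustrative values m_j = 0.9 and u_j = 0.1, agreement on the j-th key variable contributes log 9 ≈ 2.2 to wab, while disagreement contributes log(0.1/0.9) ≈ -2.2.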

Once a weighted measure of agreement is set, the following step is in its turn a mapping from Γ onto a space of decisions consisting of: A1 (that is, a link), A2 (that is, a possible link), and A3 (that is, a non-link), with the related probabilities given that (a,b)∈U or (a,b)∈M, which can be derived from the probability distributions m(γab) and u(γab) and the regions of Γ associated with each decision. As the weighted score increases, the associated pair (a,b) is more likely to belong to M. Thus, given an upper threshold, a larger comparison value for γab leads to considering the pair as a link; and, given a lower threshold, a smaller comparison value leads to considering it as a non-link.

Taking both steps into account, the record linkage problem and its decision rule can be treated as an ordinary statistical hypothesis test, with a critical region and an acceptance region obtained from the different values of γ in Γ and their respective composite weights on R, compared with a set of fixed bounds. A probability model based on [m(γab), u(γab)] is therefore also needed in order to calibrate the error rates, i.e. α = P{A1 | (a,b)∈U} and β = P{A3 | (a,b)∈M}.

At this point, it is important to remark that, while Ω consists of only two disjoint subsets, M and U, the space of decisions is split into three subsets because the probability distributions of matches and non-matches partially overlap. For possible links, i.e. when decision A2 is taken, a subsequent clerical review of the ambiguous results is needed in order to discriminate these intermediate cases between links and non-links. Intuitively, if the main reason for implementing an automatic record linkage procedure is to avoid or reduce the costs, time and errors involved in having specifically trained staff link records manually, then the larger A2 is, the larger those costs, time consumption and errors become, and the worse the decision rule is. Hence, the optimal linkage rule maximizes the probabilities of positive dispositions of comparisons, that is, positive links A1 and positive non-links A3, for a given pair of fixed error levels α and β.
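The calibration of the error rates α and β mentioned above can be illustrated with a small sketch: once m(γ) and u(γ) are specified, all 2^k comparison patterns are enumerated, each receives a composite weight, and α (respectively β) is the total u-probability (m-probability) of the patterns falling in the link (non-link) region. The per-variable probabilities, the conditional independence assumption and the thresholds below are illustrative choices, not values taken from this report.

    from itertools import product
    from math import log

    # Illustrative agreement probabilities among matches (m) and non-matches (u)
    # for k = 3 key variables, and illustrative upper/lower decision thresholds.
    M_PROB = [0.95, 0.90, 0.85]
    U_PROB = [0.01, 0.05, 0.10]
    UPPER, LOWER = 4.0, -2.0

    def probs_and_weight(pattern):
        """m(gamma), u(gamma) and the composite weight of one comparison pattern,
        assuming conditional independence of the comparison variables."""
        m = u = 1.0
        for agree, mj, uj in zip(pattern, M_PROB, U_PROB):
            m *= mj if agree else (1 - mj)
            u *= uj if agree else (1 - uj)
        return m, u, log(m / u)

    alpha = 0.0  # alpha = P{A1 | (a,b) in U}: u-probability of the link region
    beta = 0.0   # beta  = P{A3 | (a,b) in M}: m-probability of the non-link region
    for pattern in product([0, 1], repeat=len(M_PROB)):
        m, u, w = probs_and_weight(pattern)
        if w >= UPPER:       # region of Gamma mapped to decision A1 (link)
            alpha += u
        elif w <= LOWER:     # region of Gamma mapped to decision A3 (non-link)
            beta += m

    print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")

Moving the thresholds changes the trade-off: widening the A2 region lowers both error rates at the price of more clerical review, which is exactly the cost discussed above.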