ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data

Report of WP3. Software tools for integration methodologies

LIST OF CONTENTS

Preface

1.  Software tools for record linkage (Monica Scannapieco – Istat)
1.1.  Comparison criteria for record linkage software tools (Monica Scannapieco – Istat)
1.2.  Probabilistic tools for record linkage
1.2.1.  AutoMatch (Nicoletta Cibella – Istat)
1.2.2.  Febrl (Miguel Guigo – INE)
1.2.3.  GRLS (Nicoletta Cibella – Istat)
1.2.4.  LinkageWiz (Monica Scannapieco – Istat)
1.2.5.  RELAIS (Monica Scannapieco – Istat)
1.2.6.  DataFlux (Monica Scannapieco – Istat)
1.2.7.  Link King (Marco Fortini – Istat)
1.2.8.  Trillium Software (Miguel Guigo – INE)
1.2.9.  Link Plus (Tiziana Tuoto – Istat)
1.3.  Summary tables and comparisons
1.3.1.  General Features
1.3.2.  Strengths and weaknesses
2.  Software tools for statistical matching (Mauro Scanu – Istat)
2.1.  Comparison criteria for statistical matching software tools (Mauro Scanu – Istat)
2.2.  Statistical matching tools
2.2.1.  SAMWIN (Mauro Scanu – Istat)
2.2.2.  R codes (Marcello D’Orazio – Istat)
2.2.3.  SAS codes (Mauro Scanu – Istat)
2.2.4.  S-Plus codes (Marco Di Zio – Istat)
2.3.  Comparison tables
3.  Commercial software tools for data quality and record linkage in the process of microintegration (Jaroslav Kraus and Ondřej Vozár – CZSO)
3.1.  Data quality standardization requirements
3.2.  Data quality assessment
3.3.  Summary tables: Oracle, Netrics and SAS/DataFlux
3.3.1.  Oracle
3.3.2.  Netrics
3.3.3.  SAS
4.  Documentation, literature and references
4.1.  Bibliography for Section 1
4.2.  Bibliography for Section 2
4.3.  Bibliography for Section 3


Preface

This document is the deliverable of the third work package (WP) of the Centre of Excellence on Statistical Methodology. The objective of this WP is to review some existing software tools for the application of probabilistic record linkage and statistical matching methods.

The document is organized in three chapters.

The first chapter is on software tools for record linkage. On the basis of the underlying research paradigm, three major categories of record linkage tools can be identified:

·  Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi and Sunter, 1969).

·  Tools for empirical record linkage, which are mainly focused on performance issues and hence on reducing the search space of the record linkage problem by means of algorithmic techniques such as sorting, tree traversal, neighbour comparison, and pruning.

·  Tools for knowledge-based linkage, in which domain knowledge is extracted from the files involved and reasoning strategies are applied to make the decision process more effective.

Given this variety of proposals, this document restricts its attention to the record linkage tools that have the following characteristics:

·  They have been explicitly developed for record linkage;

·  They are based on a probabilistic paradigm.

Two sets of comparison criteria were used for comparing several probabilistic record linkage tools. The first one considers general characteristics of the software: cost of the software; domain specificity (i.e. whether the tool has been developed ad hoc for a specific type of data and applications); maturity (or level of adoption, i.e. frequency of usage, where available, and number of years the tool has been available). The second set considers which functionalities are performed by the tool: preprocessing/standardization; profiling; comparison functions; search space reduction; decision method.

Chapter 2 deals with software tools for statistical matching. Software solutions for statistical matching are not as widespread as in the case of record linkage, because statistical matching projects are still quite rare in practice. Almost all applications are conducted by means of ad hoc code. Sometimes, when the objective is at the micro level, it is possible to use general-purpose imputation software tools. On the other hand, if the objective is at the macro level, it is possible to adopt general statistical analysis tools that are able to deal with missing data.

In this chapter, the available tools explicitly devoted to statistical matching purposes were reviewed. Only one of them (SAMWIN) is a software package that can be used without any programming skills, while the others are software codes that can be used only by those with knowledge of the corresponding language (R, S-Plus, SAS) as well as sound knowledge of statistical methodology.

The criteria used for comparing the software tools for statistical matching were slightly different from those for record linkage. The attention is restricted to costs, domain specificity and maturity of the software tool. As far as the software functionalities are concerned, the focus is on: i) the inclusion of pre-processing and standardization tools; ii) the capacity to create a complete and synthetic data set by the fusion of the two data sources to be integrated; iii) the capacity to estimate parameters of the joint distribution of variables never jointly observed; iv) the assumptions on the model of the variables of interest under which the software tool works (the best known is the conditional independence assumption of the variables not jointly observed given the common variables in the two data sources, spelled out below); v) the presence of any quality assessment of the results.
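In formulas, denoting by X the variables common to the two sources and by Y and Z the variables observed only in the first and only in the second source respectively, the conditional independence assumption states that the joint density factorizes as

    f(x, y, z) = f(y | x) f(z | x) f(x),

so that the relationship between Y and Z is recovered entirely through the common variables X.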

Furthermore, the software tools are compared according to the implemented methodologies. Strengths and weaknesses of each software tool are highlighted at the end.

Chapter 3 focuses on commercial software tools for data quality and record linkage in the process of microintegration. Vendors in the data quality market are often classified according to their overall position in the IT business, where knowledge of and experience in specific business domains play an important role. The quality of vendors and their products on the market is characterized by: i) product features and relevant services; ii) vendor characteristics, domain business understanding, business strategy, creativity, innovation; iii) sales characteristics, licensing, prices; iv) customer experience, reference projects; v) data quality tools and frameworks.

The software vendors of tools in the statistics-oriented “data quality market” propose solutions addressing all the tasks in the entire life cycle of data-oriented management programs and projects: data preparation, survey data collection, improvement of quality and integrity, preparation of reports and studies, etc.

According to the software/application category, the tools that perform or support data-oriented record linkage projects in statistics should have several common characteristics:

1)  portability, i.e. the ability to function with statistical researchers' current arrangement of computer systems and languages,

2)  flexibility in handling different linkage strategies, and

3)  low operational expenses, in terms of TCO (Total Cost of Ownership) parameters as well as of both computing time and researchers' effort.

In this chapter the evaluation focused on three commercial software packages which, according to the data quality scoring position in Gartner reports (the so-called “magic quadrants” available on the web page http://www.gartner.com), belong to important vendors in this area. The three vendors are: Oracle (representing the convergence of tools and services in the software market), SAS/DataFlux (a data quality, data integration and BI (Business Intelligence) player on the market), and Netrics (which offers advanced technology complementing the Oracle data quality and integration tools).

The set of comparison tables was prepared according to the following structure: linkage methodology, data management, post-linkage functions, standardization, costs, and empirical testing, case studies and examples.


1  Software Tools for Record Linkage

Monica Scannapieco (Istat)

The state of the art of record linkage tools includes several proposals coming from private companies but also, to a large extent, from public organizations and universities.

Another interesting feature of such tools is that some record linkage activities are performed “within” other tools. For instance, there are several data cleaning tools that include record linkage (see Barateiro and Galhardas, 2005 for a survey), but they are mainly dedicated to standardization, consistency checks, etc. A second example is provided by the recent efforts of major database management system vendors (such as Microsoft and Oracle) to include record linkage functionalities for data stored in relational databases (Koudas et al., 2006).

On the basis of the underlying research paradigm, three major categories of tools for record linkage can be identified (Batini and Scannapieco, 2006):

1. Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi and Sunter, 1969).

2. Tools for empirical record linkage, which are mainly focused on performance issues and hence on reducing the search space of the record matching problem by means of algorithmic techniques such as sorting, tree traversal, neighbour comparison, and pruning.

3. Tools for knowledge-based linkage, in which domain knowledge is extracted from the files involved, and reasoning strategies are applied to make the decision process more effective.

Given this variety of proposals, in this document we concentrate on record linkage tools that have the following characteristics:

·  they have been explicitly developed for record linkage;

·  they are based on a probabilistic paradigm.

In the following, we first illustrate a set of comparison criteria (Section 1.1) that will be used for comparing several probabilistic record linkage tools. In Section 1.2, we provide a general description of the selected tools, while in Section 1.3 we present several comparison tables that show the most important features of each tool.

Let us first list the probabilistic record linkage tools that have been selected among the most well-known and adopted ones:

1.  AutoMatch, developed at the U.S. Bureau of the Census, now under the purview of IBM [Herzog et al. 2007, chap.19].

2.  Febrl - Freely Extensible Biomedical Record Linkage, developed at the Australian National University [FEBRL].

3.  Generalized Record Linkage System (GRLS), developed at Statistics Canada [Herzog et al. 2007, chap.19].

4.  LinkageWiz, commercial software [LINKAGEWIZ].

5.  RELAIS, developed at ISTAT [RELAIS].

6.  DataFlux, commercialized by SAS [DATAFLUX].

7.  The Link King, commercial software [LINKKING].

8.  Trillium, commercial software [TRILLIUM].

9.  Link Plus, developed at the U.S. Centers for Disease Control and Prevention (CDC), Cancer Division [LINKPLUS].

For each of the above-cited tools, a general description is provided in Section 1.2.

1.1  Comparison criteria for record linkage software tools

Monica Scannapieco (Istat)

In this section we describe the criteria used to compare the probabilistic record linkage tools with respect to several general features. These criteria will be reported in a set of tables, the details of which are provided in Section 1.3.

A first table will take into account the following characteristics:

·  Free/Commercial refers to whether or not the tool is available for free. The set of possible answers is shown in Figure 1.

·  Domain Specificity refers to whether the tool has been developed ad hoc for a specific type of data and applications. For Domain Specificity, the set of answers is shown in Figure 2.

·  Maturity (Level of Adoption) is related to the frequency of usage (where available) and to the number of years the tool has been available. For Maturity, we use a HIGH/MEDIUM/LOW rating scale. In order to assign the rates, we take into account the following factors: (i) frequency of usage (Shah et al.); (ii) the number of years since the tool was first proposed.

Figure 1: Free/commercial possible answers

Figure 2: Domain specificity possible answers

A second table will consider which functionalities are performed by the tool, on the basis of a reference set of functionalities, listed in the following.

Preprocessing/Standardization

Data can be recorded in different formats, and some items may be missing, inconsistent or erroneous. The key job of this functionality is to convert the input data into a well-defined format, resolving inconsistencies in order to reduce misclassification errors in the subsequent phases of the record linkage process. In this activity, null strings are removed; abbreviations, punctuation marks, upper/lower cases, etc. are cleaned; and any transformation necessary to standardize the variables is carried out. Furthermore, spelling variations of common words are replaced with a standard spelling. A parsing procedure that divides a free-form field into a set of strings can be applied, and a schema reconciliation can be performed to avoid possible conflicts (i.e. description, semantic and structural conflicts) among data source schemas, so as to obtain standardized data fields. Geocoding, a standardization task especially conceived for name and address data, transforms data variables by assigning geographic identifiers or postal standards, such as ZIP codes or official street addresses.
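As an illustration only, the following minimal Python sketch shows a typical standardization step; the abbreviation table and the naive name-parsing rule are hypothetical examples, not taken from any of the tools reviewed here.

    import re

    # Hypothetical lookup table mapping common abbreviations to standard spellings.
    ABBREVIATIONS = {"st": "street", "rd": "road", "jr": "junior"}

    def standardize(value):
        """Lower-case, strip punctuation and expand common abbreviations."""
        value = value.lower().strip()
        value = re.sub(r"[^\w\s]", " ", value)  # replace punctuation marks with spaces
        tokens = [ABBREVIATIONS.get(token, token) for token in value.split()]
        return " ".join(tokens)

    def parse_full_name(free_form):
        """Naive parsing of a free-form name field: last token taken as surname."""
        tokens = standardize(free_form).split()
        if not tokens:
            return "", ""
        return " ".join(tokens[:-1]), tokens[-1]

    # Example: parse_full_name("John Jr. SMITH") returns ("john junior", "smith")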

Profiling

An important phase of a record linkage process is the choice of appropriate matching variables, which have to be as suitable as possible for the linking process at hand. The matching attributes are generally chosen by a domain expert, hence this phase is typically not automatic; however, the choice can be supported by further information that can be automatically computed. Such information is the result of a profiling activity that provides quality measures, metadata descriptions and simple statistics on the distribution of variables, which give hints on how to choose the set of matching variables.
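For instance, a simple profiling step could compute, for each candidate matching variable, its fill rate and number of distinct values: variables that are rarely missing and highly distinguishing are usually preferable as matching variables. The following Python sketch assumes hypothetical records stored as dictionaries.

    def profile(records, fields):
        """Compute simple profiling statistics for candidate matching variables.

        records: list of dictionaries, one per record (hypothetical layout).
        fields: names of the candidate matching variables.
        """
        stats = {}
        n = len(records)
        for field in fields:
            values = [r.get(field) for r in records]
            filled = [v for v in values if v not in (None, "")]
            stats[field] = {
                "fill_rate": len(filled) / n if n else 0.0,
                "distinct_values": len(set(filled)),
            }
        return stats

    # Example: profile(persons, ["surname", "birth_date", "zip_code"])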

Comparison Functions

Record linkage tools can provide support for different comparison functions. Some of the most common comparison functions are equality, edit distance, Jaro, Hamming distance, Smith-Waterman, TF-IDF, etc. (see Koudas et al., 2006 for a survey).
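As an illustration, two of these functions, the Hamming distance (defined for strings of equal length) and a normalized Levenshtein edit distance, can be sketched in Python as follows; the normalization used here is just one common convention.

    def hamming(a, b):
        """Number of positions at which two equal-length strings differ."""
        if len(a) != len(b):
            raise ValueError("Hamming distance requires equal-length strings")
        return sum(x != y for x, y in zip(a, b))

    def edit_distance(a, b):
        """Levenshtein distance computed by dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, start=1):
            curr = [i]
            for j, y in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (x != y)))  # substitution
            prev = curr
        return prev[len(b)]

    def edit_similarity(a, b):
        """Normalized similarity in [0, 1]: 1 means identical strings."""
        if not a and not b:
            return 1.0
        return 1.0 - edit_distance(a, b) / max(len(a), len(b))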

Search Space Reduction

In a process linking two datasets, say A and B, the pairs that need to be classified as matches, non-matches and possible matches are those in the cross product A × B. When dealing with large datasets, comparing the matching variables over all such pairs is almost impracticable. Several techniques based on sorting, filtering, clustering and indexing may be used to reduce the search space; blocking and sorted neighbourhood are among the most common ones (a minimal sketch of blocking is given below).
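The following minimal Python sketch illustrates blocking: only pairs agreeing on a blocking key are compared. The key used here (first letters of the surname plus the year of birth) and the field names are hypothetical choices made for the example.

    from collections import defaultdict

    def blocking_key(record):
        """Hypothetical key: first three letters of the surname plus birth year."""
        return (record.get("surname", "")[:3].lower(), record.get("birth_year"))

    def candidate_pairs(file_a, file_b):
        """Generate only the pairs of A x B that share the same block."""
        blocks_b = defaultdict(list)
        for rec_b in file_b:
            blocks_b[blocking_key(rec_b)].append(rec_b)
        for rec_a in file_a:
            for rec_b in blocks_b.get(blocking_key(rec_a), []):
                yield rec_a, rec_b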

Decision Method

The core of the record linkage process is the choice of the decision model. A record linkage tool can provide several decision rules in order to decide the status of match, non-match or possible match of record pairs. For instance, it can provide support for the Fellegi and Sunter rule, but also for a deterministic (threshold-based) rule.
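As an illustration, under the Fellegi and Sunter model each pair of records receives a weight equal to the log-likelihood ratio of its comparison pattern under the match and non-match hypotheses, and the pair is classified by comparing this weight with an upper and a lower threshold. The following Python sketch uses hypothetical m- and u-probabilities and hypothetical thresholds.

    from math import log2

    # Hypothetical m- and u-probabilities for each matching variable:
    # m = P(agreement | pair is a match), u = P(agreement | pair is a non-match).
    M_PROB = {"surname": 0.95, "birth_date": 0.90, "zip_code": 0.85}
    U_PROB = {"surname": 0.01, "birth_date": 0.005, "zip_code": 0.05}

    def pair_weight(agreements):
        """Total log2 likelihood ratio of a pair's comparison pattern.

        agreements: dict mapping each variable to True (agree) or False (disagree).
        """
        w = 0.0
        for var, agrees in agreements.items():
            m, u = M_PROB[var], U_PROB[var]
            w += log2(m / u) if agrees else log2((1 - m) / (1 - u))
        return w

    def classify(agreements, lower=-2.0, upper=6.0):
        """Fellegi-Sunter decision rule with hypothetical thresholds."""
        w = pair_weight(agreements)
        if w >= upper:
            return "match"
        if w <= lower:
            return "non-match"
        return "possible match"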

In Section 1.3, two further tables, namely on estimation methods and on strengths and weaknesses of the tools, will be described.