Rome – Italy 26 – 31 October 2003
Session No 4
Bill Iwig, Kara Daniel, Tom Pordugal, and Stan Hoge
National Agricultural Statistics Service, USA
Using Reported Data as Matching Variables in Record Linkage
Key Words: Record Linkage, Duplication, SuperMatch
1. Introduction
The National Agricultural Statistics Service (NASS) gathers information concerning various aspects of agriculture through a system of surveys. The samples for most of these surveys are drawn from the NASS Farm Register of farm operators and agri-businesses. Separate registers are maintained for each state by NASS’s State Statistical Offices (SSOs).
Maintaining a current and accurate Farm Register is a time-consuming and expensive process. The goals of maintaining high coverage of all farm operations and low duplication are often competing against each other since farm operations in the United States do not have official registered names. Consequently, a farming operation may be recorded under “name A” on one list source and the same operation may be under “name B” on another list source. New list sources are matched against the master Farm Register based on name, address, and other information using probabilistic record linkage methodology as described in the paper “Record Linkage at NASS Using AutoMatch” presented at the 13th International Roundtable on Business Survey Frames. AutoMatch is a previous version of the commercial software, SuperMatch, which serves as the matching engine in the NASS record linkage system. This software uses the record linkage theory proposed by Ivan Fellegi and Alan Sunter in a 1969 Journal of the American Statistical Association paper[1]. New source records which are determined to be non-matches are added to the Farm Register in order to improve coverage. Record linkage can also be used to identify duplication within a list where the list is matched against itself.
Recently, NASS has employed a feature of the record linkage software which enables the use of data values as matching variables. The initial use of this enhancement is to identify duplication of reported data on the 2002 Census of Agriculture data file. Data records which appear to have different identification information but similar reported data are identified as potential duplicates and reviewed by NASS statisticians as part of the data review process. This methodology can also be used to identify matched records between a new list source and a master list when corresponding data items are present on both lists.
This paper presents an overview of the steps taken to remove duplication from the Census Mail List, describes this software feature using data values as matching variables, and presents some results based on 2002 Census of Agriculture data.
2. Identifying Duplication on the Census of Agriculture Mail List
The Census Mail List (CML) for the 2002 Census of Agriculture was extracted from the NASS Farm Register on September 1, 2002 and contained approximately 2.85 million records. Extensive efforts were made during 2002 to maximize farm coverage and to eliminate duplication among potential CML records as the list was developed. The NASS record linkage system was used to merge new list sources with the existing Farm Register with the intent of avoiding duplication with records already on the register. In addition, the record linkage system was used to check for within list duplication based on name, address, and other identification information prior to the CML creation. It was difficult to make correct decisions on whether two slightly different names represented the same farm operation or different operations without actual data. In general, the goal of maximizing farm coverage resulted in many potential duplicates being included on the CML with the expectation that remaining duplication would be identified and removed during census data processing.
Several steps were taken to identify and remove duplication after the Census questionnaires were returned. Respondents receiving multiple questionnaires for the same operation were instructed to return all the questionnaires in the same envelope. Clerks were careful to identify these returned duplicate records. Some respondents used the census toll-free telephone number to report receiving duplicate reports. In addition, new questions were added to the 2002 Census of Agriculture questionnaire asking for additional names and/or addresses associated with an operation. The information reported for these questions was very helpful in determining if returned forms provided duplicate data for a farm operation. Approximately 50,000 duplicates were identified and removed during initial census processing.
3. Using Record Linkage to Identify Potential Duplication on the 2002 Census of Agriculture Data File Based on Reported Data
After census data were captured and available for data processing, additional steps were taken to remove duplication of records on the census data file. One of those steps used the NASS record linkage system to identify potential duplication based on similar name and address information. The record linkage parameters were set conservatively so that only the most likely duplicate pairs were identified. Almost all of these records were also identified as possible duplicates on the CML prior to the census mail-out. This record linkage was intended to provide one last review of these records with the benefit of having reported data to determine whether the records represented the same farming operation or different operations. Another step used a feature of the SuperMatch record linkage software to identify potential duplication of records simply based on similar reported data. Occasionally two different census report forms are received with completely different names, such as a landlord and a tenant, but representing the same farm operation. This step was intended to identify such matched records.
The Fellegi and Sunter record linkage methodology involves dividing the matching variables of each record into components. Typically, the matching variables are name, address, or other identification information and the components could be first name, last name, house number, SSN, or other specific items. If a component matches for each record, the component in assigned an agreement weight based on other record linkage parameters. If the component does not match, the component is assigned a disagreement weight. The pair of records is determined to be a match or non-match based on the sum of the agreement and disagreement weights as compared to a specified threshold value.
When conducting record linkage based on data items, each data item is considered a component. If the data values exactly match for two records, the component is assigned the full agreement weight. The SuperMatch software also allows data item components to be considered as matches if their values differ by less than a specified percentage, but in that case, the component receives a reduced agreement weight. In this way, similar data values contribute to the likelihood of two records being classified as matches.
To conduct the record linkage duplication check of the census data file, the records were blocked based on the reported principal county of the farm operation. There are 3,079 counties in the U.S., which are geo-political entities. The principal county of a farm is the county in which the majority of the agricultural products for the farm was produced. Only records within the same block were compared with one another in the record linkage software. When conducting record linkage on large files, such as the census data file, it is not possible to examine every possible match pair. Assigning records to blocks makes the task of matching more manageable.
The values of 40 data items for each possible match pair of records within a principal county were compared. This is the maximum number of variables that the SuperMatch software is designed to handle for these comparisons. The record linkage parameters were set such that if two positive values for a data item matched exactly, an agreement weight of one was assigned for that item. If the two values were within eleven percent of each other, a prorated agreement weight less than one was assigned for that item. And if the values differed by eleven percent or more, a disagreement weight of zero was assigned. This “non-tolerable” percentage difference is a record linkage parameter that needs to be assigned for each application. The choice of eleven percent for the parameter value was based procedures used for the 1997 census and the decision that a ten percent difference was a reasonable response difference for the same farm operation.
A component weight for a data item was only assigned if a positive value was available for both data records. Consequently, a data item that has a value of zero for both records does not add to the agreement weight.
The general calculation of the prorated agreement weight is as follows[2],3.
Wt(pro) = max(Wt(agr) – ((((B – A) / min(A,B)) * 100) / P(diff)) * (Wt(agr) – Wt(disagr))), Wt(disagr))
Where:
Wt(pro)= prorated agreement weight
Wt(agr)= original agreement weight
Wt(disagr)= disagreement weight
P(diff)= non-tolerable percent difference
A= positive value of data item for record A
B= positive value of data item for record B
Values for the census unduplication record linkage application are the following.
Wt(agr)= 1
Wt(disagr)= 0
P(diff)= 11
Consequently, the value of Wt(pro) will be between 0 and 1. A difference of eleven percent or more between the record A and record B values results in a weight of zero. Table 1 shows the prorated agreement weight values when comparing a value of 100 to values ranging from 99 to 90.
Table 1: Examples of Prorated Agreement Weights
For A = 100 and B = 90 to 99
A / B / Wt(pro)100 / 99 / .91
100 / 98 / .81
100 / 97 / .72
100 / 96 / .62
100 / 95 / .52
100 / 94 / .42
100 / 93 / .32
100 / 92 / .21
100 / 91 / .10
100 / 90 / .00
As mentioned previously, 40 data items were compared for each possible match pair, which included land utilization, crop acres, livestock numbers, and economic variables. Most records had positive values for less than 10 of these items. The sum of the agreement and disagreement weights was calculated for each pair of records within each principal county. Record pairs which produced a sum of weights of at least five were identified as possible matches and output for review by statisticians in the SSOs to determine if they were actual duplicates. Consequently, if five or more data items exactly matched, the record pair would be identified as a possible match. The cutoff of five was used for the 1997 census and was considered reasonable to identify potential duplicates, but is not based on any empirical evidence. Resolving the possible duplication might have required telephone follow-up to at least one of the respondents.
4. Results of Census Unduplication Record Linkage Based on Census Data
The unduplication record linkage processing for a State was conducted when the response rate reached approximately 75%. This included a large majority of the final data but also allowed the SSOs to start their review before data were final in order to keep on the census schedule. The census data file contained approximately 1.1 million records with positive data at the time of the record linkage processing. Approximately 15,500 potential duplicate pairs of records were identified, which included records with similar identification information as well as similar reported data. The number of potential duplicates just based on reported data is approximately 1,500. The number of these potential duplicates that were determined by statisticians in the SSOs to be actual duplicates is likely less than 500. While this is a relatively small number at the U.S. level, duplication can significantly degrade the quality of county data, which are of primary interest by data users.
Following are a couple examples of the types of potential duplicate record pairs based on reported data which were identified for review and a brief explanation of the outcome of the review.
Example 1
Whole Name / Operation Name / Land Owned (acres) / Total Land Operated (acres) / Land in Principal County / Beef Cows / Total CattleT. F. Nickel / 476 / 476 / 476 / 40 / 45
J. K. Marshall / 476 / 476 / 476 / 40 / 45
Review of the data shows that T. F. Nickel has an address in another state. He is the landlord for this operation, which is actually operated by J. K. Marshall. The T.F. Nickel record was coded as a duplicate and the reported data were not summarized in the census.
Example 2
Whole Name / Operation Name / Land Owned (acres) / Total Land Operated (acres) / Land in Principal County / Beef Cows / Total CattleT. G. Sayers / T. G. & J. Sayers / 80 / 80 / 80 / 20 / 21
K. D. Myers / K. D. & S. Myers / 80 / 80 / 80 / 20 / 21
These two records were identified as potential duplicates due to matching values for five items. The SSO verified that they are separate family operations and are not duplicated. Farms with approximately 80 acres and 20 cattle are normal for this county.
5. Conclusions
Typically, record linkage is used to identify matched and non-matched records on two data files based on name, address, and other identification information. Record linkage based on matching data values is a useful feature when the two data files contain common data items. This feature can also be useful when applying record linkage to identify duplicate records within a data file.
NASS used 40 data items as matching variables in a record linkage application to identify duplication within the preliminary 2002 Census of Agriculture data file. A number of different steps were taken prior to this record linkage to identify duplication based on name, address, and other identification information. Using record linkage to match on data items was a last step to eliminate duplication in the census data file. This record linkage step likely identified less than 500 duplicates across the U.S. out of over 1.1 million records. The impact of this duplication removal was fairly minor on U.S. data, but could be significant for some counties. Since census data will be used to update the Farm Register, this duplication removal not only improves the quality of the published census data but will also improve the quality of the Farm Register.
The SuperMatch software feature that provides matching on data items was fairly easy to implement, so NASS feels this step was cost-effective. However, to apply this matching feature most effectively, the response variance characteristics of the data should be studied to identify more accurate parameter values. What response differences, relative to size and number, should be expected for the data items if they represent the same farming operation? Should matching “zero” responses contribute to the component weight? Should highly correlated data items be included as matching variables? Should edited and/or imputed values also be included in the matching, in addition to reported data? These questions are especially important to consider if a large number of duplicate records are expected in the data set.
1
[1]Fellegi, Ivan P. and Sunter, Alan B., “A Theory of Record Linkage”. Journal of the American Statistical Association, Vol. 64, 1183 – 1210, 1969.
[2]AUTOMATCH User’s Manual Version 4.0, MatchWare Technologies, Inc., 1996
3 INTEGRITY Data Re-engineering Environment User Guide Version 3.11, Vality Technology Inc., 2001