Beijing, China 18-22 October 2004
Session 7
John C Hughes and Steve Vale, Office for National Statistics, UK
Improved register data matching and its impact on survey population estimates
Abstract
This paper describes recent work to improve the data matching techniques used in the maintenance of the UK statistical business register. This work has focussed on improving the quality of sampling frames by introducing greater certainty regarding the potential duplication of units. As a result, virtually all of the enterprises currently excluded from survey frames will either be matched to other enterprises or confirmed not to be duplicates.
1. Introduction
The Inter Departmental Business Register (IDBR) provides the sampling frame for the surveys that support UK economic statistics[1]. It is designed to meet European Union requirements as set out in the Community Regulation on the harmonisation of business registers for statistical purposes.
The IDBR is based on two main administrative sources, Value Added Tax (VAT) and Pay As You Earn (PAYE) data. There is no standard referencing system to identify businesses within the UK other than the Company Number which is only present for corporate businesses. Thus there is a need to match data from these two sources to eliminate duplication. The main variables used for matching are name, trading style, address, postcode and legal form.
In order to prevent duplication, new enterprises for which we only have PAYE data are excluded from sampling frames and register population estimates until they have either been matched to other units or proven to be genuine single source enterprises. Resource constraints mean that enterprises with fewer than ten persons employed that are not matched to other units are not routinely included in register proving surveys. These enterprises are identified using a quality code ("inquiry stop"). Enterprises where this code is 6, 7 or 9 (approximately 9.7% of enterprises on the IDBR - the green segments in figure 1 below) are excluded from sampling frames.
Figure 1 - Different categories of register units
Figures taken from the IDBR in October 2003
Over time there has been a build-up of these non-integrated enterprises with fewer than ten persons employed (see table 1 below). To help address this issue, the ONS recently completed a Eurostat-funded project to investigate issues of data linkage and assess the impact of improved matching on business structures and survey population estimates. This paper summarises how the changes in matching methodology have resulted in improvements in the survey population estimates.
Table 1 - PAYE only unproven enterprises, 0 – 9 persons employed
October 2002 – October 2003
DATE / Out of Scope (Agriculture) / Out of Scope (Other) / In Scope / TOTAL
Oct-02 / 17,907 / 163,453 / 154,271 / 335,631
Jan-03 / 17,631 / 169,439 / 154,498 / 341,568
Feb-03 / 17,512 / 155,575 / 161,367 / 334,454
Mar-03 / 17,498 / 154,924 / 163,369 / 335,791
Apr-03 / 17,338 / 178,981 / 156,220 / 352,539
May-03 / 17,219 / 160,722 / 167,119 / 345,060
Jun-03 / 17,123 / 176,282 / 163,454 / 356,859
Jul-03 / 17,080 / 174,009 / 166,270 / 357,359
Aug-03 / 17,080 / 174,009 / 166,270 / 357,359
Sep-03 / 16,957 / 160,565 / 173,369 / 350,891
Oct-03 / 17,042 / 186,303 / 167,071 / 370,416
2. Current matching software and procedures
Matching on the IDBR is currently carried out in two stages. The first stage is "name key" matching, and the second uses a matching software tool, SSA Name3[2]. All units on the IDBR have a name key based on the name of the business. New births to the register are matched against existing units using the name key. All 100% matches are then linked on the register. This system produces strong matches but has limitations, as the name key creation conventions differ depending on the source of the data.
Those births that fail name key matching are then subject to SSA Name3 matching, which is carried out initially on name only and then, if matches score above a given threshold, on name, address and postcode. Scores are allocated for each of these components, and are used to determine definite, possible and no matches.[3]
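The two-stage flow described above can be sketched as follows. This is a minimal illustration only, not the ONS implementation. The name key rule, the three thresholds and the equal weighting of the component scores are all assumptions, and Python's standard difflib comparison stands in for SSA Name3, which is a proprietary product whose algorithm is not described here.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the paper does not state the real values.
NAME_GATE = 0.70           # name-only score needed to proceed to full scoring
POSSIBLE_THRESHOLD = 0.75  # combined score for a "possible" match
DEFINITE_THRESHOLD = 0.90  # combined score for a "definite" match


def name_key(name: str) -> str:
    """Hypothetical name key: upper-case letters and digits only."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; not the SSA Name3 algorithm."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def same_postcode(a: str, b: str) -> bool:
    """Compare postcodes ignoring spaces and case."""
    return a.replace(" ", "").upper() == b.replace(" ", "").upper()


def classify(birth: dict, existing: dict) -> str:
    """Return 'definite', 'possible' or 'none' for one candidate pair."""
    # Stage 1: a 100% name key match links the units outright.
    if name_key(birth["name"]) == name_key(existing["name"]):
        return "definite"
    # Stage 2a: the name-only comparison acts as a gate.
    name_score = similarity(birth["name"], existing["name"])
    if name_score < NAME_GATE:
        return "none"
    # Stage 2b: score name, address and postcode, then combine (equal weights assumed).
    address_score = similarity(birth["address"], existing["address"])
    postcode_score = 1.0 if same_postcode(birth["postcode"], existing["postcode"]) else 0.0
    total = (name_score + address_score + postcode_score) / 3
    if total >= DEFINITE_THRESHOLD:
        return "definite"
    if total >= POSSIBLE_THRESHOLD:
        return "possible"
    return "none"
```

In this sketch a pair that falls below the name-only gate is never compared on address or postcode, which is the limitation addressed in the next section.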
3. Enhancing the primary matching key
Using only name in the first step of the SSA Name3 matching reduces the number of matches. Using address and postcode information at this point would improve matching but would create a data storage problem. The research carried out during the project revealed that approximately 33% of units that fail the name-only stage of the automatic matching process can subsequently be matched clerically. Clearly this creates a significant clerical burden. However, the research showed that over 95% of the units that could be matched clerically had a good postcode match. Therefore, the matching system will be enhanced to introduce the use of part of the postcode to improve automatic matching rates at the initial name match stage.
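One way to picture this enhancement is an initial stage that accepts a weaker name score when part of the postcode agrees. The sketch below is an assumption-laden illustration: the paper does not say which part of the postcode is used, so the outward code (everything before the final three characters of a UK postcode) is assumed here, and the two thresholds and the difflib comparison are hypothetical stand-ins.

```python
from difflib import SequenceMatcher

STRICT_NAME_SCORE = 0.80   # hypothetical: name alone is enough above this
RELAXED_NAME_SCORE = 0.60  # hypothetical: enough when the outward code agrees


def outward_code(postcode: str) -> str:
    """Return the part of a UK postcode before the final three characters."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3] if len(compact) > 3 else compact


def name_score(a: str, b: str) -> float:
    """Stand-in name similarity in [0, 1]; not the SSA Name3 algorithm."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def passes_initial_stage(birth: dict, existing: dict) -> bool:
    """Decide whether a candidate pair goes forward to full scoring."""
    score = name_score(birth["name"], existing["name"])
    if score >= STRICT_NAME_SCORE:
        return True
    # A partial postcode agreement rescues pairs with a weaker name score.
    same_area = outward_code(birth["postcode"]) == outward_code(existing["postcode"])
    return same_area and score >= RELAXED_NAME_SCORE
```

Storing only the outward code per unit keeps the extra data small, which is consistent with the storage concern noted above, although the actual design choice may differ.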
The impact of this change will be to reduce the number of units outside the survey frames by matching them to units already included. This will reduce the overall number of enterprises on the IDBR, but the only impact on the survey frames will be an improvement in the measure of employment for the units affected, as employment imputed from VAT turnover will be replaced by employment obtained directly from the PAYE system.
4. Re-matching data and the use of cleaned addresses
Current procedures allow for matching to take place only for new births onto the register. With the exception of ad hoc matching, no subsequent re-matching takes place. The project set out to prove that periodically re-matching data adds value to the process by incorporating changes to data over time.
In order to further enhance the re-matching process it is preferable to hold address data of the highest possible quality. Therefore much of the analysis carried out was based on re-matching data using ‘cleaned’ addresses. The address cleaning process is carried out using additional software (Matchcode 5 by Capscan[4]).
From the research carried out, it was estimated that this process will improve matching rates by approximately 3%. However, in addition to an improvement in matching, this process (together with the enhancement described in section 3) also allows greater confidence regarding the status of the remaining unmatched units in terms of their likelihood to match.
5. Other enhancements
Various other enhancements to the matching process will also be carried out. Individually the impact of these enhancements is low, but in combination they help to increase the certainty that the remaining unmatched units are genuine. These enhancements include:
· Standardisation of name key generation routines
· Better separation of compound names into name and trading style (e.g. treating ‘John Smith trading as Smiths Bakery’ as two names, ‘John Smith’ and ‘Smiths Bakery’ for matching purposes)
· More use of company numbers to assist the matching of corporate units, and investigation into corporate units that do not have a company number.
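The separation of compound names mentioned in the list above could be sketched as below. The separator patterns ("trading as" and the abbreviation "t/a") and the function name are assumptions made for illustration; the conventions actually used on the IDBR are not described in this paper.

```python
import re
from typing import Optional, Tuple

# Hypothetical separators between a name and a trading style.
_SEPARATOR = re.compile(r"\s+(?:trading\s+as|t/a)\s+", re.IGNORECASE)


def split_compound_name(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'X trading as Y' into (name, trading style).

    Returns (name, None) when no separator is found.
    """
    parts = _SEPARATOR.split(raw.strip(), maxsplit=1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    return parts[0], None
```

For example, split_compound_name("John Smith trading as Smiths Bakery") returns ("John Smith", "Smiths Bakery"), so each part can be matched separately.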
6. Overall impact of the changes on the IDBR
Figure 2 shows the potential movement of enterprises from the non-integrated part of the IDBR (inquiry stops 6, 7 & 9) into scope of the survey frames. Research carried out as part of the project, together with previous clerical matching exercises, indicates that PAYE only enterprises which match to units already within the survey population could potentially match any enterprise, but are most likely to be matched to those that are based on VAT only.
It is estimated that approximately 60,000 PAYE only enterprises will match to units in the survey population as a result of the combination of improvements described above. This stage will not result in an increase in the survey population, as the total number of enterprises in scope will remain unchanged. There will, however, be more VAT and PAYE based enterprises and fewer VAT only enterprises, thus the quality of the employment measures on the register should improve.
Figure 2 - Impact of the new matching procedures.
The greatest impact on the register will result from the movement of the remaining unmatched PAYE only enterprises (approximately 140,000) into scope of the survey frames. Until now, it has been considered too great a risk to use these enterprises in survey frames. However, research has shown that the remaining unmatched enterprises have a less than 5% chance of finding a match with units that are already part of the survey population.
The risk of duplication is, therefore, minimal, and is outweighed by the risk of ‘under-estimating’ the population. In figure 2, the largest arrow represents the movement of approximately 120,000 PAYE based units into the survey population. This action will have the effect of increasing the population by almost 6% in terms of the number of enterprises, but will increase the employment by only 1.4%. This is because nearly all of the enterprises currently excluded are small in terms of employment (the majority have only one or two persons employed).
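The relationship between the two percentages quoted above can be checked with simple arithmetic. The sketch below treats the rounded figures in the text (120,000 units, 6% and 1.4%) as if they were exact, which is an assumption, so the results are indicative only. The function names are hypothetical.

```python
def implied_base(added_units: float, fractional_increase: float) -> float:
    """Population size implied by adding `added_units` for a given fractional rise."""
    return added_units / fractional_increase


def relative_average_size(employment_increase: float, count_increase: float) -> float:
    """Average employment of the added units relative to existing units.

    If counts rise by n/N and employment by e/E, then
    (e/n) / (E/N) = (e/E) / (n/N).
    """
    return employment_increase / count_increase


# Treating "almost 6%" as 6% gives a lower bound on the existing population.
base = implied_base(120_000, 0.06)
# Added units average under a quarter of the employment of existing units.
ratio = relative_average_size(0.014, 0.06)
```

On these figures the existing survey population is of the order of two million enterprises, and the added enterprises have on average roughly a quarter of the employment of those already in scope, which is consistent with most of them having only one or two persons employed.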
7. Managing units which are potential duplicates or in transition
As the IDBR is a dynamic register which is updated on a daily basis, there will always be some units which are out of scope of survey frames because they are ‘in transition’. For example, large new units (10 or more persons employed) are excluded from survey frames until they have been either matched to existing enterprises or confirmed as real births.
It is anticipated that the number of units in transition will be no greater than 15,000 at any point in time and that no individual units will remain in transition for more than three months – which represents the cycle of re-matching proposed in the new system and the period of the Business Register Survey. However, it is important to monitor the number of units in transition in order to prevent any future accumulation of unmatched units, or units becoming ‘stuck’ in the process.
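The monitoring described above could take the form of a periodic check such as the following sketch. The data layout (a mapping from a unit reference to the date it entered transition), the function name, and the reading of "three months" as 92 days are all assumptions for illustration; only the 15,000 ceiling and the three-month limit come from the text.

```python
from datetime import date, timedelta
from typing import Dict, List

MAX_IN_TRANSITION = 15_000     # anticipated ceiling stated in the text
MAX_AGE = timedelta(days=92)   # assumption: "three months" taken as 92 days


def transition_report(entered: Dict[str, date], today: date) -> dict:
    """Summarise units in transition and flag those that look 'stuck'.

    `entered` maps a (hypothetical) unit reference to the date the unit
    entered transition.
    """
    stuck: List[str] = sorted(
        ref for ref, start in entered.items() if today - start > MAX_AGE
    )
    return {
        "count": len(entered),
        "over_limit": len(entered) > MAX_IN_TRANSITION,
        "stuck": stuck,
    }
```

Running such a check at each re-matching cycle would show both whether the total is approaching the anticipated ceiling and which individual units have outlasted the cycle.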
8. Conclusions
As a result of the research described above, the following conclusions can be drawn:
· Matching rates will be improved by the introduction of regular re-matching using cleaned addresses.
· Initial matching by name can be made more effective if at least part of the postcode is included.
· The proposed range of improvements to the matching process will increase the certainty that the remaining unmatched units are genuinely single source to the point where the risk of duplication if they are included in the survey population is outweighed by the risk of under-coverage if they are left out of this population.
· Desk profiling and clerical matching will reduce the risk of duplication still further, and should be targeted at categories of units deemed to be “high risk”. Cost-benefit considerations will determine the amount of clerical resource available for this work.
[1] For more information see http://www.statistics.gov.uk/idbr
[2] See http://www.searchsoftware.com/
[3] For more information on the matching process see ‘Matching Records Without a Common Identifier - The UK Experience’ www.europa.eu.int/en/comm/eurostat/research/conferences/etk-99/papers/vale.pdf
[4] See http://www.capscan.co.uk/prodmcd5.htm