22nd Meeting of the Wiesbaden Group on Business Registers

- International Roundtable on Business Survey Frames

Tallinn, 27 – 30 September 2010
Andrew Allen
ONS
UK
Session No 4
Co-operation with Administrative Sources

Lessons Learnt from an Administrative Data Failure

Abstract

The UK business register receives administrative data on employment taxation (PAYE records) from the revenue collection department, Her Majesty's Revenue and Customs (HMRC). The introduction of a new computer system at HMRC has led to many problems with the supply of data. The paper outlines how the data are used in the business register and then describes the problems with the receipt of data from the new computer system. It outlines the lengthy quality assurance processes that were put in place to try to ensure that a consistent dataset was supplied, and then explains what lessons have been learnt from the experience.

Background

The business register in the UK relies on administrative data from the tax department, Her Majesty's Revenue and Customs (HMRC). HMRC provides two data sets: Value Added Tax (VAT) records and “pay as you earn” (PAYE) records, which relate to income tax on employment. The VAT record is the primary source of data for the business register, and a daily feed is received from HMRC. The PAYE data is the secondary source and is received quarterly.

In 2008, HMRC notified the ONS that they would be introducing a new computer system for the PAYE data. ONS worked closely with HMRC over the following year to ensure that the data required from the new system was closely specified. A transition process was agreed which involved the supply of various levels of test data, to ensure technical delivery and to ensure that the data coming from the new system was consistent with the old data supply.

The problem

In July 2009, the first test files were received and processed by the ONS IT team. These files contained dummy data and were sent to allow technical testing of the record structure. No problems were encountered at this stage.

Then in August 2009, the first full data test files were received, ahead of the main delivery in September. These were processed on a test database. The ONS maintains a test version of the business register which is used for testing system enhancements and other changes. The results from the test system indicated that there were problems with the data: there were major differences at unit level between the last live delivery from the old system and the delivery from the new system.

A list of data quality problems was sent to HMRC, who, together with their external computer contractors, set about using this information to identify and fix the system problems.

Amendments were made to the delivery by extracting data from the HMRC data warehouse using a tactical solution, i.e. a short-term fix, while the problems with the new system were resolved.

Unfortunately the development of the tactical solution was too slow, which led to a missing quarter of employment data. The business register system could not function properly with a quarter’s employment missing, because there would be a knock-on effect on a number of processes that use the employment data.

So, after consultation with ONS Methodologists, the previous quarter’s employment was rolled forward. Effectively we now had two Junes instead of a June and September 2009.
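The roll-forward described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `roll_forward`, the quarter labels, the dictionary layout and the `imputed` flag are all hypothetical and are not taken from the register system.

```python
def roll_forward(history, previous_quarter, missing_quarter):
    """Fill a missing quarter by copying the previous quarter's employment.

    `history` maps a quarter label to a dict holding per-enterprise
    employment. The layout and the `imputed` flag are illustrative
    assumptions, not the register's actual design.
    """
    updated = dict(history)  # leave the caller's history untouched
    updated[missing_quarter] = {
        "employment": dict(history[previous_quarter]["employment"]),
        "imputed": True,  # mark the quarter as carried forward
    }
    return updated


# Hypothetical data: June 2009 is present, September 2009 is missing.
history = {
    "2009-06": {"employment": {"E1": 12, "E2": 340}, "imputed": False},
}
filled = roll_forward(history, "2009-06", "2009-09")
```

Flagging the carried-forward quarter is a design choice in this sketch; it makes it possible to exclude the repeated June values when growth rates are analysed later.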

The tactical solution then became available in time for the December 2009 delivery. This was extensively quality assured by processing in the test system, before we finally accepted it onto the main system.

The tactical solution has been used since, and we are about to get the first full delivery from the new system in September 2010. This will require a full range of checks on the test system before we can be sure that it is of sufficient quality.

Programme of Quality Assurance

A micro and macro approach was used to quality assure the employment data. This approach is continuing each quarter until we are satisfied that the quality has settled down.

Firstly we examined a number of individual records where we had knowledge of the employment from other sources. This happened at a time of recession in the UK, and there were a number of high-profile business closures. We checked that their employment was behaving as expected, and indeed it was growth in employment in some of these that first alerted us to the problems.

We also had recent employment returns from the Business Register Survey. There are known reasons for differences in the level of employment between these two sources, but if the differences were bigger than expected, we knew there was a problem with one source. We also used data collection staff to telephone a sample of businesses where there appeared to be a problem, to get each business to confirm its employment.
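The comparison of the two sources can be illustrated with a minimal Python sketch. It assumes a simple relative tolerance; the function name `flag_discrepancies`, the 25% threshold and the example figures are hypothetical and do not reflect the thresholds actually used.

```python
def flag_discrepancies(paye, brs, rel_tolerance=0.25):
    """Return ids whose PAYE and survey employment differ by more than
    the tolerance, relative to the survey value.

    Only enterprises present in both sources are compared. The 25%
    tolerance is an illustrative assumption.
    """
    flagged = []
    for ent_id in sorted(paye.keys() & brs.keys()):
        survey = brs[ent_id]
        admin = paye[ent_id]
        baseline = max(survey, 1)  # avoid dividing by zero
        if abs(admin - survey) / baseline > rel_tolerance:
            flagged.append(ent_id)
    return flagged


# Hypothetical employment figures keyed by enterprise id.
paye_employment = {"A": 120, "B": 40, "C": 900}
brs_employment = {"A": 115, "B": 10, "C": 880, "D": 5}
suspects = flag_discrepancies(paye_employment, brs_employment)
print(suspects)  # ['B']
```

In this sketch the flagged enterprises would be the candidates for a follow-up telephone call.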

The business register contains over 2 million enterprises, so there was a limit to individual record checking. So in parallel we examined the total employment on the register over a number of quarters and compared this with the test system.

Three different analyses were carried out:

The first was to try and uncover the impact of carrying June 2009 data forward into September. This was not straightforward, because of the way the register is updated from a mix of sources. After a top level check that was inconclusive, we needed to filter out enterprises that had their employment updated from other sources such as the business register survey or the structural business survey. This left the residual that was updated from PAYE and we could look at the growth rate for it.
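As a rough sketch of this first analysis, the Python below filters out enterprises updated from other sources and computes the growth rate of the PAYE-updated residual. The record layout, the `source` labels and the function name `paye_residual_growth` are assumptions made for illustration only.

```python
def paye_residual_growth(records):
    """Growth rate of employment for enterprises last updated from PAYE.

    Each record is a dict with hypothetical keys: `source` (where the
    employment was last updated from), `emp_prev` and `emp_curr`.
    Returns None when the previous-quarter total is zero.
    """
    paye_only = [r for r in records if r["source"] == "PAYE"]
    prev = sum(r["emp_prev"] for r in paye_only)
    curr = sum(r["emp_curr"] for r in paye_only)
    if prev == 0:
        return None
    return (curr - prev) / prev


# Hypothetical register extract: one survey-updated and two PAYE-updated units.
records = [
    {"id": 1, "source": "PAYE", "emp_prev": 10, "emp_curr": 10},
    {"id": 2, "source": "BRS", "emp_prev": 500, "emp_curr": 450},
    {"id": 3, "source": "PAYE", "emp_prev": 30, "emp_curr": 32},
]
growth = paye_residual_growth(records)
print(f"{growth:.1%}")  # 5.0%
```

The large survey-updated unit is excluded, so its fall in employment does not mask the movement in the PAYE-updated residual.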

The second analysis was the quarterly monitoring of the PAYE files. This looked at quarter on quarter growths by employment size, as well as unusual changes in employment such as the number of PAYE schemes going from zero to greater than zero employment and vice versa. This analysis did not directly assess the impact on the register but looked to ascertain the overall quality of the file delivered by HMRC.
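The quarterly monitoring could look something like the following Python sketch. The size bands, the function names and the example data are hypothetical; the bands actually used in the monitoring are not stated here.

```python
from collections import defaultdict


def size_band(employment):
    """Assign an employment size band (illustrative bands only)."""
    if employment == 0:
        return "0"
    if employment < 10:
        return "1-9"
    if employment < 50:
        return "10-49"
    if employment < 250:
        return "50-249"
    return "250+"


def monitor_quarter(prev, curr):
    """Compare two quarterly PAYE files keyed by scheme reference.

    Returns quarter-on-quarter growth by the previous quarter's size
    band, plus counts of schemes moving from zero to positive
    employment and from positive to zero.
    """
    totals = defaultdict(lambda: [0, 0])  # band -> [previous, current]
    zero_to_positive = 0
    positive_to_zero = 0
    for scheme in prev.keys() & curr.keys():
        before, after = prev[scheme], curr[scheme]
        band = size_band(before)
        totals[band][0] += before
        totals[band][1] += after
        if before == 0 and after > 0:
            zero_to_positive += 1
        elif before > 0 and after == 0:
            positive_to_zero += 1
    # Growth is undefined for the zero band, so it is left out.
    growth = {
        band: (after - before) / before
        for band, (before, after) in totals.items()
        if before > 0
    }
    return growth, zero_to_positive, positive_to_zero


# Hypothetical scheme-level employment for two consecutive quarters.
prev_file = {"s1": 5, "s2": 0, "s3": 100, "s4": 20}
curr_file = {"s1": 6, "s2": 12, "s3": 0, "s4": 20}
growth_by_band, became_active, became_empty = monitor_quarter(prev_file, curr_file)
```

An unusually high count in either direction, or implausible growth in one band, would point to a problem with the file rather than a real change in the economy.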

The third analysis matched information on individuals from the Annual Survey of Hourly Earnings to large PAYE returns which were known to be suspicious. This looked at the date of birth, which helped identify pension schemes. The inclusion of pension schemes had been identified as one of the data quality problems.

Impact

The micro and macro quality assurance work has meant that so far the impact on surveys has been negligible, and unless any new problems emerge, we do not anticipate problems for surveys. Any residual risk should have a limited impact, because most of the large businesses receive employment data from the Business Register Survey (BRS). PAYE is only used to update the employment on businesses which have not been in the BRS during the last four years. This means that PAYE is used to update mostly small businesses.

The business register (the Inter-Departmental Business Register, IDBR) contains 2.3m businesses, and the latest analysis indicates that while 1.1m of those are updated from PAYE, the employment coverage of PAYE is a much lower proportion: 4.7m employment out of a register total of 28m.
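To make the contrast concrete, the shares implied by these figures can be worked out directly. The short Python calculation below uses only the rounded numbers quoted above, so the percentages are approximate.

```python
# Rounded figures quoted in the text.
businesses_total = 2.3e6
businesses_paye = 1.1e6
employment_total = 28e6
employment_paye = 4.7e6

share_businesses = businesses_paye / businesses_total
share_employment = employment_paye / employment_total

print(f"Businesses updated from PAYE: {share_businesses:.1%}")  # 47.8%
print(f"Employment updated from PAYE: {share_employment:.1%}")  # 16.8%
```

So PAYE updates nearly half of the businesses on the register but only about one sixth of the employment.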

The quality checking identified a range of problems with the data, which HMRC were able to resolve with temporary fixes. There is the risk of an undiscovered problem that would have a wider impact, but the overall register employment is monitored closely when each PAYE dataset is received, and no large changes to the overall register employment have been seen so far.

Lessons Learnt

The main lesson learnt is that administrative data is not perfect. By definition, administrative data is designed for another purpose, and statisticians have to make do with what is available. The new tax computer system was designed with the requirements of tax collection in mind, and the statistical requirements were not a priority.

This problem also highlighted the value of multiple data sources when trying to maintain a high quality business register. For example, in this case we were able to validate the PAYE data against the Business Register Survey. We also still have the VAT data for information on births, deaths and turnover size. So even if there is considered to be duplication in source data, this case demonstrates the value of exploiting more than one data source.

It is important to work closely with the administrative data supplier to build a good understanding of their data. As a result of the problem, we recently held a workshop with our HMRC colleagues to discuss the administrative processes that they use and the statistical needs of the business register. With the benefit of hindsight, this would have been useful to do earlier. A lot of information about the administrative data gets lost over the years, and assumptions are made about it. By speaking in detail about our respective needs we were able to get a better understanding of the data. This is an ironic benefit of the crisis, because without the computer system problem we would not have got the attention of the key HMRC staff to question them on the detail of their processes.

Different terminology can lead to misunderstandings, so there is a need for clearer metadata. For example, the tax system uses “main” and “sub” as a way of classifying employment. This has been used as a proxy for full-time and part-time employment by statistical users, but dialogue with HMRC has indicated that we cannot soundly make this assumption.

Users of the business register are always trying to stretch its uses. By talking to HMRC colleagues we have a better understanding of the data and can advise users when they are trying to over-analyse the register. For example, some survey users were keen to sample using full-time equivalent employment, which relied on assumptions about full-time and part-time employment from the administrative data that, as we have seen, were incorrect.

Conclusions

The business register is dependent on administrative data, but the quality of that data cannot be controlled by the register team. Therefore it is important to have multiple data sources to reduce the risk of register quality problems. Also it is important to develop detailed knowledge of the administrative data and keep a regular dialogue with the suppliers.