ESSnet
Use of Administrative and Accounts Data in Business Statistics
WP3
DEVELOPMENT OF ESTIMATION METHODS FOR BUSINESS STATISTICS VARIABLES WHICH CANNOT BE OBTAINED FROM ADMINISTRATIVE SOURCES
Variable:
Purchases of goods and services for resale in the same condition as received
Ria Sanderson, Duncan Elliott, Daniel Lewis and Tracy Jones
UK Office for National Statistics
Methods for estimating Purchases of goods and services for resale
- Deliverable 3.3 for SGA 2010
1. Introduction
The scope of WP3 is to investigate methods of estimation for variables where administrative data are not directly available. This report considers methods of estimation for the Structural Business Statistics (SBS) variable "purchases of goods and services for resale in the same condition as received". This variable is a component of total purchases of goods and services, and its full definition is available from the Commission of the European Communities (1998). The situation considered is one where administrative data are not directly available to replace survey data; within this framework, two approaches have been investigated to see if administrative data can be usefully employed. The first approach is to consider how administrative data can be used to predict purchases of goods and services for resale if the survey were to be discontinued. The second approach introduces a "take-none" stratum, where a subset of the survey population is no longer considered eligible to participate in the survey (e.g. Elliott et al 2010; Yung & Lys 2007; Zach 2007). In this case, it is necessary to identify how suitable administrative data sources and estimation methods can be used to model the contribution of the excluded businesses. In both cases, it is crucial that the quality of the variable of interest is maintained.
2. Data sources
The estimation methods described in this report were primarily tested using data from the UK Office for National Statistics. In the UK, purchases of goods and services for resale is currently collected through the Annual Business Survey (ABS). This is a stratified simple random sample of the UK economy, where businesses with employment of 250 or more are completely enumerated. It is the experience of the countries participating in WP3 that this variable is generally collected through a survey, with administrative data being used to estimate purchases in some countries for the smallest businesses, and also to estimate partial missing responses. Two sources of administrative data were considered in this investigation, to see if they were useful in estimating purchases of goods and services for resale. The first source was VAT turnover, and the second was data available from annual company accounts. This took the form of profit and loss accounts, balance sheets and cash flow statements. The VAT turnover is collected on a series of different staggers, and can be reported monthly, quarterly or annually. Some basic cleaning of the VAT data was performed to detect suspicious reporting patterns and suspicious values. The data were then annualised using a very simplistic approach for this investigation, by summing monthly data to cover calendar years, and by dividing quarterly data into monthly data before summing to cover calendar years.
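The simple annualisation described above can be sketched as follows. This is an illustration only, not the processing code used in the investigation: the function name `annualise`, the record layout (start year, start month, number of months covered, turnover) and the even split of multi-month returns across the months they cover are all assumptions made for the example.

```python
from collections import defaultdict

def annualise(returns):
    """Sum VAT turnover to calendar years.

    `returns` is an iterable of (start_year, start_month, n_months, turnover)
    tuples: n_months is 1 for a monthly return and 3 for a quarterly one.
    Multi-month returns are split evenly across the months they cover
    (an assumption of this sketch) before being summed by calendar year.
    """
    totals = defaultdict(float)
    for start_year, start_month, n_months, turnover in returns:
        monthly_value = turnover / n_months
        for offset in range(n_months):
            # Zero-based month count so that a return crossing December
            # rolls over into the following calendar year.
            month_index = (start_month - 1) + offset
            year = start_year + month_index // 12
            totals[year] += monthly_value
    return dict(totals)

# Hypothetical data: a quarterly return on a November-January stagger
# straddles two calendar years.
example = [
    (2007, 11, 3, 300.0),  # Nov 2007 to Jan 2008, quarterly
    (2008, 2, 1, 50.0),    # Feb 2008, monthly
]
annual_totals = annualise(example)
```

In this example two thirds of the quarterly return fall in 2007 and one third in 2008, which is the effect of dividing quarterly data into monthly data before summing.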
The ABS survey universe covers a wide section of the UK economy. There is not always a one-to-one match between businesses on the VAT register and businesses in the target population of the ABS. The match rate between the VAT turnover data and the ABS universe is typically 73-75%. This varies with the size of the business, with larger businesses exhibiting a poorer match rate, most likely because these businesses have more complicated structures for reporting VAT data. There are approximately 200 variables available from the company accounts data, but these are also affected by the difficulty in obtaining a one-to-one match between the administrative data source and the survey universe. Typically, the match rate between company accounts data and the survey universe is 43-46%, depending on the year. Annual accounts can also be returned throughout the year, so where matches do exist, they are likely to cover different time periods. Basic cleaning was performed to remove instances of the same business submitting accounts more than once within the same year, but the complex structures of some businesses mean that double entries may remain in the data. The level of detail required when submitting accounts information also varies depending on the size of the business, so many of the variables have a high proportion of missing values, even after allowing for the match between the administrative data source and the survey universe.
To summarise, the nature of the available administrative data is as follows. Neither source gives a complete match to the survey universe, and both data sources have missing values once a match to the survey universe has been performed, although the percentage of missing VAT turnover in the matched data is small (typically less than 1%). The correlation between purchases of goods and services for resale and the company accounts variables varies depending on the variable and is generally low; the strongest correlations among the variables with less than 50% missing data are 0.41 (2007, "other debtors") and -0.28 (2007, "other current liabilities"). The correlation between purchases of goods and services for resale and VAT turnover ranges from 0.12 to 0.44, depending on the year. The methods that follow consider simple treatments for missing values, and also basic methods for allowing for the match rate between the administrative data and the survey universe. In general, the administrative data sources do not need to offer complete coverage of the survey universe, as long as appropriate assumptions are made about the nature of the matching and the presence of missing data.
3. Methods Used
The estimation methods tested are based on the following assumed situation. The survey universe for a sample survey can be partly matched to an administrative data source, leaving a "matched" universe and a non-matched universe. This situation is illustrated in Figure 1.
Figure 1. The nature of the match between the survey population and the administrative data. Part of the survey population cannot be matched to an administrative data source.
With this situation in mind, two broad strategies have been investigated. The first strategy is to consider the effects of stopping the survey completely, and instead applying a model derived from past data in a unit modelling and imputation approach. In this case, administrative data will not be available for all units where predictions are required. The second strategy is the implementation of cut-off sampling, in an effort to reduce the burden placed on the smallest businesses. In this scheme, the collection of purchases of goods and services for resale via a survey would be maintained for those businesses that fell above the cut-off. For those businesses falling below the cut-off, some will fall in the matched part, meaning related administrative data are available, and some will fall outside this part, with no related administrative data available. We now discuss both of these strategies in detail.
3.1 Unit modelling and imputation
The unit modelling and imputation approach attempts to estimate purchases of goods for resale for all units in the survey population. This is achieved by using a combination of modelled predicted values of the purchases variable for units matched to administrative data sources, and imputation techniques for the remainder of the survey population. The modelled predicted values are based on developing unit level linear models for the ABS purchases variable using VAT turnover and expenditure variables. By definition, this is only possible for those businesses that appear in both the ABS sample and the VAT data set in a given period. For all other businesses, values of purchases of goods for resale are estimated using available imputation techniques. Having estimated the purchases variable for every unit in the population, either by modelling or imputation, survey outputs can be created by simply summing unit values up to required aggregates.
The modelled predicted values used in this approach were already available, from a separate UK study. The method for creating them is briefly summarised here. Predicted values were created by modelling VAT data with previous ABS survey data for the periods available. The idea was to develop models using previous period data and then create predicted values by applying the coefficients resulting from those models to current period VAT data. The main modelling technique involved fitting linear regression models to the VAT and ABS data. As well as VAT turnover and expenditure, other covariates such as 2 digit NACE industry, size-band and region were tested in the models. The models were all tested using the usual range of model statistics and diagnostic checks. Data splitting was also used to check that individual models were robust. This involved randomly selecting half the data, fitting the model, before applying the model to the remaining half of the data.
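The data-splitting check described above can be illustrated with a minimal sketch. It assumes a single covariate and a simple least-squares line, whereas the models in the UK study used several covariates; the function names `fit_line` and `split_half_check`, the seed and the use of root mean squared error as the hold-out measure are assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def split_half_check(xs, ys, seed=0):
    """Fit on a random half of the data and score on the other half.

    Returns the fitted intercept and slope, and the root mean squared
    prediction error on the held-out half.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    half = len(indices) // 2
    fit_idx, check_idx = indices[:half], indices[half:]
    intercept, slope = fit_line([xs[i] for i in fit_idx],
                                [ys[i] for i in fit_idx])
    errors = [ys[i] - (intercept + slope * xs[i]) for i in check_idx]
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return intercept, slope, rmse
```

A model that is robust in this sense should give a hold-out error of a similar size to the error on the half used for fitting.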
The most common imputation method for business surveys is ratio imputation, where the previous response for the business is multiplied by an average growth within an imputation class. However, the fact that the predicted values were only available for a small subset of the population meant that for most units there was no previous predicted value (or even previous survey value) available to use for this method. Another common imputation method is donor imputation. However, donor imputation relies on the availability of multiple correlated auxiliary variables to work well. Such variables were not available in this case. For these reasons, the only feasible imputation method involved taking averages of the predicted values within imputation classes (defined as the cross-classification of 4 digit NACE and employment size-band). Both trimmed means and medians were tested for this purpose. The trimmed means were created by removing the top and bottom 10% of predicted values in all imputation classes of size greater than 10.
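The class-average imputation described above can be sketched as follows. This is a hedged illustration: the data layout (dictionaries with hypothetical keys `cls` and `pred`), the function names, and the choice to round the number of trimmed values down are assumptions, not details taken from the study.

```python
from statistics import mean, median

def trimmed_mean(values, trim=0.10, min_size=10):
    """Mean after dropping the top and bottom `trim` share of values.

    Classes with `min_size` or fewer values are left untrimmed, following
    the rule that trimming applies only to classes of size greater than 10.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n > min_size:
        k = int(n * trim)  # values dropped from each tail, rounded down (assumption)
        ordered = ordered[k:n - k]
    return mean(ordered)

def impute_class_average(units, use_median=False):
    """Fill missing predicted values with the average of their imputation class.

    `units` is a list of dicts with keys 'cls' (imputation class, e.g. a
    NACE-by-size-band label) and 'pred' (predicted value, or None).
    A class with no predicted values at all raises KeyError in this sketch.
    """
    by_class = {}
    for unit in units:
        if unit["pred"] is not None:
            by_class.setdefault(unit["cls"], []).append(unit["pred"])
    average = median if use_median else trimmed_mean
    class_value = {c: average(v) for c, v in by_class.items()}
    return [u["pred"] if u["pred"] is not None else class_value[u["cls"]]
            for u in units]
```

Both averages are resistant to a few extreme predicted values, which matters because the imputed value is applied to every unmatched unit in the class.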
3.2 Cut-off Sampling
Within the cut-off sampling framework, three methods have been tested to see if the accurate estimation of purchases of goods and services for resale is possible based on the available administrative data. The first uses a simple ratio adjustment to estimate for domains below the cut-off, using sample data from above the cut-off. The second approach is an extension of the first, and uses unit level modelling to predict purchases of goods and services for resale for the survey universe below the cut-off, again by making use of data above the cut-off. The third method aims to correct for the bias introduced by taking a cut-off sample, through the generalised calibration estimation methods described by Haziza et al (2010).
3.2.1 Notation
A variable $x_1$ is used to define the cut-off. A business belongs to $U_c$ if it falls below the cut-off $c$, whereas it belongs to $U_m$ if $x_1$ is in the range $c$ to $m$, where $m$ denotes an upper boundary separating those units used to estimate for $U_c$ from the remainder of the universe, $U_e$. In this example, $U_e$ represents the largest businesses, which are completely enumerated:
$$U = U_c \cup U_m \cup U_e \qquad (1)$$
The population total $Y$ is therefore defined as:
$$Y = Y_c + Y_m + Y_e \qquad (2)$$
where, for $a$ equal to $c$, $m$ or $e$:
$$Y_a = \sum_{k \in U_a} y_k \qquad (3)$$
3.2.2 Simple ratio adjustment
The simple ratio adjustment for estimating for units below the cut-off is (Särndal et al 1992:532)
$$\hat{Y}_c = X_c \, \frac{\hat{Y}_m}{\hat{X}_m} \qquad (4)$$
where $X_c$ is the total of the auxiliary variable in the survey universe below the cut-off, $\hat{X}_m$ is an estimate of the total of the auxiliary variable in the survey universe above the cut-off and $\hat{Y}_m$ is an estimate of total purchases of goods and services for resale in the survey universe above the cut-off. We refer to $\hat{Y}_c$ as the ratio-adjusted estimate. This method was investigated using both VAT turnover and company accounts variables as the auxiliary variable. However, there were some differences in the exact methodology applied for these two data sources. The cut-off was set using the employment variable held on the business register.
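As a minimal sketch of the ratio adjustment, assume the estimator takes the usual ratio form: the known auxiliary total below the cut-off multiplied by the ratio of estimated purchases to the estimated auxiliary total above the cut-off. The function names and the use of design weights to form the two estimated totals are assumptions made for the example.

```python
def weighted_total(values, weights):
    """Weighted estimate of a population total from sample values."""
    return sum(w * v for v, w in zip(values, weights))

def ratio_adjusted_estimate(x_below, y_sample, x_sample, weights):
    """Estimate total purchases for the universe below the cut-off.

    x_below  : auxiliary values (e.g. VAT turnover) for every unit below
               the cut-off, taken from the administrative source
    y_sample : purchases reported by sampled units above the cut-off
    x_sample : auxiliary values for the same sampled units
    weights  : design weights of the sampled units (an assumption here)
    """
    x_c = sum(x_below)                          # known auxiliary total below cut-off
    y_hat = weighted_total(y_sample, weights)   # estimated purchases above cut-off
    x_hat = weighted_total(x_sample, weights)   # estimated auxiliary total above cut-off
    return x_c * y_hat / x_hat

# Hypothetical figures: purchases are one tenth of turnover above the cut-off,
# so the same ratio is carried over to the 60 units of turnover below it.
estimate = ratio_adjusted_estimate([10, 20, 30], [5, 10], [50, 100], [2, 2])
```

The method rests on the ratio of purchases to the auxiliary variable being the same below the cut-off as above it, which is why the low correlations reported in section 2 are a concern.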
With the VAT turnover data, investigations were made primarily for the matched part of the survey universe without treatment for the missing values. Investigations were also carried out for the full survey universe assuming that missing values take the mean value in a stratum, calculated including zero values. In these analyses, the position of the cut-off was allowed to alter, to minimise the difference between the original survey-based estimate and the estimate using the simple ratio adjustment method. This method was tested using survey data from the years 2004 to 2008 (inclusive), which involved comparing the ratio-adjusted estimate with the original survey-based estimate.
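The two simple treatments of missing auxiliary values considered in this report (a true zero, or the stratum mean calculated including zero values) can be sketched as below. The function name `fill_missing`, the use of `None` to mark a missing value and the parallel-list layout are assumptions made for the illustration.

```python
def fill_missing(values, strata, method="mean"):
    """Replace missing auxiliary values (None) by zero or by the stratum mean.

    values : list of auxiliary values, with None where the value is missing
    strata : list of stratum labels, parallel to `values`
    method : "zero" treats a missing value as a true zero;
             "mean" uses the mean of the observed values in the stratum,
             with observed zeros included in that mean.
    A stratum with no observed values raises KeyError under "mean".
    """
    if method == "zero":
        return [0.0 if v is None else v for v in values]
    if method != "mean":
        raise ValueError("method must be 'zero' or 'mean'")
    sums, counts = {}, {}
    for v, s in zip(values, strata):
        if v is not None:
            sums[s] = sums.get(s, 0.0) + v
            counts[s] = counts.get(s, 0) + 1
    means = {s: sums[s] / counts[s] for s in sums}
    return [means[s] if v is None else v for v, s in zip(values, strata)]
```

For example, a stratum with observed values 0 and 10 gives an imputed value of 5, because the zero return is kept in the mean.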
The poor match rate (compared to the VAT data) and the high level of missing data led to a more comprehensive treatment of the missing data when testing this method with company accounts data. Variations in the method have been tested in an effort to address this non-response in the auxiliary variables, and undercoverage of the survey universe. The method was tested with both the full survey universe and for just the matched part. These estimates were made under two assumptions regarding missing data; first, that missing data is a true zero, and second, assuming that missing values are replaced by the mean value for the stratum. The estimator given by equation 4 was also re-calculated using the total of the auxiliary variable available rather than its estimate, to make maximum use of the available data. The estimators are described in detail in Annex 1. For both the company accounts data and the VAT turnover data, the methods were evaluated via bootstrap re-sampling.
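The report does not give the details of the bootstrap used for the evaluation, so the following is only a sketch of one simple possibility: sampled units above the cut-off are resampled with replacement, ignoring stratification, and the ratio-adjusted estimate is recomputed for each replicate. The function name, the tuple layout and the number of replicates are assumptions.

```python
import random

def bootstrap_ratio_estimates(sample, x_c, n_reps=500, seed=1):
    """Bootstrap replicates of a ratio-adjusted estimate.

    sample : list of (y, x, weight) tuples for sampled units above the cut-off
    x_c    : known auxiliary total for the universe below the cut-off
    Units are resampled with replacement as a single group; a stratified
    design would resample within strata (not done in this sketch).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        resample = [rng.choice(sample) for _ in sample]
        y_hat = sum(w * y for y, x, w in resample)
        x_hat = sum(w * x for y, x, w in resample)
        estimates.append(x_c * y_hat / x_hat)
    return estimates
```

The spread of the replicate estimates gives an indication of the variability of the ratio-adjusted estimate, which can be set against its difference from the survey-based estimate.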
3.2.3 Unit level modelling
An extension of the simple ratio adjustment method is to fit a linear regression model with multiple covariates to predict purchases of goods for resale for the survey universe below the cut-off. Data from sampled units above the cut-off were used to produce a suitable model. We considered a case where businesses in the smallest employment size-band were no longer sampled, and used the data from the remaining sampled employment size-bands to produce a model for the purchases of goods for resale. The variable was found to show a positively skewed distribution, so a logarithmic transformation was performed, and a model was fitted to the transformed variable. A linear regression model was fitted via least squares to the logarithm of purchases of goods as a linear combination of the available administrative data variables. The model could only be fitted to those units with positive returned values of purchases, because the logarithm of zero is undefined. Due to the large proportion of zero returns in the variable purchases of goods for resale, a logistic regression model was also fitted to predict the probability that a unit would give a positive return. These predicted probabilities were then multiplied by the predictions from the linear model to determine the final prediction.
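The two-part prediction described above (probability of a positive return multiplied by the predicted size of a positive return) can be sketched with a single covariate. Several simplifications are assumed: one hypothetical covariate `x` instead of the full set of register and VAT variables, a Newton-Raphson fit written out by hand, and a naive exponential back-transformation in place of the one given in Annex 2, which is not reproduced here. The data are invented.

```python
import math

def fit_ols(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def prob_positive(x, params):
    """Logistic probability of a positive return for covariate value x."""
    a, b = params
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def fit_logistic(xs, zs, n_iter=25):
    """Newton-Raphson fit of a logistic regression with one covariate."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            p = prob_positive(x, (a, b))
            w = p * (1.0 - p)
            g0 += z - p            # score for the intercept
            g1 += (z - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def fit_two_part(xs, ys):
    """Fit the logistic model on all units and the log-linear model on positives."""
    zs = [1 if y > 0 else 0 for y in ys]
    logit_params = fit_logistic(xs, zs)
    positive = [(x, math.log(y)) for x, y in zip(xs, ys) if y > 0]
    linear_params = fit_ols([x for x, _ in positive], [ly for _, ly in positive])
    return logit_params, linear_params

def predict(x, logit_params, linear_params):
    """Final prediction: P(positive) times the back-transformed linear prediction.

    exp() is a naive back-transformation used only for illustration.
    """
    a, b = linear_params
    return prob_positive(x, logit_params) * math.exp(a + b * x)

# Invented data: x is a hypothetical size measure, y is purchases for resale.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 12, 0, 30, 45, 0, 90]
logit_params, linear_params = fit_two_part(xs, ys)
predictions = [predict(x, logit_params, linear_params) for x in xs]
```

Multiplying the two components gives every unit a small positive prediction rather than an exact zero, so the method is aimed at estimating totals rather than reproducing individual returns.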
Variables from the business register (region, industry, employment and turnover) and VAT turnover were tested in the model, and their significance was assessed via type III sums of squares and their effect on the resulting $R^2$. A back transformation (see Annex 2) was applied to the results of the linear modelling to transform the response variable back to the original scale. In the logistic regression, the sensitivity (the percentage of positive values identified as such, based on a boundary probability equal to the proportion of positive values in the original data), the specificity (the percentage of zero values identified as such, based on the same boundary probability) and the overall percentage of values correctly identified were used to directly compare models. A locally weighted regression ('loess') smooth of a plot of the deviance (or Pearson) residuals against the predicted probabilities was also used to look for model inadequacies (Kutner et al, 2005).
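The classification measures used to compare the logistic models can be computed as in the sketch below. It uses the standard definitions (sensitivity as the share of positive returns classified as positive, specificity as the share of zero returns classified as zero) and takes the boundary probability to be the proportion of positive values in the data. Treating a predicted probability equal to the boundary as a positive classification is an assumption, as is the function name.

```python
def classification_rates(probabilities, observed_positive):
    """Sensitivity, specificity and overall accuracy of a fitted logistic model.

    probabilities     : predicted probabilities of a positive return
    observed_positive : 1 if the unit returned a positive value, else 0
    The data must contain at least one positive and one zero return,
    otherwise one of the rates is undefined (division by zero).
    """
    n = len(observed_positive)
    threshold = sum(observed_positive) / n  # proportion of positives in the data
    true_pos = true_neg = 0
    for p, z in zip(probabilities, observed_positive):
        predicted_positive = p >= threshold
        if z == 1 and predicted_positive:
            true_pos += 1
        elif z == 0 and not predicted_positive:
            true_neg += 1
    n_pos = sum(observed_positive)
    sensitivity = true_pos / n_pos
    specificity = true_neg / (n - n_pos)
    accuracy = (true_pos + true_neg) / n
    return sensitivity, specificity, accuracy
```

Using the observed proportion of positives as the boundary, rather than 0.5, avoids a model appearing accurate simply by classifying every unit into the more common group.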
3.2.4 Generalised calibration estimation
The introduction of cut-off sampling leads to bias in the estimates (Särndal et al, 1992:531). This may not necessarily be a concern if the bias introduced is smaller than the standard error of the survey-based estimate. Haziza et al (2010) present generalised calibration estimation as one method for reducing the bias introduced by cut-off sampling. The method is described in more detail by Kott (2006), who builds on the framework of calibration weighting introduced by Deville & Särndal (1992). Generalised calibration attempts to correct for the bias in a cut-off sample by using two vectors of auxiliary variables in the estimation. The first, denoted $x$, is an auxiliary variable that is related to the variable of interest, and the second, denoted $z$, explains fully whether a unit falls above or below the cut-off. Haziza et al (2010) performed a simulation study which showed generalised calibration estimation to be effective at reducing the bias in the case where there was no residual relationship between the probability of being above the cut-off and the $x$ variable, after conditioning on the $z$ variable. We consider a simple case where these vectors each contain only a single auxiliary variable. The choice of the auxiliary variables $x_i$ and $z_i$ is discussed in section 4.4.
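For the single-variable case, one simple form of generalised calibration can be sketched as follows. The report does not state the functional form of the weight adjustment, so the linear form used here, in which each design weight is multiplied by one plus a constant times the unit's z value, is an assumption, as are the function names. The constant is chosen so that the weighted sample total of x reproduces the known population total of x, including the part below the cut-off.

```python
def generalised_calibration_weights(d, x, z, x_total):
    """Calibrated weights w_k = d_k * (1 + gamma * z_k) for a cut-off sample.

    d       : design weights of sampled units above the cut-off
    x       : calibration variable, related to the variable of interest
    z       : variable explaining whether a unit is above the cut-off
    x_total : known total of x over the whole survey universe
    gamma solves sum(w_k * x_k) = x_total (linear adjustment: an assumption).
    """
    dx = sum(dk * xk for dk, xk in zip(d, x))
    dzx = sum(dk * zk * xk for dk, zk, xk in zip(d, z, x))
    gamma = (x_total - dx) / dzx
    return [dk * (1.0 + gamma * zk) for dk, zk in zip(d, z)]

def calibrated_total(y, weights):
    """Estimate of the population total of y using calibrated weights."""
    return sum(w * yk for w, yk in zip(weights, y))

# Invented figures: three sampled units above the cut-off, and a known
# universe total of x that is larger than the weighted sample total.
d = [2.0, 2.0, 1.0]
x = [10.0, 20.0, 40.0]
z = [1.0, 2.0, 4.0]
weights = generalised_calibration_weights(d, x, z, x_total=150.0)
```

The calibrated weights are larger than the design weights here because the sample above the cut-off has to account for the units below it as well.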