The suitability of alternative survey-derived small area statistics in international comparisons: findings from the Eurarea project
Patrick Heady and Martin Ralphs
Spatial Analysis and Modelling Branch (Research)
Methodology Group,
Office for National Statistics
1 Drummond Gate, London SW1V 2QQ.
Abstract
In the use of survey-derived statistics for small areas (using model-assisted and model-based methods) the predominant focus has been on alternative methods’ ability to provide estimates for individual small areas. The quality of the estimates for a set of areas has generally been thought of in terms of the average value of such area-specific measures as bias or mean square error.
Recently, however, there has been growing interest in estimators which conserve key properties of the overall set of area characteristics. Perhaps the most important characteristic of this kind is how well the estimators represent the true extent of geographical inequality. In general “design-based” estimators can be expected to overstate the extent of geographical inequality, while model-based estimators are likely to understate it. A consequence of this is that comparisons of spatial inequality between different states may be powerfully affected by differences in the small area estimation methods they use.
The Eurarea project, carried out by national statistical institutes (NSIs) and academics, used simulation methods, based on simulated samples from 100-percent census and register databases for six European countries, to measure the practical performance of a range of small area estimators. These included direct estimators, GREGs, regression synthetic estimators, composite estimators, and estimators that borrowed strength from patterns of temporal and spatial auto-correlation.
In this paper, we use some of these results to illustrate the performance of different types of estimators for estimating area-specific totals and explore the extent of the distortion to spatial distributions that can occur from using small area estimation methods that are optimal in other respects, relating this particularly to the issue of international comparisons of spatial inequality.
Introduction
The provision of high quality small area statistics is a growing priority for European governments, primarily so that resource allocation can be optimally directed to tackle problems such as poor housing and health, unemployment and low pay. A problem arises because the availability of small area statistics for key variables of interest is limited by the cost of data collection. It is prohibitively expensive to carry out detailed local surveys with comprehensive coverage for very small areas, and while sample surveys are carried out to collect information about key themes of local interest, these are designed primarily for efficient estimation at national level.
Small area estimation techniques (henceforth abbreviated to SAE) can help to overcome these problems by using local sample information, usually coupled with ancillary data from secondary sources such as population censuses or administrative data, to provide area-specific estimates with higher precision than a direct estimate based solely on the sample. The Eurarea project (see Heady and Hennell, 2001, for a full description) has investigated the performance of standard and innovative methods for SAE in the European context, with the objective of providing advice to Eurostat and to European NSIs on the appropriate use of SAE methods in the context of official statistics. The full range of results obtained, including methodological findings, specimen programs and recommendations regarding statistical policy, will be published in the Eurarea project reference volume and made available on the project website later this year[1].
The most significant experimental capability developed within Eurarea has been the facility to evaluate estimator performance through large-scale simulations based on population register and census datasets in six different European countries. Repeated samples are drawn in realistic ways from these bases, estimation strategies applied to them and the results compared with the true values contained within the databases across many replications. This has enabled us to assess empirically the predictive power of the estimators and the reliability of their in-built error prediction procedures.
In this paper, we introduce small area estimation approaches and use some of the empirical results produced by the Eurarea Consortium to illustrate the performance of several commonly applied SAE methods in respect of two key user requirements: the provision of reliable area-specific estimates and, of particular concern to policy makers responsible for resource allocation, the modelling of the distribution of area values. We move on to consider the implications of our findings in the light of these two objectives.
Small area estimation methods
Since the most pressing issues facing European NSIs at the moment are whether to use model-based approaches at all, and how to make the best use of the data provided by the different national statistical systems, it was appropriate to concentrate on relatively straightforward methods in Eurarea. The SAE methods that we present here are therefore not comprehensive. Rao (2003) and Pfeffermann (2002) present comprehensive reviews of the methods that are available and the interested reader is directed to these references for more information.
In Eurarea, our primary interest was in comparing the effectiveness of estimators drawn from both the design-based and model-based families (Särndal 1984) and in considering how performance was affected by a range of external factors such as sampling methods, the sizes of the small areas for which estimates were required and the treatment of both binary and continuous target variables. For the purposes of this paper, we introduce four of the basic estimators from Eurarea. Our main objective is to demonstrate how the predictive power of our estimation can be enhanced through the deployment of different approaches, and to illustrate the limitations that apply when we use particular SAE strategies based on design-based and model-based approaches or combinations of the two.
The types of estimate produced by design-based and model-based estimation procedures differ fundamentally. In the case of a design-based estimator, the estimate produced is unique to each individual small area under consideration. The estimate is unbiased for that area, in the sense that, under repeated sampling, the mean of successive estimates will tend towards the true value. For model-based estimators, the situation is somewhat different. A model-based estimator utilises ancillary information to produce an estimate of the target variable that is applicable to all small areas that share similar characteristics. Thus, if two small areas have exactly the same ancillary information, exactly the same estimate will be produced for each by the model-based procedure. Unlike the situation with design-based estimators, the discrepancies between the true value for a particular area and the model-based estimates generated from successive samples will not tend to average out over the long run.
The estimators
We considered the performance of four basic estimator types, which we define and discuss below. We use the following standard notation in all our equations:
1. Y denotes the survey variable of interest; X denotes ancillary data;
2. Lowercase letters refer to sample statistics and uppercase to population statistics;
3. Indices i and d refer to individuals and small areas (domains), respectively;
4. w refers to the sampling weights of individuals, i.e. the inverses of their sample inclusion probabilities ($w_{id}$ is the weight for individual i in area d);
5. n is the sample size and N is the population size;
6. s refers to the sample;
7. A bar above a variable refers to the mean, e.g. $\bar{y}_d$ is the sample mean of y for area d.
8. A hat above a variable refers to an estimate, e.g. $\hat{\bar{Y}}_d$ is an estimate of $\bar{Y}_d$.
9. u and e refer to area- and unit-level random effects.
10. In order to simplify the notation we sometimes use to refer to an area-specific quantity. Thus we might write for , and for .
11. $I(\cdot)$ is an indicator variable, taking the value 1 if the condition inside the bracket applies to area d, and 0 if it does not.
1. Direct Estimator
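The direct estimator is defined as the $\pi^{-1}$-weighted Horvitz-Thompson estimator for each area (Särndal et al., 1992), where $\pi$ is the probability of inclusion in the sample. This design-based estimator is the local weighted average value of the target variable for sampled units in each area and is given by the formula:
$$\hat{\bar{Y}}_d^{DIRECT} = \frac{1}{\hat{N}_d}\sum_{i \in s_d} w_{id}\, y_{id}, \quad \text{where} \quad \hat{N}_d = \sum_{i \in s_d} w_{id},$$
with $s_d$ the sample in area d and $w_{id} = 1/\pi_{id}$.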
In practice, the direct estimator is highly vulnerable to sample size and coverage, and can only be computed for areas which are sampled. We include it here to provide a benchmark against which we can compare the performance of more sophisticated model-based and composite approaches.
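As an illustration, the direct estimator for a single area might be sketched in Python as follows. This is a minimal sketch, assuming the weights are the inverses of the inclusion probabilities and that the estimator takes the weighted-mean form; the function name `direct_estimate` and the toy data are hypothetical and are not taken from the Eurarea programs.

```python
import numpy as np

def direct_estimate(y, w):
    """Weighted sample mean of y for one area (w = 1 / inclusion probability).

    Returns NaN for an unsampled area, where no direct estimate exists.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    if y.size == 0:
        return float("nan")
    n_hat = w.sum()  # estimated population size of the area
    return float((w * y).sum() / n_hat)

# Toy example (invented values): three sampled units with unequal weights.
est = direct_estimate([10.0, 20.0, 30.0], [1.0, 1.0, 2.0])
print(est)  # (10 + 20 + 60) / 4 = 22.5
```

The NaN returned for an empty sample reflects the point made in the text: a direct estimate cannot be computed for an unsampled area.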
2. Generalised Regression Estimator (GREG):
The design-based GREG (Generalised REGression estimator) is obtained by adjusting the direct estimator for an area for differences between the sample and population area means of covariates. The adjustments are calculated by using a model relating y and X. As a standard, the ordinary regression model is used and this has been applied in Eurarea. The formula for the GREG estimator is:
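$$\hat{\bar{Y}}_d^{GREG} = \frac{1}{\hat{N}_d}\sum_{i \in s_d} w_{id}\, y_{id} + \left(\bar{X}_d - \frac{1}{\hat{N}_d}\sum_{i \in s_d} w_{id}\, x_{id}\right)^T \hat{\beta}$$
where $\bar{X}_d$ is a vector of p population mean covariates for area d, $\hat{N}_d = \sum_{i \in s_d} w_{id}$ is the sum of the sampling weights in the area, and $\hat{\beta}$ is the vector of estimated regression coefficients.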
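The adjustment described above can be sketched in Python. This is an illustrative sketch only: it assumes the regression coefficients have already been fitted elsewhere (for example by ordinary least squares on the whole sample), and the function name `greg_estimate` and all input values are hypothetical.

```python
import numpy as np

def greg_estimate(y, x, w, x_pop_mean, beta):
    """GREG for one area: direct estimate of y plus a regression adjustment.

    y: (n,) sampled values; x: (n, p) sampled covariates; w: (n,) weights;
    x_pop_mean: (p,) known population covariate means for the area;
    beta: (p,) regression coefficients, assumed to be fitted elsewhere.
    """
    y, x, w = (np.asarray(a, dtype=float) for a in (y, x, w))
    x_pop_mean = np.asarray(x_pop_mean, dtype=float)
    beta = np.asarray(beta, dtype=float)
    y_direct = (w * y).sum() / w.sum()                 # weighted sample mean of y
    x_direct = (w[:, None] * x).sum(axis=0) / w.sum()  # weighted sample means of x
    return float(y_direct + (x_pop_mean - x_direct) @ beta)

# Toy example: the sample under-represents the covariate (sample mean 2, population mean 3).
est = greg_estimate(y=[2.0, 4.0], x=[[1, 1], [1, 3]], w=[1.0, 1.0],
                    x_pop_mean=[1.0, 3.0], beta=[0.0, 1.0])
print(est)  # direct estimate 3.0 plus adjustment (3 - 2) * 1 = 4.0
```

When the sample covariate means happen to equal the population means, the adjustment vanishes and the GREG coincides with the direct estimate.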
3. Area-level Synthetic Estimator:
Synthetic estimators assume a model that describes the relationship between the target variable y and set of ancillary data X. Through this modelled relationship, the ancillary data can be used to predict the mean of y for all target areas. The estimator and its variance are developed on the assumption that the model used accurately describes the population. If the model can draw sufficient power from the available ancillary data, the method can provide substantial gains over direct estimates which rely solely upon survey data. In practice, area models have been used more extensively than other synthetic estimators for SAE work.
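In this example, a linear model with area-level covariates is fitted to the sample area means of the target variable. The model is $y_{id} = \bar{X}_d^T\beta + u_d + e_{id}$, and the estimator is $\hat{\bar{Y}}_d^{SYNTH} = \bar{X}_d^T\hat{\beta}$, where $u_d$ and $e_{id}$ are independent variables with mean 0 and variances $\sigma_u^2$ and $\sigma_e^2$. The variance term $\sigma_u^2$ is the main component of the MSE.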
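A simplified Python sketch of an area-level synthetic estimator follows. It assumes, purely for illustration, that the coefficients are obtained by ordinary least squares on the sampled-area means, ignoring the variance components that a full two-level fit would use; the function names and data are hypothetical.

```python
import numpy as np

def fit_area_level_model(ybar_sampled, X_sampled):
    """OLS fit of sampled-area means on area-level covariates.

    X_sampled must already contain an intercept column if one is wanted.
    """
    beta, *_ = np.linalg.lstsq(np.asarray(X_sampled, dtype=float),
                               np.asarray(ybar_sampled, dtype=float),
                               rcond=None)
    return beta

def synthetic_estimates(X_all, beta):
    """Model predictions for every area, sampled or not."""
    return np.asarray(X_all, dtype=float) @ beta

# Toy data: three sampled areas, four target areas (the last one unsampled).
X_sampled = [[1, 0], [1, 1], [1, 2]]
ybar_sampled = [1.0, 3.0, 5.0]
beta = fit_area_level_model(ybar_sampled, X_sampled)

X_all = [[1, 0], [1, 1], [1, 2], [1, 3]]
estimates = synthetic_estimates(X_all, beta)
print(beta, estimates)
```

Because the prediction depends only on the covariates, two areas with identical ancillary data receive identical estimates, and an unsampled area still receives an estimate.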
4. Composite Estimator:
Composite estimators attempt to improve performance by combining the strengths of synthetic and design-based estimators. The example given here is a weighted combination of the area-synthetic and direct estimators. The estimator is given by the following formula:
$$\hat{\bar{Y}}_d^{COMP} = \gamma_d\,\hat{\bar{Y}}_d^{DIRECT} + (1-\gamma_d)\,\hat{\bar{Y}}_d^{SYNTH}$$
where $\gamma_d = \hat{\sigma}_u^2 / (\hat{\sigma}_u^2 + \hat{\sigma}_e^2/n_d)$, $\hat{\sigma}_u^2$ and $\hat{\sigma}_e^2$ are the estimated area-level and unit-level variances, and $n_d$ is the sample size of area d. The gamma term is a weight based on the modelled variability of the areas and the sampling variability of the data collected in each area. It is used to adjust the contribution of the direct and synthetic components of the estimator. When sampling variability is high (and the reliability of the direct estimator is therefore questionable), the composite estimator is weighted in favour of its synthetic component. When the variance of the direct estimator is low, the estimator is weighted in favour of its direct component. In situations where an area is unsampled, the estimate will be based wholly on the synthetic component of the estimator.
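The weighting behaviour described here can be sketched in Python. The sketch assumes the common form of the weight, gamma = sigma2_u / (sigma2_u + sigma2_e / n_d), with the variance components supplied as already-estimated inputs; the function name and the numbers are hypothetical.

```python
def composite_estimate(direct, synthetic, n_d, sigma2_u, sigma2_e):
    """Weighted combination of the direct and synthetic estimates for one area.

    sigma2_u: between-area (model) variance; sigma2_e: unit-level variance.
    Both are assumed to have been estimated elsewhere.
    """
    if n_d == 0:
        # Unsampled area: no direct estimate, so rely wholly on the model.
        return synthetic
    gamma = sigma2_u / (sigma2_u + sigma2_e / n_d)
    return gamma * direct + (1.0 - gamma) * synthetic

# Toy values: with n_d = 4, sigma2_e / n_d equals sigma2_u, so gamma = 0.5.
print(composite_estimate(direct=10.0, synthetic=20.0, n_d=4,
                         sigma2_u=1.0, sigma2_e=4.0))    # 15.0
# A very large sample pushes gamma towards 1 (the direct estimate dominates).
print(composite_estimate(10.0, 20.0, n_d=4000, sigma2_u=1.0, sigma2_e=4.0))
# An unsampled area falls back on the synthetic estimate.
print(composite_estimate(float("nan"), 20.0, n_d=0, sigma2_u=1.0, sigma2_e=4.0))  # 20.0
```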
Estimating area-specific values
The primary goal of SAE is the precise estimation of area-specific parameters to produce optimal local estimates. We now consider how well each of the above estimators performs when predicting individual area values. For this we need some criterion of what counts as a good predictor. The criterion we adopt here is the minimisation of squared error loss for each area. A good estimate for area d is thus one which minimises $(\hat{\bar{Y}}_d - \bar{Y}_d)^2$.
The results we present are summaries derived from Eurarea simulation studies of estimator performance for three target variables: equivalised household income, proportion of single person households and the ILO-definition unemployment rate. The simulation process consisted of drawing samples, using approximations to the sampling designs that would be used in practice, applying the various estimation procedures, and comparing their estimates to the true values for all areas in the study data-set. The process was repeated a large number of times (typically 500).
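The structure of such a simulation can be sketched in Python. This is a toy illustration only: the "census" is generated at random, sampling is simple random sampling without replacement within each area (much simpler than the designs approximated in Eurarea), and only the direct estimator is applied. All names and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "census": D areas with N_d units each (invented values).
D, N_d, n_d, K = 5, 1000, 10, 500
area_effects = rng.normal(0.0, 1.0, size=D)
population = area_effects[:, None] + rng.normal(0.0, 1.0, size=(D, N_d))
true_means = population.mean(axis=1)

def draw_direct_estimates(population, n_d, rng):
    """Draw one sample in every area and return the area sample means."""
    return np.array([rng.choice(area, size=n_d, replace=False).mean()
                     for area in population])

# K replicates: rows are simulation runs, columns are areas.
estimates = np.array([draw_direct_estimates(population, n_d, rng)
                      for _ in range(K)])

# Compare the estimates with the known true values across replicates.
empirical_bias = estimates.mean(axis=0) - true_means
empirical_mse = ((estimates - true_means) ** 2).mean(axis=0)
print(empirical_bias.round(3))
print(empirical_mse.round(3))
```

In this toy setting the empirical bias of the direct estimator should be close to zero in every area, while its mean squared error reflects the small within-area sample size.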
This enabled us to produce an empirical summary of the MSE properties of each estimator, in the form of the average empirical mean squared error, which we define thus:
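$$\text{AEMSE} = \frac{1}{D}\sum_{d=1}^{D}\frac{1}{K}\sum_{k=1}^{K}\left(\hat{\bar{Y}}_d^{(k)} - \bar{Y}_d\right)^2$$
where $\hat{\bar{Y}}_d^{(k)}$ is the estimate of the target variable for area d in simulation k, $\bar{Y}_d$ is the true mean of Y for area d, D is the number of areas and K is the number of replicates in the simulation. The smaller this quantity is, the better the estimator has performed over the whole set of areas.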
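Computed from a table of simulated estimates, this summary might look as follows in Python. The sketch assumes the estimates are held in an array with one row per replicate and one column per area; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def average_empirical_mse(estimates, truth):
    """Average over areas of the per-area empirical MSE.

    estimates: (K, D) array, one row per simulation replicate;
    truth: (D,) array of true area means.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    per_area_mse = ((estimates - truth) ** 2).mean(axis=0)  # average over replicates
    return float(per_area_mse.mean())                        # average over areas

# Toy example: K = 2 replicates, D = 2 areas.
value = average_empirical_mse([[1.0, 2.0], [3.0, 2.0]], [2.0, 2.0])
print(value)  # area 1: (1 + 1) / 2 = 1; area 2: 0; average = 0.5
```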
In Figure 1 we summarise this information for NUTS5 areas in six European countries. We do not show the MSE results themselves in the graph; instead we show the mean value of the rank achieved by each estimator across all of the simulation runs, since this allows us to compare results from different countries concisely. The graph shows that model-based approaches (the synthetic and composite estimators) consistently exhibit improved MSE performance over the design-based estimators (the direct and GREG) for these small geographical areas.
Figure 1 – Estimator performance by mean rank based on average MSE across simulation runs at NUTS5 level. Lower ranks indicate better performance.
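The mean-rank summary can be sketched in Python as follows, assuming one row of average MSE values per simulation run and one column per estimator. The numbers are invented purely to show the calculation and are not Eurarea results; ties are not given any special treatment in this sketch.

```python
import numpy as np

def mean_ranks(mse):
    """Rank estimators within each run (1 = lowest MSE) and average the ranks.

    mse: (runs, estimators) array of average MSE values.
    """
    mse = np.asarray(mse, dtype=float)
    # Applying argsort twice converts values into 0-based ranks within each row.
    ranks = mse.argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)

estimators = ["Direct", "GREG", "Synthetic", "Composite"]
toy_mse = [[4.0, 3.0, 1.0, 2.0],   # run 1
           [5.0, 3.0, 2.0, 1.0]]   # run 2
for name, rank in zip(estimators, mean_ranks(toy_mse)):
    print(name, rank)
```

Working with ranks removes the scale of the target variable, which is what allows results for different variables and countries to share one graph.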
Estimator performance is influenced by a range of contributory factors. We will now consider some of the most significant ones and then discuss their impact upon the results that we achieve.
We have stated that design-based estimators are particularly vulnerable to sample size and coverage and that these estimates can only be produced for areas that contain a sample. This is a problem in small area estimation, since many national surveys feature clustered sampling designs which provide data for only a subset of the areas of interest. To produce stable estimates with an acceptable level of variability, these estimators require large samples. Once again, this is problematic, since sample sizes will typically be very small, particularly at geographical levels below NUTS3.
Model-based estimators are much less vulnerable to sample size than design-based estimators. Instead, their predictive power is reliant on the use of ancillary information in the form of a set of X variables (covariates) to which the sampled values of the target variable are related. In this case, it is the choice of an appropriate set of ancillary data that is of critical importance – such data must be available for all of the small areas for which estimates are required and, ideally, will be strongly related to the target variable for maximum effectiveness.
The main limitation of the model-based approach is estimator bias. Because synthetic estimators apply a globally fitted model consistently to all target areas, they tend to underestimate extreme values, “shrinking” these towards the global mean. Additionally, they may systematically underestimate or overestimate values for particular subtypes of area if the causes of variation in the target variable particular to these subtypes are not captured by the ancillary information in the model.
Estimating the distribution of area values
While the performance of estimators for particular areas is a relevant criterion when the estimates will be used to decide on resource allocation to particular areas, there are other policy applications for which it is more important that the set of estimates produced by SAE reflect the overall distribution of area values over the different areas in the country. This is important if the government wishes to assess the overall extent of geographic inequality for the variable concerned, or if the applications for funding by some higher-level institution (such as the European Community) depend on the number of areas in a country which fall below some specified threshold.
From this point of view, a reasonably good set of estimates might be one for which the empirical standard deviation of the true area values was close to the empirical standard deviation of the estimated area values – i.e. one for which
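$$\sqrt{\frac{1}{D}\sum_{d=1}^{D}\left(\hat{\bar{Y}}_d - \hat{\bar{Y}}_{\cdot}\right)^2} \approx \sqrt{\frac{1}{D}\sum_{d=1}^{D}\left(\bar{Y}_d - \bar{Y}_{\cdot}\right)^2} \qquad (1)$$
where
{$\bar{Y}_d$} is the set of true area values;
{$\hat{\bar{Y}}_d$} is a set of estimated area values generated by applying the estimator concerned on any one occasion;
$\bar{Y}_{\cdot}$ is the mean of the true values of the D areas, and $\hat{\bar{Y}}_{\cdot}$ is the corresponding mean of the estimated values.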
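A small Python simulation can illustrate how this criterion separates the estimator families. It is a toy sketch under strong assumptions: true area values depend linearly on one covariate plus an area effect, the "direct" estimates are the true values plus independent sampling noise, and the "synthetic" estimates are ordinary least squares predictions from the covariate. None of the numbers come from Eurarea.

```python
import numpy as np

def sd_ratio(estimates, truth):
    """Empirical SD of the estimated area values divided by that of the true values."""
    return float(np.std(estimates) / np.std(truth))

rng = np.random.default_rng(0)
D = 2000                                    # number of areas (toy value)
x = rng.normal(size=D)                      # one area-level covariate
truth = 2.0 * x + rng.normal(size=D)        # true area means: model part plus area effect
direct = truth + rng.normal(size=D)         # direct estimates: truth plus sampling noise

# Synthetic estimates: predictions from a regression of the direct estimates on x.
X = np.column_stack([np.ones(D), x])
beta, *_ = np.linalg.lstsq(X, direct, rcond=None)
synthetic = X @ beta

print("direct   :", round(sd_ratio(direct, truth), 3))
print("synthetic:", round(sd_ratio(synthetic, truth), 3))
```

In this setup the ratio for the direct estimates exceeds 1, because sampling noise adds spread, and the ratio for the synthetic estimates falls below 1, because the area effects are not reproduced by the model predictions.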
Of course the standard deviation does not fully specify the distribution, since two distributions with the same standard deviation could still have different shapes, in the sense of being differently skewed, or having a different degree of kurtosis. Ideally the empirical distribution functions of the two sets of estimates should resemble each other – i.e. the value of
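$$\int_{-\infty}^{\infty}\left(\hat{F}(t) - F(t)\right)^2 dt \qquad (2)$$
should be as small as possible, where
$$F(t) = \frac{1}{D}\sum_{d=1}^{D} I(\bar{Y}_d \le t) \quad \text{and} \quad \hat{F}(t) = \frac{1}{D}\sum_{d=1}^{D} I(\hat{\bar{Y}}_d \le t) \qquad (3)$$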
If the value of the integral were actually zero, i.e. if the empirical distribution functions of the true and estimated area values, $F$ and $\hat{F}$, were equal over their whole range, it would mean that any function calculated on the whole set of estimates, such as the proportion of areas for which $\hat{\bar{Y}}_d$ was below some critical value, or a measure of inequality such as the Gini coefficient, would take the same value as for the set of true values.
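For a finite set of areas, a distance of this kind between two empirical distribution functions can be computed exactly, because both functions are step functions. The Python sketch below assumes the distance is the integrated squared difference; that choice, the function names and the toy values are assumptions made for illustration.

```python
import numpy as np

def edf(values, t):
    """Empirical distribution function of `values`, evaluated at the points t."""
    values = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(values, t, side="right") / values.size

def integrated_squared_edf_distance(true_vals, est_vals):
    """Integral of (F_hat(t) - F(t))^2 over t for two sets of area values."""
    true_vals = np.asarray(true_vals, dtype=float)
    est_vals = np.asarray(est_vals, dtype=float)
    grid = np.sort(np.concatenate([true_vals, est_vals]))
    left = grid[:-1]            # both EDFs are constant on [grid[j], grid[j + 1])
    widths = np.diff(grid)
    diff = edf(est_vals, left) - edf(true_vals, left)
    return float((diff ** 2 * widths).sum())

# Toy example: estimates shrunk onto the mean of two unequal true values.
print(integrated_squared_edf_distance([0.0, 2.0], [1.0, 1.0]))  # 0.25 + 0.25 = 0.5
# Identical sets of values give a distance of zero.
print(integrated_squared_edf_distance([0.0, 2.0], [2.0, 0.0]))  # 0.0
```

The first example shows that a set of estimates can have the right mean and still lie far from the true distribution when all the variation between areas has been shrunk away.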
However, since the expressions given in (2) and (3) are rather intractable, we will base the analysis that follows on the comparison of standard deviations. Since the equivalence of the true and estimated standard deviations is a necessary condition for the equivalence of the two distributions, any difficulties revealed by the comparison of standard deviations will apply even more strongly to comparisons of the estimated and true distributions. Rather than being based on a single application of each estimation method, the comparisons are derived from the average over many simulations. I.e.