Empirical Likelihood approach to census coverage estimation
Ewa Kabzinska [1], Paul Smith [1], Yves G. Berger [1]
Keywords: Census, Census Coverage Survey, Dual System Estimator, Empirical Likelihood, estimation, survey, estimating equations, complex surveys
1.Introduction
A census usually misses a small proportion of the population. A Census Coverage Survey is carried out to estimate the census coverage and to improve census population size estimates. Census coverage in England and Wales is currently estimated using a ratio estimator, and symmetric confidence intervals are calculated from a jackknife variance estimator. A detailed explanation of the sampling design and the current approach to estimation is given in [1] and [2].
In some areas and demographic groups the census coverage is very high. This makes symmetric confidence intervals inconvenient, because their upper bound can exceed 1. We propose to use an empirical likelihood approach to obtain confidence intervals for census coverage. More specifically, we rely on the design-based empirical likelihood methodology proposed by [3]. We show how this can be applied to the problem of census coverage estimation. We also apply the proposed methodology to the 2011 census of England and Wales and reflect on the properties of the resulting confidence intervals. Finally, we assess the asymptotic performance of the proposed estimator in a series of Monte Carlo simulations.
2.Methods
2.1.Census Coverage Survey
The Census Coverage Survey (CCS) uses a stratified cluster sampling design. The primary sampling units are small geographical entities, called Output Areas, stratified by Local Authority and Hard to Count index. In each of the sampled Output Areas, a sample of postcodes is taken and full enumeration of households is attempted within the selected postcodes. The total number of households is measured and some additional household and person level information is gathered for each household.
A dual system estimate of the population size of each postcode is calculated using the following formula:
$$\widehat{N} = \frac{(n_{11} + n_{10})(n_{11} + n_{01})}{n_{11}},$$
where $n_{11}$ is the number of households enumerated both in the census and in the CCS, $n_{10}$ is the number of households enumerated in the census but not in the CCS, and $n_{01}$ is the number of households enumerated in the CCS but not in the census. The dual system estimates are adjusted to account for over-count, bias, sample imbalance and dependence between the census and the CCS.
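As a minimal sketch of the dual system calculation for a single postcode, assuming the standard dual system (Lincoln-Petersen) form, one could write the following. The function name and the argument names `n11`, `n10` and `n01` (both sources, census only, CCS only) are our own labels, the counts are made up, and none of the adjustments mentioned above are applied.

```python
def dual_system_estimate(n11: int, n10: int, n01: int) -> float:
    """Unadjusted dual system estimate of the number of households in a postcode.

    n11: households enumerated in both the census and the CCS
    n10: households enumerated in the census only
    n01: households enumerated in the CCS only
    """
    if n11 <= 0:
        raise ValueError("n11 must be positive for the estimate to be defined")
    census_count = n11 + n10  # everything the census found
    ccs_count = n11 + n01     # everything the CCS found
    return census_count * ccs_count / n11


# Hypothetical postcode: 45 matched households, 3 census only, 2 CCS only.
estimate = dual_system_estimate(45, 3, 2)
print(round(estimate, 3))
```

The estimate is slightly larger than the 50 distinct households observed, because it also allows for households missed by both sources.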
The second stage of estimation consists of estimating the overall census coverage for each Estimation Area, based on the adjusted dual system estimates obtained for the sampled postcodes. The current methodology estimates the ratio of the census population count to the dual system estimate by fitting a straight line to the observations. A simple ratio estimator is then used to estimate the population count within each Estimation Area.
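The second stage can be illustrated with a deliberately simplified sketch. It assumes an unweighted ratio of sums (the slope of a line through the origin under the usual ratio model) and ignores stratification, design weights and the adjustments described above. The function names and all numbers are hypothetical.

```python
def coverage_ratio(census_counts, dse_values):
    """Ratio of summed census counts to summed dual system estimates."""
    total_dse = sum(dse_values)
    if total_dse <= 0:
        raise ValueError("the dual system estimates must sum to a positive value")
    return sum(census_counts) / total_dse


def estimate_population(census_total, coverage):
    """Scale the census count for an area up by the estimated coverage."""
    return census_total / coverage


# Hypothetical sampled postcodes: census counts and dual system estimates.
census_counts = [48, 30, 52]
dse_values = [50.0, 32.0, 53.0]

coverage = coverage_ratio(census_counts, dse_values)
population = estimate_population(13000, coverage)  # 13000 is a made-up census total
print(round(coverage, 4), round(population))
```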
2.2.Empirical likelihood approach
Empirical Likelihood is a non-parametric, likelihood-based inference approach. The method was proposed by [4] and has been considerably extended since then (see [5] for a review). We follow the approach proposed by [3]. This approach allows us to construct Wilks-type confidence intervals directly from an approximation of the log-likelihood ratio function, i.e., without the need to calculate variance estimates or design effects. The confidence intervals are asymmetric, with the bounds depending on the shape of the distribution of the sample data. They are also range-preserving. The method accommodates complex, i.e., stratified and clustered, sampling designs.
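To illustrate how a Wilks-type interval arises, the sketch below implements the classical empirical likelihood interval for a mean under independent, identically distributed observations, in the spirit of [4]. It is not the design-based method of [3]: it ignores stratification, clustering and sampling weights. The data are made-up postcode-level coverage rates, and all function names are our own.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of the chi-square distribution, 1 d.f.


def neg2_log_el_ratio(x, mu, iters=200):
    """-2 log empirical likelihood ratio for the mean mu (i.i.d. case)."""
    z = [xi - mu for xi in x]
    z_min, z_max = min(z), max(z)
    if z_min >= 0 or z_max <= 0:
        return math.inf  # mu is not strictly inside the range of the data
    # The Lagrange multiplier must keep every term 1 + lam * z_i positive.
    lo = (-1.0 / z_max) * (1 - 1e-9)
    hi = (-1.0 / z_min) * (1 - 1e-9)
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        score = sum(zi / (1 + lam * zi) for zi in z)  # strictly decreasing in lam
        if score > 0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return 2 * sum(math.log(1 + lam * zi) for zi in z)


def el_confidence_interval(x, crit=CHI2_1_95, iters=100):
    """Bounds are the values of mu at which the statistic equals crit."""
    mean = sum(x) / len(x)

    def solve(outer, inner):
        # The statistic is above crit at `outer` and below crit at `inner`.
        for _ in range(iters):
            mid = 0.5 * (outer + inner)
            if neg2_log_el_ratio(x, mid) > crit:
                outer = mid
            else:
                inner = mid
        return 0.5 * (outer + inner)

    return solve(min(x), mean), solve(max(x), mean)


# Hypothetical coverage rates; several postcodes have full coverage.
coverage_rates = [0.99, 1.0, 0.97, 1.0, 0.92, 1.0, 0.98, 0.85, 1.0, 0.99, 0.96, 1.0]
lower, upper = el_confidence_interval(coverage_rates)
print(lower, upper)
```

Because the statistic is infinite outside the range of the data, both bounds stay strictly between the smallest and largest observation, which is the range-preserving property. With left-skewed data such as these, the lower bound lies further from the sample mean than the upper bound.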
3.Results
We applied the proposed empirical likelihood approach to data from the 106 estimation areas enumerated in the 2011 England and Wales census. For comparison, we also computed a ratio estimator and constructed symmetric confidence intervals based on three methods of variance estimation: linearisation, jackknife and bootstrap ([6]) with 100 replicates. Please note that the actual census coverage estimates produced by the Office for National Statistics utilise a complicated system of adjustments and the values reported in this paper are not representative of the official estimates.
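For comparison, a symmetric interval based on a jackknife variance estimator can be sketched as follows. This is a simplified delete-one jackknife for an unweighted ratio of sums, assuming a single stratum and no finite population correction. It is not the estimator used in production, and the counts are made up.

```python
import math


def ratio(census_counts, dse_values):
    """Unweighted ratio of summed census counts to summed dual system estimates."""
    return sum(census_counts) / sum(dse_values)


def jackknife_symmetric_ci(census_counts, dse_values, z=1.96):
    """Symmetric interval for the ratio from a delete-one jackknife variance."""
    n = len(census_counts)
    point = ratio(census_counts, dse_values)
    # Recompute the ratio with each sampled unit left out in turn.
    replicates = [
        ratio(census_counts[:i] + census_counts[i + 1:],
              dse_values[:i] + dse_values[i + 1:])
        for i in range(n)
    ]
    rep_mean = sum(replicates) / n
    variance = (n - 1) / n * sum((r - rep_mean) ** 2 for r in replicates)
    half_width = z * math.sqrt(variance)
    return point - half_width, point, point + half_width


# Hypothetical units: coverage is complete in most, but low in the last one.
census_counts = [50, 40, 60, 45, 30]
dse_values = [50.0, 40.0, 60.5, 45.0, 38.0]

ci_lower, ci_point, ci_upper = jackknife_symmetric_ci(census_counts, dse_values)
print(ci_lower, ci_point, ci_upper)  # the upper bound exceeds 1 for these data
```

With these made-up counts the point estimate is below 1 but the symmetric upper bound is not, which is the behaviour that motivates the range-preserving empirical likelihood interval.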
The empirical likelihood confidence intervals are of similar length to the symmetric confidence intervals. The symmetric confidence intervals tend to exceed 1 when the estimated coverage is high. The proposed empirical likelihood confidence intervals, however, are below 1 in all cases. For illustration, we present confidence intervals obtained for thirty-five age-sex groups in Cambridgeshire and Peterborough using all four methods (see Figure 1). This area has relatively high census coverage in several age-sex groups. The empirical likelihood confidence intervals in these groups are clearly asymmetric, with the upper bound very close to 1, though never above 1 (see, for example, groups 6 and 27).
Monte Carlo simulations based on synthetic data, not reported here for brevity, show that the empirical likelihood confidence intervals achieve acceptable coverage rates.
4.Conclusions
We present an empirical likelihood methodology for estimation of census coverage. The main benefit of the proposed approach is the range-preserving property of the asymmetric empirical likelihood confidence intervals. The confidence intervals obtained from the 2011 England and Wales census data are close to the symmetric confidence intervals and of comparable length, but they never exceed 1, which is a desirable feature.
Figure 1. Confidence intervals (c.i.) for age-sex groups in Cambridgeshire and Peterborough, by method: empirical likelihood c.i.; symmetric c.i. based on variance linearisation; symmetric c.i. based on jackknife variance estimation; symmetric c.i. based on bootstrap ([6]) variance estimation.
References
[1] Abbott, O. (2009), "2011 UK Census Coverage Assessment and Adjustment Methodology," Population Trends, (137), 25.
[2] Brown, J., Abbott, O., and Smith, P. A. (2011), "Design of the 2001 and 2011 Census Coverage Surveys for England and Wales," Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(4), 881-906.
[3] Berger, Y., and Torres, O. D. L. R. (2016), "Empirical likelihood confidence intervals for complex sampling designs," Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 1-23.
[4] Owen, A. B. (1988), "Empirical Likelihood Ratio Confidence Intervals for a Single Functional," Biometrika, 75(2), 237-249.
[5] Rao, J. N. K., and Wu, W. (2009), "Empirical Likelihood Methods," Handbook of Statistics: Sample Surveys: Inference and Analysis, D. Pfeffermann and C. R. Rao eds. The Netherlands (North-Holland), 29B, 189-207.
[1] The University of Southampton