Multilevel modelling of survey data under two-stage design: inference for regression parameter and small domain means by using an empirical likelihood approach

Melike Oguz-Alper[1] and Yves G. Berger[2]

Keywords: Design-based inference, generalised estimating equation, empirical likelihood, two-stage sampling, small domain estimation, unequal inclusion probability.

1. Introduction

Data used in the social, behavioural, health or biological sciences may have a hierarchical structure due to the population of interest or the sampling design. Multilevel or marginal models are often used to analyse such hierarchical data, or to estimate small domain means. Hierarchical sample data may be selected with unequal probabilities from a clustered and stratified population. The sample design is informative when the selection probabilities are associated with the study variable after conditioning on the model covariates. Ignoring informativeness may lead to invalid inference for the model parameters [1].

We propose using a design-based profile empirical likelihood approach to make inference for the regression parameters, which are defined as the solutions to generalised estimating equations, and for small domain means. This approach can be used for point estimation, hypothesis testing and confidence intervals. It provides asymptotically valid inference for the finite population parameters. In our numerical work, the confidence intervals for small domain means have observed coverages close to the nominal level. We consider a two-stage sampling design, where the first-stage units are selected with unequal probabilities. We assume that the model and the design have the same hierarchical structure [e.g. 1, 2, 3]. We consider an ultimate cluster approach [4], where the empirical likelihood function is defined at the ultimate cluster level.

2. Methods

Let be a finite population comprised of disjoint finite primary sampling units (PSUs) of sizes , with . Let be the sample of , selected with replacement with unequal probabilities [5] from , where . Let denote the fixed number of draws from . We assume that the sampling fractions are negligible. Let . The sample can also be a without-replacement set of units, because sampling with and without replacement are asymptotically equivalent when is negligible [6, p. 112]. Under sampling without replacement, denotes the inclusion probability of unit . Let be the sample of secondary sampling units (SSUs), of size , with , selected with conditional probabilities within the th selected at the first stage. We assume that the sizes of the , , are asymptotically bounded. Let be the values of a variable of interest and the vector of values of explanatory variables. The variables and are associated with the th unit within the th cluster, where and . Consider the multilevel model

where and are independent random variables with zero means and variances and , respectively.
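For orientation, a standard two-level random-intercept model consistent with this description can be written as below. The symbols used here ($y_{ij}$, $\boldsymbol{x}_{ij}$, $\boldsymbol{\beta}$, $u_i$, $e_{ij}$, $\sigma_u^2$, $\sigma_e^2$, $N_i$) are assumed notation for illustration only and may differ from the notation of the paper.

```latex
% Assumed notation: i indexes clusters, j indexes units within cluster i.
y_{ij} = \boldsymbol{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + e_{ij},
\qquad
\mathrm{E}(u_i) = \mathrm{E}(e_{ij}) = 0,
\quad
\mathrm{Var}(u_i) = \sigma_u^2,
\quad
\mathrm{Var}(e_{ij}) = \sigma_e^2 .

% Implied covariance matrix of the N_i responses in cluster i:
\mathrm{Var}(\boldsymbol{y}_i)
  = \sigma_e^2 \boldsymbol{I}_{N_i}
  + \sigma_u^2 \boldsymbol{1}_{N_i}\boldsymbol{1}_{N_i}^{\top} .
```

Under this assumed form, the within-cluster covariance matrix combines an identity matrix and a column vector of ones, which is the structure referred to in Section 2.1.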

2.1. The regression parameter of interest

The finite population parameter of interest is the solution to the population generalised estimating equation [e.g. 7] given by

where and , where [e.g. 8, p. 174], where is the identity matrix and is the column vector of ones. The sample weighted estimator is defined as the solution to [2, p. 4]

where and are the sample-based sub-matrices of and , which contain the observations of the sample , and is a sample-based estimator of [e.g. 8, p. 174].

2.2. Domain means of interest

Let be a domain of interest, where . Let if the unit in is within the domain and otherwise. Let denote the population size of . Under model (1), the population mean of is

where and . Assuming , the consistent finite population parameter is

Assuming known, a design-consistent estimator of the finite population parameter is the regression synthetic estimator [e.g. 8, p. 36] given by
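The idea behind a regression synthetic estimator can be illustrated with a minimal Python sketch. It rests on several simplifying assumptions that are not taken from the paper: a single covariate plus an intercept, a weighted least squares fit under working independence in place of the full estimating equation (2), and known domain means of the covariates. The function names and the toy data are hypothetical.

```python
def weighted_simple_regression(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x with sampling weights w.

    Assumption: working independence, i.e. the within-cluster covariance
    structure of the multilevel model is ignored in this sketch.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


def synthetic_domain_mean(beta, xbar_domain):
    """Regression synthetic estimate: inner product of the known domain
    covariate means (including 1.0 for the intercept) and the estimated
    regression coefficients."""
    return sum(b * xm for b, xm in zip(beta, xbar_domain))


# Hypothetical toy data: covariate, response and design weights.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
w = [2.0, 1.0, 4.0, 3.0]

beta_hat = weighted_simple_regression(x, y, w)
# Known domain covariate means: 1.0 for the intercept, 3.0 for x (assumed).
estimate = synthetic_domain_mean(beta_hat, (1.0, 3.0))
```

Because the estimate uses only the fitted coefficients and the known domain covariate means, it can be computed for a domain that contains no sampled units.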

2.3. Empirical likelihood approach

Consider the empirical log-likelihood function [9] given by , where the are unknown scale-loads allocated to the [10] and denotes the vector of . Let the maximize subject to the constraints and

where and . The maximum value of under and (5) is given by

Suppose that the parameter of interest is a sub-parameter of ; that is, where is a nuisance parameter. The profile empirical log-likelihood function is defined by . The maximum empirical likelihood estimator that maximizes is the solution to (2). Under some regularity conditions [11], we have that

in distribution with respect to the sampling design and can be used for testing hypotheses. Confidence intervals can be constructed based on (7), when is scalar.
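The way a chi-squared calibrated empirical likelihood ratio yields a confidence interval for a scalar parameter can be sketched in Python. The sketch below uses the classical empirical likelihood for a mean under independent observations, not the design-based empirical likelihood of [9]; it ignores the sampling design and nuisance parameters and is meant only to show how the interval is obtained by inverting the ratio statistic. All names are hypothetical.

```python
import math

CHI2_95_DF1 = 3.841459  # 0.95 quantile of the chi-squared distribution, 1 df


def el_ratio_stat(x, theta):
    """Classical empirical log-likelihood ratio statistic for the mean.

    Returns 2 * sum(log(1 + lam * (x_i - theta))), where lam solves
    sum((x_i - theta) / (1 + lam * (x_i - theta))) = 0.
    """
    z = [xi - theta for xi in x]
    zmin, zmax = min(z), max(z)
    if zmin >= 0 or zmax <= 0:
        return math.inf  # theta lies outside the convex hull of the data
    # lam must keep every 1 + lam * z_i positive.
    lo = (-1.0 / zmax) * (1 - 1e-12)
    hi = (-1.0 / zmin) * (1 - 1e-12)
    for _ in range(200):  # bisection: the score is decreasing in lam
        lam = 0.5 * (lo + hi)
        score = sum(zi / (1 + lam * zi) for zi in z)
        if score > 0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return 2 * sum(math.log(1 + lam * zi) for zi in z)


def el_confidence_interval(x, cutoff=CHI2_95_DF1):
    """Set of theta values whose ratio statistic does not exceed the cutoff."""
    xbar = sum(x) / len(x)

    def endpoint(inside, outside):
        for _ in range(100):  # bisection between the mean and a data extreme
            mid = 0.5 * (inside + outside)
            if el_ratio_stat(x, mid) <= cutoff:
                inside = mid
            else:
                outside = mid
        return inside

    return endpoint(xbar, min(x)), endpoint(xbar, max(x))


lower, upper = el_confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0])
```

The interval is found by searching for the values of the parameter at which the ratio statistic crosses the chi-squared quantile, so no variance estimate is needed.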

When the finite population parameters of interest are domain means (3), in (5) and and in (6) are respectively replaced by , and . We propose treating the regression parameters as the nuisance parameter and the finite set of as the parameter of interest. Thus, the profile empirical log-likelihood function considered is where denotes the parameter space of . The maximum empirical likelihood estimator that maximizes is given by (4) with being the solution to (2). We have .

3. Results

In this section, we report the observed coverages of the empirical likelihood confidence intervals for domain means (3). The population is generated from

where , , , with and , with . Here, and are selected randomly with replacement among the values and , respectively. The number of clusters is . The cluster sizes are generated randomly from , with , where is the standard deviation of , which gives ranging between and , with . The number of domains is . The domain sizes were determined based on that were generated from . The of was inflated by , another was inflated by , and the rest was inflated by . The units were randomly allocated into 250 domains of sizes ranging from to . The range of is [-0.33, 0.65].

We selected two-stage samples. The first stage selects a randomized systematic sample of with unequal probabilities proportional to , where . We have . For the second stage, simple random samples of were selected within the ith sample. We have . The sample sizes within domains are random.
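The two-stage selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the size measure is taken to be the cluster size itself, the second-stage sample size is the same in every selected cluster, and the cluster sizes, sample sizes and function names are hypothetical rather than those of the simulation study.

```python
import random


def randomized_systematic_pps(size_measures, n, rng):
    """Select n distinct units with inclusion probabilities proportional
    to size, using systematic sampling on a randomly ordered list."""
    total = sum(size_measures)
    pi = [n * s / total for s in size_measures]
    if max(pi) >= 1:
        raise ValueError("some inclusion probabilities reach 1; reduce n")
    order = list(range(len(size_measures)))
    rng.shuffle(order)    # randomization step: random ordering of the units
    start = rng.random()  # random start in [0, 1)
    selected, cumulative = [], 0.0
    for i in order:
        cumulative += pi[i]
        # Selection points are start, start + 1, ..., start + n - 1.
        # A unit is selected when its interval covers the next point.
        if len(selected) < n and start + len(selected) < cumulative:
            selected.append(i)
    return selected, pi


def two_stage_sample(cluster_sizes, n_psu, m_ssu, rng):
    """First stage: randomized systematic sampling of clusters with
    probabilities proportional to cluster size (an assumed size measure).
    Second stage: simple random sampling without replacement of m_ssu
    units within each selected cluster."""
    psus, _ = randomized_systematic_pps(cluster_sizes, n_psu, rng)
    return {i: rng.sample(range(cluster_sizes[i]), m_ssu) for i in psus}


rng = random.Random(2024)                    # fixed seed for reproducibility
cluster_sizes = [10, 20, 30, 40, 50, 60]     # hypothetical cluster sizes
sample = two_stage_sample(cluster_sizes, n_psu=2, m_ssu=3, rng=rng)
```

The returned dictionary maps each selected cluster index to the indices of the units sampled within it. Domain membership plays no role in the selection, which is why the sample sizes within domains are random.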

In Table 1, we present the range of observed coverages of empirical likelihood confidence intervals for domain means (3). In every domain, the observed coverage is not significantly different from the nominal level (95%). Coverages remain close to the nominal level even when domains have few or no sample units.

Table 1. Observed coverages (%) of 95% confidence intervals for domain means (3).

Expected domain sample size / Range of domain sample sizes / Number of domains / Range of observed coverages (%)
0-4 / [0, 18] / 22 / [94.9, 95.1]
5-9 / [0, 27] / 78 / [94.8, 95.1]
10-14 / [0, 33] / 50 / [94.9, 95.1]
15-19 / [2, 37] / 30 / [94.8, 95.1]
20-29 / [3, 53] / 42 / [94.8, 95.0]
30-69 / [10, 103] / 28 / [94.9, 95.1]

4. Conclusions

The proposed approach may provide better confidence intervals than standard methods when the point estimator is not normally distributed, the data are skewed or include outlying values, or the sample sizes within clusters are small. The numerical work shows that the empirical likelihood confidence intervals have observed coverages close to the nominal level for small domain means. Standard confidence intervals may have poor coverages when sample sizes are not large enough or the data include outlying values. The empirical likelihood confidence intervals have the advantage of not depending on variance estimation, re-sampling, linearisation or second-order inclusion probabilities, and they do not rely on the normality of the point estimator. The proposed approach takes the sampling design into account and can accommodate informative sampling.

References

[1] D. Pfeffermann, C. Skinner, D. Holmes, H. Goldstein, and J. Rasbash, “Weighting for unequal selection probabilities in multilevel models,” Journal of the Royal Statistical Society. Series B, 60, 23–40, 1998.

[2] C. J. Skinner and M. De Toledo Vieira, “Variance estimation in the analysis of clustered longitudinal survey data,” Survey Methodology, 33 (1), 3–12, 2007.

[3] J. N. K. Rao, F. Verret, and M. Hidiroglou, “A weighted composite likelihood approach to inference for two-level models from survey data,” Survey Methodology, 39, 263–282, 2013.

[4] M. Hansen, W. Hurwitz, and W. Madow, Sample Survey Methods and Theory, volume I. New York: John Wiley and Sons, 1953.

[5] M. H. Hansen and W. N. Hurwitz, “On the theory of sampling from finite populations,” The Annals of Mathematical Statistics, 14 (4), 333–362, 1943.

[6] J. Hájek, Sampling from a Finite Population. New York: Marcel Dekker, 1981.

[7] K. Liang and S. Zeger, “Longitudinal data analysis using generalized linear models,” Biometrika, 73, 13–22, 1986.

[8] J. N. K. Rao and I. Molina, Small Area Estimation. Wiley, Hoboken, NJ, 2nd ed., 2015.

[9] Y. G. Berger and O. De La Riva Torres, “An empirical likelihood approach for inference under complex sampling design,” Journal of the Royal Statistical Society. Series B, 78 (2), 319–341, 2016.

[10] H. O. Hartley and J. N. K. Rao, “A new estimation theory for sample surveys,” Biometrika, 55 (3), 547–557, 1968.

[11] M. Oguz-Alper and Y. G. Berger, “Empirical likelihood approach for modelling survey data,” Biometrika, 103 (2), 447–459, 2016.

[1] Statistics Norway, Postboks 8131 Dep, 0033, Oslo, Norway. This research is funded by the Economic and Social Research Council, United Kingdom.

[2] University of Southampton, Southampton Statistical Sciences Research Institute, Southampton, SO17 1BJ, United Kingdom.