Statistics 475 Notes 15

Reading: Lohr, Sections 6.1-6.2

Ruben’s question from last class: In planning a two stage cluster design, what happens if the cluster size is large but unknown?

Suppose the cost is $c_1$ for sampling a cluster and $c_2$ for sampling a unit within a cluster, so that the total cost of sampling $n$ clusters with $m$ units each is $C = c_1 n + c_2 n m$. The optimal sample size in each cluster can be written as

$$m_{opt} = \sqrt{\frac{c_1}{c_2}\cdot\frac{S_w^2}{S_b^2 - S_w^2/M}} ,$$

where $S_w^2$ is the within-cluster variance, $S_b^2$ is the variance of the cluster means, and $M$ is the (common) cluster size. If $M$ is large, then the term $S_w^2/M$ is negligible and the optimal $m$ is approximately

$$m_{opt} \approx \sqrt{\frac{c_1}{c_2}}\cdot\frac{S_w}{S_b} ,$$

which does not involve $M$. So when the clusters are large, the optimal subsample size is essentially unaffected by not knowing the exact cluster size.
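As a quick numerical check of the formula above, here is a small R sketch (the cost and variance values $c_1 = 100$, $c_2 = 5$, $S_w = 4$, $S_b = 2$ are made-up illustrative numbers, not values from the notes):

> # optimal subsample size as a function of the cluster size M
> m.opt <- function(M, c1, c2, Sw, Sb) sqrt((c1/c2) * Sw^2 / (Sb^2 - Sw^2/M))
> round(m.opt(M = c(50, 200, 1000, Inf), c1 = 100, c2 = 5, Sw = 4, Sb = 2), 2)
[1] 9.33 9.04 8.96 8.94

Once $M$ is in the hundreds, the optimal $m$ is essentially at its limiting value $\sqrt{c_1/c_2}\,S_w/S_b \approx 8.94$.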

I. Sampling clusters with unequal probabilities: motivating example

O’Brien et al. (1995, Journal of the American Medical Association) sought to estimate the preferences of nursing home residents in the Philadelphia area for life-sustaining treatments. Do they wish to have cardiopulmonary resuscitation (CPR) if the heart stops beating, or to be transferred to a hospital if a serious illness develops, or to be fed through an enteral tube if no longer able to eat? The target population was all residents of licensed nursing homes in the Philadelphia area. There were 294 such homes with a total of 37,652 beds (before sampling, they only knew the number of beds, not the total number of residents).

Because the survey was to be done in person, cluster sampling was essential for keeping survey costs manageable. Consider a two-stage cluster sample in which each nursing home has the same probability of being selected in the sample and then the subsample size for each home is proportional to the number of beds in the home. This is a self-weighting sample, meaning that the weights for each bed in the sample are the same, i.e., each bed in the population has the same probability of being sampled and each sampled bed represents the same number of beds in the population. However, this design has the following drawbacks:

(1) We would expect the total number of patients in a home who desire CPR ($t_i$) to be roughly proportional to the number of beds in the home ($M_i$), so the unbiased estimator of the population mean would have large variance. Using a ratio estimator would help to alleviate this concern.

(2) A self-weighting equal probability sample may be cumbersome to administer. It may require driving out to a nursing home just to interview one or two residents, and equalizing workloads of interviewers may be difficult.

(3) The cost of the sample is unknown in advance – a random sample of 40 homes may consist primarily of large nursing homes, which would lead to greater expense than anticipated.

Instead of taking a cluster sample of homes with equal probabilities, the investigators randomly drew a sample of 57 nursing homes with probabilities proportional to the number of beds. This is called probability proportional to size sampling of clusters. They then took a simple random sample of 30 beds (and their occupants) from a list of all beds within each sampled nursing home. If the number of residents equals the number of beds and if a home has the same number of beds when visited as are listed in the sampling frame, then the sampling design results in every resident having the same probability of being included in the sample. The cost is known before selecting the sample, the same number of interviews is taken at each home, and the estimator of a population total will likely have a smaller variance than if we had sampled the nursing homes with equal probabilities.

II. Unequal Probability Sampling with Replacement

We first consider how to sample clusters with probability proportional to size with replacement. Sampling with replacement means that the selection probabilities do not change after we have drawn the first unit. Although sampling without replacement is more efficient, sampling with replacement is often used because of the ease of selecting and analyzing samples. Let

$$\psi_i = P(\text{cluster } i \text{ is selected on the first draw}), \quad i = 1, \ldots, N.$$

For sampling with replacement, $\psi_i$ is also the probability that unit $i$ is selected on the second draw, or the third draw, or any other given draw. The overall probability that unit $i$ is selected at least once in a sample of $n$ draws is

$$1 - (1 - \psi_i)^n .$$

Example 1: Consider the following population of introductory statistics classes at a college. The college has 15 such classes; class $i$ has $M_i$ students, for a total of 647 students. We decide to sample 5 classes with replacement, with probability proportional to $M_i$, and then collect a questionnaire from each student in the sampled classes asking, among other things, how much time the student spent studying statistics. For this example, then, $\psi_i = M_i/647$.

Class Number / $M_i$ / $\psi_i$
1 / 44 / 0.068006
2 / 33 / 0.051005
3 / 26 / 0.040185
4 / 22 / 0.034003
5 / 76 / 0.117465
6 / 63 / 0.097372
7 / 20 / 0.030912
8 / 44 / 0.068006
9 / 54 / 0.083462
10 / 34 / 0.052550
11 / 46 / 0.071097
12 / 24 / 0.037094
13 / 46 / 0.071097
14 / 100 / 0.154560
15 / 15 / 0.023184
Total / 647 / 1

One way to sample the clusters with probabilities $\psi_i$ is the cumulative-size method. Calculate the cumulative totals of the $\psi_i$:

Class Number / $\psi_i$ / Cumulative Total of $\psi_i$
1 / 0.068006 / 0.068006
2 / 0.051005 / 0.119011
3 / 0.040185 / 0.159196
4 / 0.034003 / 0.193199
5 / 0.117465 / 0.310665
6 / 0.097372 / 0.408037
7 / 0.030912 / 0.438949
8 / 0.068006 / 0.506955
9 / 0.083462 / 0.590417
10 / 0.052550 / 0.642967
11 / 0.071097 / 0.714065
12 / 0.037094 / 0.751159
13 / 0.071097 / 0.822257
14 / 0.154560 / 0.976816
15 / 0.023184 / 1.000000
Total / 1

Sample a random uniform number between 0 and 1.

> runif(1)

[1] 0.6633096

Find the smallest cluster $i$ whose cumulative total exceeds the sampled random uniform number; this is the sampled cluster. Repeat with a new random uniform number for each of the remaining draws.

For the above random uniform number, the sampled cluster is 11.
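The whole with-replacement sample of 5 classes can be drawn the same way in R (a sketch using the class sizes from the table above; `sample` with a `prob` argument gives an equivalent shortcut):

> M <- c(44, 33, 26, 22, 76, 63, 20, 44, 54, 34, 46, 24, 46, 100, 15)
> psi <- M / sum(M)                                  # psi_i = M_i / 647
> u <- runif(5)                                      # one uniform number per draw
> sapply(u, function(x) which(cumsum(psi) >= x)[1])  # the 5 sampled classes
> # equivalently: sample(1:15, size = 5, replace = TRUE, prob = M)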

Another method for sampling clusters with probability proportional to $M_i$ is Lahiri’s (1951) method:

  1. Draw a random number between 1 and $N$ (the number of clusters). This indicates which cluster you are considering.
  2. Draw a random number between 1 and $\max_i M_i$: if the random number is less than or equal to $M_i$ for the cluster under consideration, then include cluster $i$ in the sample; otherwise, go back to step 1.
  3. Repeat until the desired sample size is obtained.

Lahiri’s method for Example 1: The largest class has $M_i = 100$ students, so we generate pairs of random integers, the first between 1 and 15 and the second between 1 and 100, until the sample contains five clusters.

First random number (cluster to consider) / Second random number / $M_i$ / Action
12 / 6 / 24 / 6<24; include cluster 12 in sample
14 / 24 / 100 / Include in sample
1 / 65 / 44 / 65>44; discard pair of numbers and try again
7 / 84 / 20 / 84>20; try again
10 / 49 / 34 / Try again
14 / 47 / 100 / Include
15 / 43 / 15 / Try again
5 / 24 / 76 / Include
11 / 87 / 46 / Try again
1 / 36 / 44 / Include
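A minimal R sketch of Lahiri’s method for this example (the function name `lahiri` is just for illustration):

> lahiri <- function(M, n) {
+   draws <- integer(0)
+   while (length(draws) < n) {
+     i <- sample.int(length(M), 1)          # step 1: candidate cluster
+     u <- sample.int(max(M), 1)             # step 2: integer between 1 and max(M_i)
+     if (u <= M[i]) draws <- c(draws, i)    # accept if u <= M_i; otherwise try again
+   }
+   draws
+ }
> lahiri(M, 5)   # e.g., could return 12 14 14 5 1, as in the table above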

Proof that Lahiri’s method produces a probability proportional to size sample:

Lahiri’s method is an example of rejection sampling. Consider selecting one cluster with Lahiri’s method. Let $\delta_j = 1$ if a cluster is accepted using the $j$th pair of random numbers and $\delta_j = 0$ otherwise. We have

$$P(\text{cluster } i \text{ is the one accepted on pair } j \mid \delta_j = 1) = \frac{\frac{1}{N}\cdot\frac{M_i}{\max_k M_k}}{\sum_{l=1}^{N}\frac{1}{N}\cdot\frac{M_l}{\max_k M_k}} = \frac{M_i}{\sum_{l=1}^{N} M_l} = \psi_i .$$

Since this holds for all $j$, the draw on which a cluster is finally accepted is independent of which cluster is accepted, and

$$P(\text{cluster } i \text{ is selected}) = \sum_{j=1}^{\infty} P(\delta_1 = \cdots = \delta_{j-1} = 0,\ \delta_j = 1)\,\psi_i = \psi_i .$$

III. Estimation Using Unequal Probability Sampling With Replacement

Because we are sampling with replacement, the sample may contain the same unit more than once. To keep track of which clusters occur multiple times in the sample, define the random variables $Q_i$ by

$$Q_i = \text{number of times cluster } i \text{ appears in the } n \text{ draws}, \quad i = 1, \ldots, N.$$

Let $t_i$ denote the total of the response in cluster $i$ and $t = \sum_{i=1}^{N} t_i$ the population total. An unbiased estimate of the population total is

$$\hat{t}_\psi = \frac{1}{n}\sum_{i=1}^{N} Q_i\,\frac{t_i}{\psi_i}. \qquad (1.1)$$

Note that $Q_i \sim \text{Binomial}(n, \psi_i)$ and $E[Q_i] = n\psi_i$, so that

$$E[\hat{t}_\psi] = \frac{1}{n}\sum_{i=1}^{N} n\psi_i\,\frac{t_i}{\psi_i} = \sum_{i=1}^{N} t_i = t ,$$

so that $\hat{t}_\psi$ is an unbiased estimator of the population total. Estimator (1.1) can be motivated as follows. Suppose we sample just one cluster, $n = 1$. Then the cluster $i$ chosen represents the proportion $\psi_i$ of the units in the population, and so a natural estimator of the total is $t_i/\psi_i$. Estimator (1.1) averages this estimate over the $n$ clusters (draws) in the sample.

Variance of $\hat{t}_\psi$: If $n = 1$, we have

$$V\!\left(\frac{t_i}{\psi_i}\right) = \sum_{i=1}^{N}\psi_i\left(\frac{t_i}{\psi_i} - t\right)^2 .$$

Then, the estimator (1.1) is just the average of $n$ independent observations, each with variance $\sum_{i=1}^{N}\psi_i\left(\frac{t_i}{\psi_i} - t\right)^2$, so

$$V(\hat{t}_\psi) = \frac{1}{n}\sum_{i=1}^{N}\psi_i\left(\frac{t_i}{\psi_i} - t\right)^2 . \qquad (1.2)$$

To estimate $V(\hat{t}_\psi)$ from a sample, we might think that we could use a formula of the same form as (1.2), but this will not work. Equation (1.2) involves a weighted average of the terms $(t_i/\psi_i - t)^2$, weighted by the unequal probabilities of selection $\psi_i$. But in taking the sample, we have already used the unequal probabilities – they appear in the random variables $Q_i$. If we included the $\psi_i$’s again as multipliers in estimating the variance, we would be using the unequal probabilities twice. Instead, to estimate the variance, use

$$\hat{V}(\hat{t}_\psi) = \frac{1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{N} Q_i\left(\frac{t_i}{\psi_i} - \hat{t}_\psi\right)^2 .$$

$\hat{V}(\hat{t}_\psi)$ is just the sample variance of the $t_i/\psi_i$’s over the $n$ draws, divided by the sample size $n$.

$\hat{V}(\hat{t}_\psi)$ is an unbiased estimate of the variance because the $n$ values of $t_i/\psi_i$ obtained on the draws are independent and identically distributed, each with mean $t$ and variance $\sum_{i=1}^{N}\psi_i\left(\frac{t_i}{\psi_i} - t\right)^2$; their sample variance is therefore unbiased for this per-draw variance, and dividing by $n$ gives an unbiased estimator of (1.2).

Example 1 continued: For the situation in Example 1 and the sample {12, 14, 14, 5, 1} we selected using Lahiri’s method, we have the following data:

Class / $\psi_i$ / $t_i$ / $t_i/\psi_i$
12 / 24/647 / 75 / 2021.875
14 / 100/647 / 203 / 1313.410
14 / 100/647 / 203 / 1313.410
5 / 76/647 / 191 / 1626.013
1 / 44/647 / 168 / 2470.364

The numbers in the last column of the table are the estimates of the population total $t$ that would be obtained if that cluster were the one selected in a sample of size 1. The population total is estimated by averaging the five values of $t_i/\psi_i$:

$$\hat{t}_\psi = \frac{2021.875 + 1313.410 + 1313.410 + 1626.013 + 2470.364}{5} = 1749.01 .$$

The standard error of $\hat{t}_\psi$ is simply $s/\sqrt{n}$, where $s$ is the sample standard deviation of the $t_i/\psi_i$’s:

$$SE(\hat{t}_\psi) = \frac{s}{\sqrt{5}} = \frac{497.4}{\sqrt{5}} \approx 222.4 .$$

Then, we estimate the average amount of time a student spent studying statistics by $\hat{t}_\psi$ divided by the population size:

$$\hat{\bar{y}}_\psi = \frac{1749.01}{647} = 2.70 \text{ hours, with } SE(\hat{\bar{y}}_\psi) = \frac{222.4}{647} = 0.34 .$$
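These calculations can be reproduced in R (a sketch; the $t_i$ values are the cluster totals from the table above):

> psi.s <- c(24, 100, 100, 76, 44) / 647     # psi_i for the sampled classes
> t.s <- c(75, 203, 203, 191, 168)           # cluster totals t_i
> that <- mean(t.s / psi.s)                  # estimated population total, about 1749
> se <- sd(t.s / psi.s) / sqrt(5)            # standard error, about 222
> c(that / 647, se / 647)                    # estimated mean hours per student and its SE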

The formulas for estimating the population total and population mean, and their standard errors, are valid for any choice of sampling probabilities $\psi_i$, regardless of whether the $\psi_i$’s are proportional to the sizes of the clusters. For example, if we are interested in obtaining an accurate estimate of the mean amount of time spent studying for the subpopulation of students on college sports teams, we might want to sample classes with more students on sports teams with higher probability.

IV. Two Stage Sampling With Replacement

If we sample clusters with unequal probabilities and then use simple random sampling within the clusters, the estimators from one-stage unequal probability sampling are modified slightly: the unknown cluster total $t_i$ is replaced by its estimate $\hat{t}_i = M_i \bar{y}_i$ from the subsample, and a new, independent subsample is taken each time a cluster is selected, to allow for different subsamples in clusters that are selected more than once:

$$\hat{t}_\psi = \frac{1}{n}\sum_{i=1}^{n} \frac{\hat{t}_i}{\psi_i}, \qquad \hat{V}(\hat{t}_\psi) = \frac{1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{\hat{t}_i}{\psi_i} - \hat{t}_\psi\right)^2 ,$$

where the sums are over the $n$ draws (so a cluster selected twice contributes two terms, one for each of its subsamples).

Returning to Example 1, suppose we subsample five students in each sampled class rather than observing all $M_i$ students.

Class / $M_i$ / $\psi_i$ / hours spent studying, $y_{ij}$ / $\bar{y}_i$ / $\hat{t}_i = M_i\bar{y}_i$ / $\hat{t}_i/\psi_i$
12 / 24 / 24/647 / 2,3,2.5,3,1.5 / 2.4 / 57.6 / 1552.8
14 / 100 / 100/647 / 2.5,2,3,0,0.5 / 1.6 / 160 / 1035.2
14 / 100 / 100/647 / 3,0.5,1.5,2,3 / 2.0 / 200 / 1294.0
5 / 76 / 76/647 / 1,2.5,3,5,2.5 / 2.8 / 212.8 / 1811.6
1 / 44 / 44/647 / 4,4.5,3,2,5 / 3.7 / 162.8 / 2393.9
average / 1617.5
std. dev. / 521.628

Thus, $\hat{t}_\psi = 1617.5$ and $SE(\hat{t}_\psi) = 521.628/\sqrt{5} = 233.3$. Then we estimate the average amount of time spent studying as $\hat{t}_\psi / K = 1617.5/647 = 2.50$ hours (where $K = 647$ is the population size) with $SE = 233.3/647 = 0.36$.
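The two-stage estimates can be checked in R the same way (a sketch using the subsample means from the table above):

> M.s <- c(24, 100, 100, 76, 44)             # class sizes for the 5 draws
> psi.s <- M.s / 647
> ybar <- c(2.4, 1.6, 2.0, 2.8, 3.7)         # subsample means
> that.i <- M.s * ybar                       # estimated class totals
> mean(that.i / psi.s)                       # estimated total, 1617.5
> sd(that.i / psi.s) / sqrt(5)               # standard error, about 233.3
> mean(that.i / psi.s) / 647                 # estimated mean hours per student, 2.5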
