Stat 475/920 Notes 2

Reading: Lohr, Chapter 2

Office hour and R session notes:

Prof. Small’s office hours: Tues., Thurs., 4:45-5:45; by appointment. Huntsman Hall 464.

Xin Fu’s office hours: Monday, 5-6.

Introductory tutorial on R: Monday, Sept. 15th, 5-6, Huntsman Hall 441.

The goal of survey sampling is to obtain information about a population by examining only a fraction of the population.

To simplify presentation, we assume for now that the sampled population is the target population, that the sampling frame is complete, that there is no nonresponse or missing data and that there are no errors of observation. We will return to nonsampling errors later.

I. Population

In survey sampling, we are typically concerned with a finite population (universe) of N units labeled 1, 2, ..., N.

Let y_i be a characteristic associated with the i-th unit, e.g., y_i could be the income of the i-th person in the population.

The population quantities we will most often be interested in estimating are

(1) The population mean ybar_U: ybar_U = (1/N) * sum_{i=1}^{N} y_i.

(2) The population total t: t = sum_{i=1}^{N} y_i = N * ybar_U.

(3) The population variance S^2: S^2 = (1/(N-1)) * sum_{i=1}^{N} (y_i - ybar_U)^2.

(4) The population standard deviation S: S = sqrt(S^2).

The population standard deviation S can be thought of as measuring the typical absolute deviation of a randomly chosen unit's y_i from the population mean ybar_U.

The coefficient of variation (CV) is a measure of variability in the population relative to the mean: CV = S/ybar_U (defined when ybar_U is not 0).

It is sometimes helpful to have a special notation for proportions. The proportion of units having a characteristic is simply a special case of the mean, obtained by letting y_i = 1 if the i-th unit has the characteristic of interest (e.g., has income above $100,000 per year) and y_i = 0 if the i-th unit does not have the characteristic of interest. Let p be the proportion of units with the characteristic of interest: p = (1/N) * sum_{i=1}^{N} y_i = (number of units with the characteristic)/N.
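These population quantities are easy to compute directly in R. The small population of incomes below is made-up data for illustration only:

```r
# Population quantities for a hypothetical population of N = 6
# incomes (in thousands of dollars); the vector y is made-up data.
y <- c(42, 58, 75, 110, 36, 95)
N <- length(y)
pop.mean <- mean(y)                        # population mean ybar_U
pop.total <- sum(y)                        # population total t = N * ybar_U
pop.var <- sum((y - pop.mean)^2)/(N - 1)   # population variance S^2
pop.sd <- sqrt(pop.var)                    # population standard deviation S
cv <- pop.sd/pop.mean                      # coefficient of variation S/ybar_U
p <- mean(y > 100)                         # proportion with income > $100,000
```

Note that R's built-in var() and sd() use the same N - 1 denominator as the population variance formula above.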

II. Simple Random Sampling

Probability sampling: Each unit in the population has a known probability of being selected in the sample and a chance method such as using numbers from a random number table is used to choose the specific units to be included in the sample.

The simplest type of probability sampling is simple random sampling without replacement (SRS). A simple random sample without replacement of size n can be drawn by putting the numbers 1, ..., N into a hat and randomly selecting n of them. In a simple random sample with replacement, after each draw, we put the number drawn back into the hat. For short, we'll call simple random sampling without replacement just simple random sampling, as simple random sampling without replacement is preferable to simple random sampling with replacement for a finite population.

In a simple random sample of size n, every possible subset of n distinct units has the same probability of being selected into the sample. There are (N choose n) = N!/(n!(N-n)!) possible samples and each is equally likely, so the probability of selecting any individual sample of n units is 1/(N choose n).

Let pi_i denote the probability that the i-th unit will appear in the sample. In simple random sampling, pi_i = n/N for all the units (we'll prove this later).
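The claim that each unit's inclusion probability is n/N can be checked by simulation in a small case; the values N = 10 and n = 3 below are chosen just for illustration:

```r
# With N = 10 and n = 3 there are choose(10, 3) = 120 equally likely
# samples, and each unit's inclusion probability should be n/N = 0.3.
set.seed(1)
N <- 10; n <- 3
choose(N, n)               # number of possible samples: 120
nosims <- 100000
counts <- rep(0, N)
for (i in 1:nosims) {
  s <- sample(N, n, replace = FALSE)
  counts[s] <- counts[s] + 1   # tally how often each unit is sampled
}
counts/nosims              # each entry should be close to 0.3
```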

We can draw a simple random sample of size n from a population of size N in R by the following commands:

> n=25

> N=500

> sample(N,n,replace=FALSE)

[1] 274 218 160 249 50 451 271 496 378 178 84 3 328 19 141 190 279 248 221

[20] 37 425 106 40 404 43

Example of simple random sampling: The Effect of Agent Orange on Troops in Vietnam

During the Vietnam War, the U.S. army engaged in herbicidal warfare in which the objective was to destroy forests that provided cover for the Viet Cong army. The most widely used herbicide was known as Agent Orange.

A UH-1D helicopter from the 336th Aviation Company sprays a defoliation agent on a dense jungle area in the Mekong Delta. 07/26/1969/National Archives photograph.

Many Vietnam veterans are concerned that their health may have been affected by exposure to Agent Orange. The particularly worrisome component of Agent Orange is dioxin, which in high doses is known to be associated with certain cancers. High levels of dioxin can be detected 20 or more years after heavy exposure to Agent Orange. To examine the dioxin levels in Vietnam veterans, researchers from the Centers for Disease Control used military records to identify a sampling frame of 646 living U.S. Army combat personnel who served in Vietnam during 1967 and 1968 in the areas that were most heavily treated with Agent Orange (Centers for Disease Control Veterans Health Studies, “Serum 2,3,7,8-Tetrachlorodibenzo-p-dioxin Levels in U.S. Army Vietnam-era Veterans,” Journal of the American Medical Association 260, Sept. 2, 1988: 1249-1254).

In the actual study, the dioxin levels in 1987 were obtained from all 646 veterans. The dioxin level is measured as concentration in parts per trillion. The data can be found on the class web site under Data Sets. The data can be read into R by the following command

agent.orange.data=read.table("agent_orange_data.txt", header=TRUE)

  • The population mean is

dioxin=agent.orange.data$dioxin

mean(dioxin)

[1] 4.260062

  • The population standard deviation is

sd(dioxin)

[1] 2.642617

  • Histogram of the population distribution of dioxin

hist(dioxin)

  • As a point of comparison, the mean dioxin level in 1987 in a sample of veterans who served between 1965-1971 in the United States and Germany was 4.19.

Suppose that instead of obtaining the dioxin levels of all 646 veterans, we would like to take a sample of size 50.

n=50

N=646

sample1=sample(N,n,replace=FALSE)

sample1

[1] 132 527 542 110 475 354 458 586 370 389 544 626 538 324 159 359 272 233 213

[20] 422 238 522 596 182 329 37 74 480 563 528 623 38 81 610 111 543 577 525

[39] 344 335 4 198 83 452 145 249 455 418 540 175

mean(dioxin[sample1])

[1] 4.42

Another sample is

sample2=sample(N,n,replace=FALSE)

sample2

[1] 198 501 254 392 8 417 237 221 363 146 102 390 600 388 453 55 233 211 292

[20] 34 190 549 185 579 478 111 583 217 407 103 441 263 46 452 486 199 499 70

[39] 401 529 219 1 68 227 477 415 618 77 341 362

mean(dioxin[sample2])

[1] 3.78

How accurate is the sample mean as an estimate of the population mean? The key to answering this question is to understand the sampling distribution of the sample mean.

III. Sampling Distribution

Let ybar = (1/n) * sum_{i in S} y_i denote the sample mean, where S denotes the sample. The sampling distribution of ybar is the distribution of ybar over repeated samples.

Mean and variance of the sampling distribution of ybar (to be proved later):

Mean: E(ybar) = ybar_U.

The sample mean ybar is an unbiased estimator of the population mean ybar_U.

Variance: Var(ybar) = (1 - n/N) * S^2/n.

Finite population correction:

The usual formula for the variance of the sample mean of independent and identically distributed random variables is the variance divided by the sample size. The members of our sample are not independent because we are sampling without replacement. For simple random sampling without replacement, we multiply S^2/n by an additional factor (1 - n/N), which is called the finite population correction (fpc). Intuitively, we make this correction because with small populations, the greater our sampling fraction n/N, the more information we have about the population and thus the smaller the variance. If we take a complete census, i.e., we sample the whole population so that n = N, then the fpc is 0 and Var(ybar) = 0 since ybar will always equal ybar_U.

For most samples that are taken from extremely large populations, the fpc is approximately 1. For large populations it is the size of the sample taken, not the sampling fraction, that determines the precision of the estimator. For example, a sample of size 100 from a population of 100,000 has almost the same precision as a sample of size 100 from a population of 100 million: the fpc is 1 - 100/100,000 = 0.999 in the first case and 1 - 100/100,000,000 = 0.999999 in the second.

If your soup is well stirred, you need to taste only one or two spoonfuls to check the seasoning, whether you have made 1 liter or 20 liters of soup.
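The fpc comparison for the two population sizes can be checked directly in R; this is a minimal sketch of the sample-of-100 example above:

```r
# fpc (1 - n/N) for a sample of 100 from two very different population
# sizes; both factors are essentially 1, so the variance
# (1 - n/N) * S^2/n is nearly identical in the two cases.
n <- 100
fpc.small <- 1 - n/1e5   # population of 100,000
fpc.large <- 1 - n/1e8   # population of 100 million
c(fpc.small, fpc.large)  # 0.999 and 0.999999
```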

The variance of ybar involves the population variance S^2, which depends on the y values for the entire population. We can estimate the population variance by the sample variance:

s^2 = (1/(n-1)) * sum_{i in S} (y_i - ybar)^2,

where S denotes the sample.

An unbiased estimator of the variance of ybar is Vhat(ybar) = (1 - n/N) * s^2/n.

We usually report not the estimated variance of ybar but its square root, the standard error (SE):

SE(ybar) = sqrt((1 - n/N) * s^2/n).
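As a sketch, the standard error formula can be wrapped in a small R function; the function name se.srs is ours, not from the notes or any package:

```r
# Standard error of the sample mean under SRS without replacement:
# sqrt((1 - n/N) * s^2/n). var() uses the n - 1 denominator, matching s^2.
se.srs <- function(y.sample, N) {
  n <- length(y.sample)
  sqrt((1 - n/N) * var(y.sample)/n)
}
# e.g., a hypothetical sample of 4 values from a population of 100 units:
se.srs(c(3.1, 4.7, 2.2, 5.0), N = 100)
```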

We can simulate the sampling distribution of ybar for the Agent Orange data:

nosims=2000; # of samples to be simulated

n=50; # size of sample

N=646; # size of population

samplemean=rep(0,nosims);

for(i in 1:nosims){

tempsample=sample(N,n,replace=FALSE);

samplemean[i]=mean(dioxin[tempsample]);

}

mean(samplemean) # mean of sample means

[1] 4.25794

mean(dioxin) # population mean

[1] 4.260062

sd(samplemean)

[1] 0.3562426

truesd.samplemean=sqrt(1-50/646)*(sd(dioxin)/sqrt(50)) # true SD of sample mean

truesd.samplemean

[1] 0.3589683

hist(samplemean) # histogram of sample means

Estimating proportions: For proportions, the above formulas for the sampling distribution can be specialized.

Let p denote the population proportion and phat denote the sample proportion of units with the characteristic of interest.

Then, E(phat) = p and Var(phat) = (1 - n/N) * (N/(N-1)) * p(1-p)/n.

Also, an unbiased estimator of Var(phat) is Vhat(phat) = (1 - n/N) * phat(1-phat)/(n-1).

Example:

Suppose we are interested in the proportion of Vietnam veterans with dioxin level greater than 5.

n=50;

N=646;

tempsample=sample(N,n,replace=FALSE);

phat=mean(dioxin[tempsample]>5);

var.phat=(phat*(1-phat)/(n-1))*(1-n/N);

se.phat=sqrt(var.phat);

phat

[1] 0.14

se.phat

[1] 0.04761262

Estimating the population total:

The natural estimate of the population total t is t_hat = N * ybar.

(Note that t = N * ybar_U.)

The sampling distribution of t_hat has mean E(t_hat) = t and variance Var(t_hat) = N^2 * (1 - n/N) * S^2/n, which is estimated by N^2 * (1 - n/N) * s^2/n.
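A sketch of estimating a total from an SRS; the population vector y below is simulated stand-in data (not the dioxin data), and the seed is arbitrary:

```r
# Estimate a population total t = sum(y) from a simple random sample.
set.seed(475)
N <- 646
y <- rgamma(N, shape = 10, rate = 1)   # hypothetical population values
n <- 50
s <- sample(N, n, replace = FALSE)     # SRS of unit labels
t.hat <- N * mean(y[s])                          # estimated total N * ybar
se.t.hat <- N * sqrt((1 - n/N) * var(y[s])/n)    # SE(t_hat) = N * SE(ybar)
c(t.hat, sum(y))                       # estimate vs. true total
```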

IV. Confidence Intervals

Consider our first sample from the Vietnam veterans population:

n=50

N=646

sample1=sample(N,n,replace=FALSE)

mean(dioxin[sample1])

[1] 4.42

It is not sufficient to just report the sample mean. We would like to give an idea of the accuracy of our estimate of the population mean and a range of plausible values for the population mean. This is done by means of a confidence interval (CI).

A 95% confidence interval for a population parameter should have the following property:

If we take samples from our population over and over again and construct a confidence interval using our procedure for each of the samples, 95% of the resulting intervals should include the true value of the population parameter.

Another way of thinking about confidence intervals is if we take a survey on a different subject each day and come up with a valid 95% confidence interval for the population mean, then over our lifetime, about 95% of the intervals will contain the true population means.

In introductory statistics, for situations in which the population is infinite, a central limit theorem for infinite population sampling with replacement was introduced and used to form approximate confidence intervals:

If X_1, ..., X_n are iid random variables with mean mu and variance sigma^2, then under regularity conditions, for n sufficiently large,

(Xbar - mu)/(sigma/sqrt(n)), where Xbar = (X_1 + ... + X_n)/n,

has approximately a standard normal distribution.

For finite population sampling, the usual central limit theorem cannot be applied because the sample size n cannot exceed the population size N. However, there is a finite population central limit theorem in which we consider a sequence of populations and sample sizes such that the population sizes and sample sizes increase to infinity. Hájek (1960) proved that if certain technical conditions hold and if n, N, and N - n are all “sufficiently large,” then

(ybar - ybar_U)/sqrt((1 - n/N) * S^2/n)

has approximately a standard normal distribution.

For n, N, and N - n “sufficiently large,” an approximate 95% confidence interval for the population mean is

ybar ± 1.96 * sqrt((1 - n/N) * s^2/n). (1.1)

Note that if n/N is small, then this confidence interval is approximately the same as the usual confidence interval ybar ± 1.96 * s/sqrt(n) from introductory statistics.
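The interval (1.1) can be packaged as a short R helper; the function name ci.srs and the sample values below are ours, for illustration only:

```r
# Approximate 95% CI for the population mean under SRS:
# ybar +/- 1.96 * sqrt((1 - n/N) * s^2/n)
ci.srs <- function(y.sample, N) {
  n <- length(y.sample)
  se <- sqrt((1 - n/N) * var(y.sample)/n)
  mean(y.sample) + c(-1.96, 1.96) * se   # (lower, upper) endpoints
}
# hypothetical sample of 5 values from a population of 200 units:
ci.srs(c(4.2, 3.8, 5.1, 2.9, 4.4), N = 200)
```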

The imprecise term sufficiently large in the central limit theorem occurs because the adequacy of the normal approximation depends on n and on how closely the population resembles a population generated from a normal distribution. The magic number of n = 30, often cited in introductory statistics books as a sample size that is “sufficiently large” for the central limit theorem to apply, often does not suffice in finite population sampling problems. Many populations are highly skewed.

Simulation study of performance of confidence intervals for the Vietnam veteran agent orange study:

Recall that the dioxin distribution is fairly skewed.

To simulate the coverage (proportion of times the confidence interval contains the true population parameter) for the approximate 95% CI (1.1) for a sample size of n (n=50 in the code), we use the following R code:

nosims=10000;

n=50;

N=646;

samplemean=rep(0,nosims); # vector to store the sample mean

lowerci=rep(0,nosims); # vector to store lower end point of confidence interval

upperci=rep(0,nosims); # vector to store upper end point of confidence interval

popmean=mean(dioxin); # population mean

for(i in 1:nosims){

tempsample=sample(N,n,replace=FALSE); # take sample

samplemean[i]=mean(dioxin[tempsample]); # compute sample mean

se=sqrt((1-n/N)*sd(dioxin[tempsample])^2/n);

# computes standard error of sample mean

lowerci[i]=samplemean[i]-1.96*se; # lower end point of CI

upperci[i]=samplemean[i]+1.96*se; # upper end point of CI

}

hist(samplemean) # histogram of sample means

ci.coverage=mean((lowerci<=popmean)*(upperci>=popmean));

# Proportion of times confidence interval contains the population mean

ci.coverage

Sample size n   Coverage of large-sample 95% CI (1.1)
15              0.903
30              0.911
50              0.919
75              0.921
100             0.922
300             0.921

The confidence interval only has about 90-91% coverage for n = 15 and n = 30, and around 92% coverage for n = 50 through n = 300.

For a more normally distributed population, the confidence intervals perform better.

We simulate data from a Gamma distribution with shape=10 and rate=1.

# Simulation from Gamma distribution

N=646

gammapop=rgamma(N,shape=10,rate=1);

hist(gammapop);

For this population, the simulated confidence interval coverage is

Sample size n   Coverage of large-sample 95% CI (1.1)
15              0.929
30              0.937
50              0.946
75              0.946
100             0.946
300             0.949

For skewed populations, stratified sampling is useful for obtaining more accurate estimates and more accurate confidence intervals (stratified sampling will be studied in Chapter 4).

Confidence intervals for proportions: For a proportion, the central limit theorem based confidence interval is

phat ± 1.96 * sqrt((1 - n/N) * phat(1 - phat)/(n - 1)),

which, for n/N small and n large, is approximately

phat ± 1.96 * sqrt(phat(1 - phat)/n).

Brown, Cai and DasGupta (2001, Statistical Science) show that this confidence interval has erratic coverage properties. They recommend several alternatives, the easiest of which is to use

ptilde = (X + 2)/(n + 4),

where X is the number of units in the sample with the characteristic of interest, in place of phat, i.e., the CI is

ptilde ± 1.96 * sqrt(ptilde(1 - ptilde)/(n + 4))

(ptilde is the Bayes estimate of p based on a Beta(2,2) prior).
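A sketch of this adjusted (“plus four”) interval in R, using ptilde = (X + 2)/(n + 4); the count X = 7 is hypothetical, and the n + 4 in the standard error follows the Agresti-Coull form of the Brown, Cai and DasGupta recommendation:

```r
# Adjusted CI for a proportion: shift the estimate to (X + 2)/(n + 4)
# and use n + 4 in place of n in the standard error.
n <- 50
X <- 7                     # hypothetical count of sample units with the characteristic
ptilde <- (X + 2)/(n + 4)
se.tilde <- sqrt(ptilde*(1 - ptilde)/(n + 4))
ptilde + c(-1.96, 1.96)*se.tilde   # (lower, upper) endpoints
```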
