Stat 511 Fall 2003

Midterm 1

Oct. 23, 2003

The following rules apply.

1.  You may use 3 sheets of paper for any information you need - double-sided, any font.

2.  You may use a calculator.

3.  You may not collaborate or copy.

4.  Failure to comply with item 3 could lead to reduction in your grade, or disciplinary action.

I have read the rules above and agree to comply with them.

Signature ______

Name (printed) ______


1. A hospital recorded 104 births last year, of which 61 were boys. (For this study, identical twins are considered a single birth.) Genetic theory says that the probability of a child being a boy is 50%, and that (excepting identical twins) all births are independent for gender, income and other factors.

a. (5) An environmental group in the town speculated that the "excess of boy babies" could be due to the use of hormone-like organic chemicals at a nearby factory. Is there evidence that the probability of a child being a boy in this community is higher than 50%? Justify your response with a statistical test.

H0: P≤0.5 HA: P>0.5

Test statistic: check nP0 > 5: 104(0.5) = 52, OK

Z* = (p̂ - P0)/sqrt(P0(1 - P0)/n), where p̂ = 61/104 = 0.5865 and P0 = 0.5

= (0.5865 - 0.5)/0.049

= 1.765

Rejection Region or P-value

This is a one-sided test. The estimated percentage favors the alternative hypothesis. So, from the table:

.025 < p-value < .05

Alternatively, the rejection region at alpha=.05 is Z*> 1.645

Conclusion: We should reject the null hypothesis. We are unlikely to observe this many boys if the true percentage is 50%.
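The test above can be checked with a short Python sketch using only the standard library. The function name is illustrative and not part of the original solution; the normal approximation is the same one used above.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_proportion_z_test(successes, n, p0):
    """Return (z, p_value) for H0: P <= p0 versus HA: P > p0."""
    p_hat = successes / n
    se_null = sqrt(p0 * (1 - p0) / n)      # standard error computed under H0
    z = (p_hat - p0) / se_null
    p_value = 1 - NormalDist().cdf(z)      # upper-tail area of the standard normal
    return z, p_value

z, p_value = one_sided_proportion_z_test(61, 104, 0.5)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
print("reject H0 at alpha = 0.05" if z > 1.645 else "do not reject H0")
```

The p-value falls between .025 and .05, in agreement with the table lookup.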


b. (5) A smaller private hospital in town recorded only 21 births. Based on the data from the larger hospital, what is a 95% interval for the number of boys born at this hospital?

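Formula: m(p̂ ± c sqrt(p̂(1 - p̂)(1/m + 1/n))), where m=21, n=104, p̂ = 61/104 and c is the two-sided 95% critical value

Interval: 21(0.5865 ± 0.2335)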

= 21(0.353,0.820)

= (7,18)
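A minimal Python sketch of this kind of interval follows. It assumes the usual large-sample prediction interval for a binomial count and the normal critical value 1.96; both are assumptions, and the proportions shown above (0.353, 0.820) are slightly wider, which suggests a slightly larger critical value was used in the solution. The function name is illustrative.

```python
from math import sqrt

def count_prediction_interval(p_hat, n, m, crit):
    """Approximate prediction interval for the number of successes in m new trials,
    based on an estimate p_hat from n earlier trials."""
    se = sqrt(p_hat * (1 - p_hat) * (1 / m + 1 / n))   # s.e. of (future proportion - p_hat)
    return m * (p_hat - crit * se), m * (p_hat + crit * se)

# crit=1.96 is an assumed critical value, not one stated in the solution
lower, upper = count_prediction_interval(61 / 104, n=104, m=21, crit=1.96)
print(f"between {lower:.2f} and {upper:.2f} boys out of 21 births")
```

Rounding outward to whole births gives the same range of 7 to 18 boys.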

c. (2) Suppose that the percentage of male births for this town is actually 50% as suggested by genetic theory, that the larger hospital records about 100 births annually, and the smaller hospital records about 20 births annually. In a randomly selected year, which hospital has a greater chance that 65% or more of the babies are boys? (Do not compute this probability, but justify your answer in one or two sentences based on a statistical argument.)

The smaller hospital has a larger probability of having 65% or more boys in any year, because the variance of the yearly percentage, P(1-P)/n, is greater when n is smaller. Hence, there will be more years that have a percentage far from 50%.
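Although the question does not ask for the probabilities, the argument can be illustrated with exact binomial tail probabilities. This is a sketch only; the function name is illustrative, and the sample sizes 20 and 100 are the approximate annual counts given in the question.

```python
from math import comb

def prob_at_least_percent_boys(n, percent, p=0.5):
    """Exact binomial P(at least `percent`% of n births are boys)."""
    k_min = (percent * n + 99) // 100      # integer ceiling of percent*n/100
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

small = prob_at_least_percent_boys(20, 65)    # smaller hospital
large = prob_at_least_percent_boys(100, 65)   # larger hospital
print(f"n = 20:  {small:.4f}")
print(f"n = 100: {large:.4f}")
```

The tail probability is far larger for n = 20, as the variance argument predicts.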

2. According to the Environmental Protection Agency, the maximum allowable PCB concentration in drinking water is 0.5 ppb (parts per billion). A well was sampled 16 times. The data were thought to be normally distributed after taking natural logarithms. The mean and standard deviation of the log(concentrations) are:

sample size: 16

mean : -0.73 log(ppb)

s.d. : 0.035

a. (2) Sketch the normal probability plot of the original data, which would lead the investigator to take the natural logarithm. (Label the x and y axes with variable names, and indicate the shape of the plot. Clearly you cannot supply data values.)


b. (4) Compute a 99% confidence interval for the mean concentration in ppb for water from this well. (Compute your interval to 3 decimal places.)

Formula: exp(ȳ ± t(α/2, n-1) s/√n), where ȳ and s are the mean and s.d. of the log(concentrations)

Interval: for ln(concentration) -0.73 ± 2.947(0.035/√16)

= (-0.756,-0.704)

for concentration in ppb: (exp (-0.756), exp(-0.704))

= (0.469,0.495)
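The back-transformed confidence interval can be sketched in Python as follows. The critical value 2.947 is hard-coded from the t table (15 d.f.), as in the solution; the variable names are illustrative.

```python
from math import exp, sqrt

n, mean_log, sd_log = 16, -0.73, 0.035
t_crit = 2.947                       # t table value for 99% confidence, 15 d.f.

half_width = t_crit * sd_log / sqrt(n)
log_ci = (mean_log - half_width, mean_log + half_width)   # interval on the log scale
ppb_ci = tuple(exp(v) for v in log_ci)                     # back-transform to ppb

print(f"ln(ppb): ({log_ci[0]:.3f}, {log_ci[1]:.3f})")
print(f"ppb:     ({ppb_ci[0]:.3f}, {ppb_ci[1]:.3f})")
```

The third decimal place can differ by one unit from a hand calculation, depending on whether the log-scale limits are rounded before exponentiating.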

c. (4) Compute a 99% prediction interval for concentrations of samples (in ppb) for water from this well. (Compute your interval to 3 decimal places.)

Formula: exp(ȳ ± t(α/2, n-1) s sqrt(1 + 1/n))

Interval: for ln(concentration) -0.73 ± 2.947(0.035)sqrt(1+1/16)

= (-0.836,-0.624)

for concentration in ppb: (exp (-0.836), exp(-0.624))

= (0.433, 0.536)
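The prediction interval differs from a confidence interval only in the factor sqrt(1 + 1/n) in place of 1/sqrt(n). A short Python sketch, again with the table value 2.947 hard-coded and illustrative variable names:

```python
from math import exp, sqrt

n, mean_log, sd_log = 16, -0.73, 0.035
t_crit = 2.947                       # t table value for 99%, 15 d.f.

half_width = t_crit * sd_log * sqrt(1 + 1 / n)   # extra "1" accounts for a single new sample
log_pi = (mean_log - half_width, mean_log + half_width)
ppb_pi = tuple(exp(v) for v in log_pi)

print(f"ln(ppb): ({log_pi[0]:.3f}, {log_pi[1]:.3f})")
print(f"ppb:     ({ppb_pi[0]:.3f}, {ppb_pi[1]:.3f})")
```

The upper limit of this interval is above the 0.5 ppb standard, which matters for the questions that follow.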


d. (2) The owner of the well complains that a 99% interval is too strict. Would a 99% interval or a 95% interval lead to an outcome more favorable to the well owner? Justify your answer briefly.

The owner would like to show a lower concentration of PCB. Since the center of both the prediction and confidence intervals is BELOW the EPA standard of 0.5 ppb, the owner wants a shorter interval that will lie entirely below 0.5 ppb. A 95% interval will be shorter and will favor the owner (unless the owner's concern is to be healthy, in which case he would prefer the longer interval).

e. (5) An adult who lived near the well used the well water as the main water supply over a 10 year period. Continuous exposure at levels above 0.5 ppb can lead to cancer. Assuming that the PCB level has been fairly constant during the 10 years, is this person likely to develop cancer due to exposure from this water? Justify your answer using ONE of the intervals computed above. (Please, no detailed medical explanations - just stick to the information given.)

Which interval should be used? confidence

Why is this the appropriate interval?

Continuous exposure is best represented by the mean over many samples. The person has used the water for 10 years, so has drunk hundreds of "samples". Prediction here refers to a single sample - not a single person.

What conclusion do you draw from this interval about cancer risk from drinking water from the well over a 10 year period?

The upper limit of the 99% CI (0.495 ppb) is below 0.5 ppb. This person likely has not had sufficient exposure to be at risk of cancer due to exposure to this water.

Note i) We cannot interpret the 99% PI or CI to be the probability that the PCB level is above 0.5 ppb. Even if the level is above 0.5 ppb, we cannot interpret the alpha-level of the interval to be the probability that the person will develop cancer. I took off points for these wrong interpretations.

ii) In part e, we talk about "long-term exposure". On the EPA web page, it notes that short-term exposure can also lead to ill effects. Short-term exposure would be assessed with the PI, whose upper limit (0.536 ppb) is above 0.5 ppb. The water from this well should not be used for drinking.


3. Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical analysis of diesel engine exhaust, the mass of exhaust (x) and the amount of elemental carbon were recorded.

Some of the SAS output is attached.

a. (2) Below is a plot of carbon versus mass, with the fitted regression line. On this plot, label the point with the highest leverage and the largest (in absolute value) residual.


b. (1) The faculty member looks at the residual and notes that the largest residual is r=30. Is this cause to suspect an outlier? Justify your answer.

We cannot determine if a point is an outlier from the raw residuals, because their size depends on the scale of the response and on the leverage of the point. We need to look at the studentized residuals.

c. (2) The plot of Cook's distance versus mass is below. Why does the data value at mass=387 have the largest influence? Your response should refer to the plot of carbon versus mass.

Answer: Influence is a combination of leverage and residual. This data value does not have a large residual, so it must be influential due to the large leverage.


The residual plots looked fine, and the data appeared to be roughly normally distributed, so the investigator proceeded with the regression analysis. The SAS output is below (with some blanks). The sample size is 25.

The REG Procedure

Model: MODEL1

Dependent Variable: carbon

Analysis of Variance

                            Sum of          Mean
Source              DF     Squares        Square    F Value    Pr > F
Model             ____       64087         64087     458.15    <.0001
Error             ____  3217.27311     139.88144
Corrected Total   ____       67304

Parameter Estimates

                  Parameter     Standard
Variable    DF     Estimate        Error    t Value    Pr > |t|
Intercept    1     30.98933      5.04626       6.14      <.0001
mass         1      0.73660      0.03441      21.40      <.0001

d. (2) What is the value of R²? What does this value indicate about the use of mass to predict carbon?

R² = SSR/SSTo = 64087/67304 = 0.952 (95.2%)

Mass is a very good predictor of carbon.
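The ANOVA quantities can be cross-checked from the sums of squares on the output. The sketch below is illustrative only; it uses n - 2 = 23 error degrees of freedom for a simple linear regression with n = 25.

```python
ss_model = 64087.0        # Model sum of squares from the SAS output
ss_error = 3217.27311     # Error sum of squares
ss_total = 67304.0        # Corrected Total sum of squares
n = 25

r_squared = ss_model / ss_total
mse = ss_error / (n - 2)          # error mean square on n - 2 d.f.
f_value = ss_model / mse          # model has 1 d.f., so MSR equals SSR

print(f"R^2 = {r_squared:.3f}, MSE = {mse:.5f}, F = {f_value:.2f}")
```

The recomputed mean square and F value agree with the printed output.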

e. (2) The investigator expected the regression to go through the origin, since there cannot be carbon if there is no mass. Using the information on the computer output, test whether or not the intercept is zero.

You could compute this from the formulas, but this is not required because the test is on the computer output under "Parameter Estimates"

t*=6.14, p<.0001

We reject the null hypothesis that the intercept is zero.
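For readers who prefer to compute the statistic from the formulas, each t value on the output is simply the estimate divided by its standard error. A minimal sketch, with the 5% two-sided critical value 2.069 (23 d.f.) taken from a t table and illustrative variable names:

```python
b0, se_b0 = 30.98933, 5.04626     # intercept estimate and standard error from the output
b1, se_b1 = 0.73660, 0.03441      # slope estimate and standard error

t_intercept = b0 / se_b0          # tests H0: intercept = 0
t_slope = b1 / se_b1              # tests H0: slope = 0
t_crit = 2.069                    # t table value, 23 d.f., two-sided alpha = 0.05

print(f"t (intercept) = {t_intercept:.2f}, t (slope) = {t_slope:.2f}")
print("reject H0: intercept = 0" if abs(t_intercept) > t_crit else "do not reject")
```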

f. (6) Compute a 95% confidence interval for the mean carbon when the mass is 350. (Some relevant univariate information is below.)

The UNIVARIATE Procedure

Variable: mass

Moments

N                          25    Sum Weights              25
Mean                  129.528    Sum Observations     3238.2
Std Deviation      70.1528244    Variance         4921.41877
Skewness           2.48687253    Kurtosis         7.22713866
Uncorrected SS      537551.62    Corrected SS      118114.05
Coeff Variation    54.1603548    Std Error Mean   14.0305649

Variable: carbon

Moments

N                          25    Sum Weights              25
Mean                    126.4    Sum Observations       3160
Std Deviation      52.9559565    Variance         2804.33333
Skewness           2.00401884    Kurtosis         4.97027971
Uncorrected SS         466728    Corrected SS          67304
Coeff Variation    41.8955352    Std Error Mean   10.5911913

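Formula for interval: ŷ ± t(α/2, n-2) sqrt(MSE(1/n + (x0 - x̄)²/Sxx)), where ŷ = b0 + b1 x0 and x0 = 350

d.f. for the t-distribution 23

t-value from the table 2.069

Interval: (30.989 + 0.737(350)) ± 2.069 sqrt(139.881(1/25 + (350 - 129.528)²/118114.05))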

= 288.799 ± 16.443

= (272.356, 305.242)
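The interval can be reproduced from the output with a few lines of Python. This is a sketch under the usual simple linear regression formulas; Sxx is the Corrected SS for mass, the t value is the table value 2.069, and the variable names are illustrative.

```python
from math import sqrt

b0, b1 = 30.98933, 0.73660        # parameter estimates from PROC REG
mse, n = 139.88144, 25            # error mean square and sample size
x_bar, sxx = 129.528, 118114.05   # mean and Corrected SS of mass from PROC UNIVARIATE
t_crit = 2.069                    # t table value, 23 d.f., 95% confidence
x0 = 350.0

y_hat = b0 + b1 * x0
se_mean = sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / sxx))   # s.e. of the estimated mean response
half_width = t_crit * se_mean

print(f"{y_hat:.3f} +/- {half_width:.3f}")
print(f"({y_hat - half_width:.3f}, {y_hat + half_width:.3f})")
```

Most of the standard error comes from the (x0 - x̄)²/Sxx term, because 350 is far from the mean mass of 129.528.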


g. (6) Suppose that the investigator decided to drop the data value with x=387, because of its high influence. What effect would this have on the quantities below (that is, would they be larger, smaller, or virtually unchanged)?

A) the true variance of the estimated slope?

B) the mean squared error?

C) the confidence interval you computed in part f?

Justify your answers by referring to the appropriate plots and output.

A) The true variance of the estimated slope will be bigger because

Var(slope) = σ²/Sxx. With the largest x-value removed, Sxx will be smaller, so the variance will increase.

B) the mean squared error will be (approximately) unchanged because

the residual for that point is about average. (However, the refitted line might fit the other points better, so the SSE might drop proportionally more than the denominator (n-2) does, so I also accepted a reasonable argument that the MSE decreases.)

C) the confidence interval will be longer because (at least 3 reasons)

a) Sxx will be smaller

b) x̄ will be smaller (and it is already less than 350) so (350 - x̄)² will be bigger.

c) the d.f. for the t-value will go down, so the t-value will be bigger

d) 1/n will be bigger

e) MSE will be about the same

Of these items, a and b are usually the most important unless the sample size is small or the line radically changes.
