Master of Applied Statistics

Applied Statistics Comprehensive Exam

January 2017

SOLUTIONS

Directions: This is a closed book exam with a 3-hour time limit. Attached you will find the relevant computer output, two pages of formulas, and tables for the normal, t, $\chi^2$, and F distributions. You may use a non-programmable, non-graphing calculator.

Answer Only Five of the Six Questions.

1) In a Statistics class of 50 students, each student drew a random sample of size 10 from the standard normal distribution. Each student calculated the sample mean and sample standard deviation using the 10 observations they drew.

  1. The sample mean and standard deviation calculated by a student, John, using the 10 observations he drew, should be close to what values?

The sample mean should be close to 0 and sample standard deviation should be close to 1.

  1. The standard error of the sample mean calculated by a student, John, should be close to what value?

The standard error of the sample mean should be close to $\sigma/\sqrt{n} = 1/\sqrt{10} \approx 0.316$.

  1. John conducted a one-sample t-test using the 10 observations he drew to examine whether the population mean is zero. See the SAS output handout page 1, labeled “Problem 1 output”. What should the values of A and B be in the output? Write down the null and alternative hypotheses and make a conclusion about the hypothesis test based on the result. Report and carefully interpret the 95% confidence interval for the mean.

Hence A = 0.54 and B = 9, where 9 = n - 1 = 10 - 1 is the degrees of freedom for the t-test.

Let $\mu$ represent the population mean. The null and alternative hypotheses for this t-test can be written as $H_0: \mu = 0$ versus $H_a: \mu \neq 0$.

Based on the output, we do not reject the null hypothesis and conclude that there is no evidence that the population mean is different from 0. A 95% confidence interval for the mean is (-0.56, 0.54). Interpretation: We can be 95% confident that the true population mean is at least -0.56 and at most +0.54.

  1. Then the students recorded all 50 sample means calculated by the 50 students. The standard deviation of these 50 sample means should be close to what value?

The mean of the sample means should be close to the population mean, which is 0. The standard deviation of the 50 sample means should be close to the standard error of the mean, $\sigma/\sqrt{n} = 1/\sqrt{10} \approx 0.316$.
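As a rough check on this reasoning, the following sketch simulates one class of 50 students and compares the standard deviation of their sample means with $\sigma/\sqrt{n}$. The seed and variable names are arbitrary choices made for illustration; they are not part of the exam.

```python
import math
import random
import statistics

random.seed(2017)  # arbitrary seed so the run is repeatable

N_STUDENTS = 50  # students in the class
N_OBS = 10       # observations each student draws from N(0, 1)

# Theoretical standard deviation of a sample mean: sigma / sqrt(n)
theoretical_sd = 1 / math.sqrt(N_OBS)

# Each student draws 10 standard normal values and records the sample mean
sample_means = []
for _ in range(N_STUDENTS):
    sample = [random.gauss(0, 1) for _ in range(N_OBS)]
    sample_means.append(statistics.mean(sample))

# Standard deviation of the 50 recorded sample means
observed_sd = statistics.stdev(sample_means)

print(f"theoretical SD of sample means: {theoretical_sd:.3f}")
print(f"observed SD of 50 sample means: {observed_sd:.3f}")
```

With only 50 means the observed value will vary from run to run, but it should land reasonably close to the theoretical value.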

  1. Using the 10 observations they drew, each student calculated a 95% confidence interval. How many students do you expect would have their 95% confidence interval include 0?

We would expect 50 × 0.95 = 47.5, that is, about 47 or 48 of the 50 students' 95% confidence intervals, to cover 0.
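The expected count can also be illustrated by simulation. The sketch below repeatedly draws 10 standard normal observations, builds a t-based 95% confidence interval, and records how often the interval contains 0. The critical value 2.262 is the t table value for 9 df; the seed and the number of replications are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(50)  # arbitrary seed so the run is repeatable

N_OBS = 10
T_CRIT = 2.262   # t table value, upper-tail area 0.025, 9 df
N_REPS = 20000   # simulated students, far more than 50, to estimate coverage


def ci_covers_zero():
    """Draw one sample of 10 and report whether its 95% CI contains 0."""
    sample = [random.gauss(0, 1) for _ in range(N_OBS)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N_OBS)
    return mean - T_CRIT * se <= 0 <= mean + T_CRIT * se


coverage = sum(ci_covers_zero() for _ in range(N_REPS)) / N_REPS
expected_in_class = 50 * 0.95  # expected number of covering intervals among 50

print(f"simulated coverage: {coverage:.3f}")
print(f"expected covering intervals in a class of 50: {expected_in_class}")
```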

  1. John and Mary want to compare the samples they randomly drew (10 observations for each student). Due to the small sample size, they would like to consider a non-parametric approach. Identify a nonparametric approach that would be appropriate for this setting and describe the null/alternative hypotheses. What are the pros and cons of using the suggested nonparametric approach versus implementing a two sample t-test?

The Wilcoxon rank sum test can be used to test the following hypotheses. H0: The two populations are identical. H1: The two populations are shifted from each other. The test statistic is the sum of the ranks in sample 1. The advantage of the non-parametric Wilcoxon rank sum test is that it is more robust if the data violate the normality assumption. The disadvantage is that it is less powerful than the two-sample t-test if the normality assumption fits the data.
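The rank sum statistic described above can be sketched as follows. The two samples are simulated stand-ins for John's and Mary's draws (their actual values are not given), ties are assumed absent because the data are continuous, and the P-value uses the large-sample normal approximation without a continuity correction rather than exact tables.

```python
import math
import random

random.seed(1)  # arbitrary seed; these samples are hypothetical stand-ins

john = [random.gauss(0, 1) for _ in range(10)]
mary = [random.gauss(0, 1) for _ in range(10)]


def rank_sum(sample1, sample2):
    """Sum of the ranks of sample1 within the combined sample (no ties assumed)."""
    combined = sorted(sample1 + sample2)
    rank = {value: i + 1 for i, value in enumerate(combined)}
    return sum(rank[v] for v in sample1)


n1, n2 = len(john), len(mary)
w = rank_sum(john, mary)

# Mean and standard deviation of W under H0 (identical populations)
mean_w = n1 * (n1 + n2 + 1) / 2
sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Normal approximation to the two-sided P-value
z = (w - mean_w) / sd_w
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"W = {w}, z = {z:.2f}, approximate two-sided P-value = {p_two_sided:.3f}")
```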

2) The article “Who Wants Airbags” in Chance 18 (2005): 3-16 discusses whether air bags should be mandatory equipment in all new automobiles. The data shown in the table below were obtained from accidents in which there was a harmful event (personal or property), and from which at least one vehicle was towed.

Outcome / Air Bag Installed: Yes / Air Bag Installed: No / Total
Killed / 200 / 300 / 500
Survived / 60,000 / 40,000 / 100,000
Total / 60,200 / 40,300 / 100,500

  1. Calculate the estimated probabilities of being killed in a harmful event car accident for a vehicle with and without air bags, and the difference of these two proportions.

The estimated probability of being killed in a harmful event car accident in a vehicle with air bags is $200/60{,}200 = 0.0033$, and in a vehicle without air bags it is $300/40{,}300 = 0.0074$. The difference of the two proportions is 0.0074 - 0.0033 = 0.0041.

  1. Calculate the estimated odds of being killed in a harmful event car accident for a vehicle with and without air bags.

The odds of an event are the probability of the event happening divided by the probability of the event not happening. The odds of being killed in a harmful event car accident for a vehicle with air bags are $200/60{,}000 = 0.0033$. The odds of being killed in a harmful event car accident for a vehicle without air bags are $300/40{,}000 = 0.0075$.

  1. Calculate the estimated odds ratio of being killed in a harmful event car accident with air bags vs. without air bags. What does the ratio tell you about the importance of having air bags in a vehicle?

The odds ratio is the ratio of the two estimated odds: $\frac{200/60{,}000}{300/40{,}000} = 0.4444$. The estimated odds of being killed in a harmful event accident in a car with air bags are 0.44 times the odds in a car without air bags, that is, about 56% lower.
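The proportions, odds, and odds ratio can be reproduced directly from the cell counts in the table. The following is a minimal sketch; the variable names are illustrative only.

```python
# Cell counts from the air bag table
killed_with, survived_with = 200, 60_000
killed_without, survived_without = 300, 40_000

# Estimated probabilities of being killed: killed / column total
p_with = killed_with / (killed_with + survived_with)
p_without = killed_without / (killed_without + survived_without)
diff = p_without - p_with

# Estimated odds of being killed: killed / survived
odds_with = killed_with / survived_with
odds_without = killed_without / survived_without

# Odds ratio, with air bags versus without
odds_ratio = odds_with / odds_without

print(f"P(killed | air bag)    = {p_with:.4f}")
print(f"P(killed | no air bag) = {p_without:.4f}")
print(f"difference             = {diff:.4f}")
print(f"odds with / without    = {odds_with:.4f} / {odds_without:.4f}")
print(f"odds ratio             = {odds_ratio:.4f}")
```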

  1. Calculate the estimated expected cell counts if survival status is independent of air bag installation.

The expected cell counts, each computed as (row total × column total) / grand total, are as follows:

Outcome / Air Bag Installed: Yes / Air Bag Installed: No / Total
Killed / 299.5 / 200.5 / 500
Survived / 59,900.5 / 40,099.5 / 100,000
Total / 60,200 / 40,300 / 100,500

  1. Suggest a hypothesis testing procedure to determine whether survival status is independent of air bag installation. Write down the null hypothesis and the test statistic, approximate the P-value, and give a careful interpretation.

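$H_0$: survival status is independent of air bag installation vs. $H_a$: survival status and air bag installation are associated.

The test statistic is the Pearson chi-square statistic, written as: $TS = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ and $E_{ij}$ are the observed and expected cell counts.

$TS = \frac{(200 - 299.5)^2}{299.5} + \frac{(300 - 200.5)^2}{200.5} + \frac{(60{,}000 - 59{,}900.5)^2}{59{,}900.5} + \frac{(40{,}000 - 40{,}099.5)^2}{40{,}099.5} = 82.85$, with $(2-1)(2-1) = 1$ degree of freedom.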

The P-value (< 0.0001) is less than the significance level (0.05), so we reject the null hypothesis and conclude that the data provide strong evidence that air bag installation is associated with survival status in harmful event car accidents.
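The expected counts and the Pearson chi-square statistic can be recomputed from the observed table. This sketch uses only the standard library; for 1 degree of freedom the upper-tail chi-square probability equals `erfc(sqrt(x / 2))`, which is used here in place of a table lookup.

```python
import math

# Observed counts: rows = Killed, Survived; columns = air bag Yes, No
observed = [[200, 300], [60_000, 40_000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over the four cells
chi_sq = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# Upper-tail P-value for a chi-square distribution with 1 df
p_value = math.erfc(math.sqrt(chi_sq / 2))

print("expected counts:", [[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {chi_sq:.2f}, P-value = {p_value:.3g}")
```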

3) Athletes are constantly seeking measures of the degree of their cardiovascular fitness prior to a major race. One such measure of fitness is the time to exhaustion from running on a treadmill at a specified angle and speed. The important question is then “Does this measure of cardiovascular fitness translate into performance in a 10-km running race?” Twenty experienced distance runners who professed to be at top condition were evaluated on the treadmill and then had their times recorded in a 10-km race. Twenty data pairs are given: Treadmill Time (X, measured in minutes) and 10-km Time (Y, measured in minutes). These data were used to compute the least-squares regression line. See the SAS output on the output handout page 2, labeled “problem 3 fitness data”.

  1. Refer to the output. Does a linear model seem appropriate?

The data appear to fall approximately along a downward-sloping line with fairly constant error variance; hence a linear model seems appropriate. There does not appear to be a need for a more complex model.

  1. From the output, obtain the estimated linear regression model $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, and interpret the regression coefficients.

The estimated intercept ($\hat\beta_0$) is 59.92 and the estimated slope ($\hat\beta_1$) is -1.96. The negative slope corresponds to the downward-sloping line in the scatter plot. $\hat\beta_0$ is the estimated expected time to run a 10-km race for a runner who lasts 0 minutes on the treadmill. For $\hat\beta_1$, we conclude that for each 1-minute increase in the treadmill exhaustion time, we expect a decrease of 1.96 minutes in the average 10-km race time.

  1. Identify the standard error of $\hat\beta_1$, and calculate a 95% confidence interval for $\beta_1$.

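The estimated standard error of $\hat\beta_1$ is 0.316. Because n is 20, there are 20 - 2 = 18 df for error. The required table value from a t-distribution with 18 df for $\alpha/2 = 0.025$ is 2.101. The corresponding 95% confidence interval for the true value of $\beta_1$ is then $-1.96 \pm 2.101(0.316) = -1.96 \pm 0.664$, or $(-2.62, -1.30)$.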
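The interval arithmetic can be checked with a few lines of code. The sketch below uses the rounded slope estimate (-1.96) reported for the fitted model, the standard error 0.316, and the table value 2.101; working from unrounded SAS output could shift the last digit.

```python
# Values taken from the solution text (rounded as reported there)
slope = -1.96     # estimated slope
se_slope = 0.316  # estimated standard error of the slope
t_crit = 2.101    # t table value, upper-tail area 0.025, 18 df

# 95% confidence interval: estimate +/- t * SE
margin = t_crit * se_slope
lower, upper = slope - margin, slope + margin

print(f"margin of error = {margin:.3f}")
print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")
```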

  1. Is there strong evidence that the true mean time to run a 10-km race decreases with the treadmill run time? Formulate hypotheses and give a test statistic, P-value, and careful interpretation.

In the model, the null hypothesis is $H_0: \beta_1 = 0$ versus $H_a: \beta_1 < 0$. The output shows a t statistic of t = -6.19 with a two-sided P-value < 0.0001. Since the negative estimate supports this one-sided $H_a$, the one-sided P-value is half the two-sided P-value, and hence also < 0.0001. There is strong evidence to conclude that the true mean amount of time needed to run a 10-km race decreases with the time to exhaustion on a treadmill.

  1. Find the residual standard deviation. What does this value indicate about the fitted regression line?

The residual standard deviation is 1.921 based on the SAS output. About 95% of the prediction errors will be less than 2(1.921) = 3.842 minutes in absolute value, so a predicted race time using this regression on treadmill time will be accurate to approximately ± 3.8 minutes.

  1. What should the value of A, B, C be?

The value of A should be 1; B should be 18; C should be 0.681.

  1. What is the Pearson correlation between time on a treadmill and 10-km race?

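The Pearson correlation between time on a treadmill and 10-km race time is $r = -\sqrt{R^2} = -\sqrt{0.681} \approx -0.82$; the sign is negative because the fitted slope is negative.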

  1. Using the information obtained from the output, comment on how well the time on treadmill explains the performance in a 10-km race.

Using the information from the output, we conclude that the time to exhaustion on a treadmill explains about 68.1% of the variation in the time needed to run a 10-km race. The Pearson correlation of about -0.82 indicates a strong negative linear association between time on a treadmill and 10-km race time.
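In simple linear regression the Pearson correlation is the square root of $R^2$ carrying the sign of the fitted slope, so here it is negative. A minimal sketch using the rounded values reported in this solution (0.681 and -1.96):

```python
import math

r_squared = 0.681  # coefficient of determination (value C above)
slope = -1.96      # estimated slope; only its sign is used here

# Pearson r: magnitude sqrt(R^2), sign taken from the slope
r = math.copysign(math.sqrt(r_squared), slope)

print(f"Pearson correlation r = {r:.3f}")
```

Because $R^2$ is rounded, the third decimal of r may differ slightly from a value computed from unrounded output.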

4) Y = blood coagulation time (s) versus diet was studied in cats. Four diets were to be compared (A, B, C, D) in a one-factor ANOVA with multiple comparisons. Sample sizes for the four diets were slightly unequal: 4, 6, 6, and 8 cats were used under diets A, B, C, D respectively. Refer to the output labeled Problem 4 (pages 3-5).

  1. What are the full formal assumptions of the analysis, and are they believable here? Cite results in the output to support your opinions.
  2. Interpret the ANOVA F = 13.57 result (P < .0001).
  3. Carefully interpret the results of the multiple comparison analysis (this is Tukey’s HSD method).
  4. A colleague opines that instead of the Tukey method, Fisher’s “protected LSD” method should be used for multiple comparisons here. Discuss the advantages and disadvantages of these two methods.

5) An experiment was conducted to investigate the pharmacokinetics (changes in drug concentration over time) in the liver of a rat. After some time, the amount y of the drug in the liver was measured as a proportion of the administered dose. A multiple regression of y on x1=body weight, x2=liver weight, and x3=dose was conducted. Refer to the output labeled Problem 5 (pages 6-14).

  1. Comment on the believability of the formal regression assumptions. Also, find any influential observations (see the last page) and assess collinearity.
  2. Regardless of any concerns you may have, interpret the overall F-test for the regression and also the individual significance tests for each predictor.
  3. How well will this regression predict y (specifically, what would be the approximate margin of error for a 95% prediction interval if it were calculated at a typical x-value combination?)
  4. What would you do next with this data?

6) An animal researcher is interested in studying the cholesterol level (y = LDL) in dogs based on their breed and diet. They randomly select 3 of the many possible breeds to study, using 9 dogs of each breed. For diet, the researcher will use the most highly recommended dry dog-food, the most highly recommended wet dog-food, and the most highly recommended vegetarian dog-food. The nine dogs in each breed will be divided randomly among the three diets and their cholesterol level will be measured after one year.

  1. Give the model equation you would use for this analysis (e.g. yijk = etc etc), being sure to identify the variables and parameters used in terms of the problem. What are the assumptions of your model?
  2. The possible df and expected mean squares formulas for this model are given below. If we are testing for the main effect of Diet, based on your answer to (a) give the formula for the appropriate F statistic in terms of the MS (e.g. MS___/MS___) and state what its degrees of freedom would be.

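Source / df / Both Random / Random Breed, Fixed Diet
Breed / 2 / $\sigma^2 + 9\sigma^2_{\text{Breed}} + 3\sigma^2_{\text{Breed*Diet}}$ / $\sigma^2 + 9\sigma^2_{\text{Breed}} + 3\sigma^2_{\text{Breed*Diet}}$
Diet / 2 / $\sigma^2 + 9\sigma^2_{\text{Diet}} + 3\sigma^2_{\text{Breed*Diet}}$ / $\sigma^2 + Q(\text{Diet}) + 3\sigma^2_{\text{Breed*Diet}}$
Breed*Diet / 4 / $\sigma^2 + 3\sigma^2_{\text{Breed*Diet}}$ / $\sigma^2 + 3\sigma^2_{\text{Breed*Diet}}$
Error / 18 / $\sigma^2$ / $\sigma^2$

Source / df / Fixed Breed, Random Diet / Both Fixed
Breed / 2 / $\sigma^2 + Q(\text{Breed}) + 3\sigma^2_{\text{Breed*Diet}}$ / $\sigma^2 + Q(\text{Breed})$
Diet / 2 / $\sigma^2 + 9\sigma^2_{\text{Diet}} + 3\sigma^2_{\text{Breed*Diet}}$ / $\sigma^2 + Q(\text{Diet})$
Breed*Diet / 4 / $\sigma^2 + 3\sigma^2_{\text{Breed*Diet}}$ / $\sigma^2 + Q(\text{Breed*Diet})$
Error / 18 / $\sigma^2$ / $\sigma^2$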

  1. What impact would there be on the analysis if the interaction term in the model is found to be significant, and what would this mean about LDL in dogs?
  2. What impact would there be on the analysis if it turns out they are only able to use 3 dogs from each breed due to a lack of people willing to foster the animals as part of the study?
