Homework 5, Statistics 112, Fall 2004
This homework is due Tuesday, October 19th at the beginning of class.
1. In most jurisdictions, driving an automobile with a blood alcohol level in excess of .08 is a felony. Because of a number of factors, it is difficult to provide guidelines on when it is safe for someone who has consumed alcohol to drive a car. In an experiment to examine the relationship between blood alcohol level and the weight of a drinker, 50 men of varying weights were each given three beers to drink and 1 hour later their blood alcohol level was measured. The data are stored in bloodalcohol.JMP on the web site.
(a) Fit a simple linear regression model to predict blood alcohol level based on weight. Check the assumptions of the simple linear regression model by constructing a residual plot and a normal quantile plot of the residuals. Do these plots indicate any problems with the assumptions of the simple linear regression model? If yes, what problems are indicated and what indicates them? If no, what indicates that there are no problems?
Solution:
Bivariate Fit of B/A Level By Weight
Linear Fit
B/A Level = 0.0331795 + 0.000225 Weight
Summary of Fit
RSquare / 0.174495
RSquare Adj / 0.157297
Root Mean Square Error / 0.013979
Mean of Response / 0.0774
Observations (or Sum Wgts) / 50
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 0.0331795 / 0.014023 / 2.37 / 0.0221
Weight / 0.000225 / 0.000071 / 3.19 / 0.0025
[Figure: distribution and normal quantile plot of the B/A Level residuals]
There is no obvious pattern in the residual plot: the mean of the residuals appears to be roughly zero across all ranges of X, and the spread of the residuals appears to be roughly constant. In the normal quantile plot, all points fall within the 95% confidence bands, so the normality assumption appears reasonable. Thus, there are no clear problems with the assumptions of the simple linear regression model.
For the rest of the problem, we will assume that the simple linear regression model holds in spite of any problems you may have found in part (a).
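As a rough illustration of what the residual check involves, the sketch below fits a least-squares line and lists the residuals in plain Python. The function name `fit_line` and the five data points are hypothetical and are not taken from bloodalcohol.JMP; the solution itself relies on JMP's plots.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y on x; returns intercept, slope, residuals, RMSE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # The RMSE uses n - 2 degrees of freedom (two estimated coefficients).
    rmse = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, residuals, rmse

# Hypothetical illustrative values, NOT the bloodalcohol.JMP data.
weights = [150, 170, 190, 210, 230]
bac = [0.065, 0.072, 0.074, 0.083, 0.086]

intercept, slope, residuals, rmse = fit_line(weights, bac)
for w, r in zip(weights, residuals):
    # Look for a trend in the mean or a change in spread as weight increases.
    print(w, round(r, 5))
```

Plotting these residuals against weight gives the residual plot; a curved trend or a fan shape would point to a problem with linearity or constant variance.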
(b) Give a 95% confidence interval for the amount by which the mean blood alcohol level changes for a one pound increase in weight.
Solution:
We need a 95% confidence interval for the slope, which is approximately the estimate plus or minus 2 standard errors:
(0.000225 - 2*0.000071, 0.000225 + 2*0.000071) = (0.000083, 0.000367)
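The interval arithmetic can be reproduced with a short sketch. The helper name `slope_interval` is hypothetical; the multiplier 2 is the rule of thumb used in the solution, and 2.0106 is assumed here as the tabulated 97.5th percentile of the t distribution with 48 degrees of freedom.

```python
def slope_interval(estimate, std_error, multiplier=2.0):
    """Return (lower, upper) for estimate +/- multiplier * std_error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

# Estimate and standard error of the Weight coefficient from the JMP output.
lower, upper = slope_interval(0.000225, 0.000071)
print(f"rule of thumb: ({lower:.6f}, {upper:.6f})")

# Assumed t quantile for 48 degrees of freedom (from a t table, not from JMP).
t_lower, t_upper = slope_interval(0.000225, 0.000071, multiplier=2.0106)
print(f"t quantile:    ({t_lower:.6f}, {t_upper:.6f})")
```

The two intervals agree to about six decimal places, which is why the rule-of-thumb multiplier is adequate with 50 observations.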
(c) Is there strong evidence that weight is associated with blood alcohol level? State hypotheses, give a p-value and state your conclusion.
Solution:
H0: blood alcohol level is not linearly related to weight (slope = 0).
H1: blood alcohol level is linearly related to weight (slope ≠ 0).
Using the t test, the p-value is 0.0025. Because the p-value is less than 0.05, we reject the null hypothesis. There is strong evidence that weight is associated with blood alcohol level.
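For readers who want to see where the p-value comes from, here is a minimal sketch that integrates the t density numerically with Simpson's rule. The function names are hypothetical, and the degrees of freedom (48 = 50 - 2) are an assumption based on the simple linear regression model; JMP computes the value directly.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log(1 + t * t / df))

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t_stat|]; steps must be even."""
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t_stat|)
    return max(0.0, 1 - 2 * area)

# t ratio for Weight from the JMP output, with n - 2 = 48 degrees of freedom.
p_value = two_sided_p(3.19, 48)
print(round(p_value, 4))
```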
2. Problem 1 continued.
(a) Calculate a 95% confidence interval for the mean blood alcohol level one hour later after drinking three beers for the population of 160 pound men.
Solution: We use JMP to find 95% confidence intervals for the mean response and 95% prediction intervals.
Bivariate Fit of B/A Level By Weight
Using the crosshair tool, the 95% confidence interval for the mean blood alcohol level one hour after drinking three beers for 160 pound men is approximately ( 0.063, 0.076)
(b) Steve is 160 pounds and thinks he can drive legally one hour after drinking three beers. Give a 95% prediction interval for Steve’s BAC. Given that driving with a blood alcohol level greater than .08 is illegal, can Steve be confident that he won’t be arrested if he drives and is stopped?
Solution: Using the crosshair tool, a 95% prediction interval for Steve’s BAC is approximately (0.040,0.098).
Because 0.08 is in the 95% prediction interval, Steve cannot be confident that he won’t be arrested if he drives.
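The crosshair readings can be cross-checked from the summary output alone. The sketch below is an approximation under stated assumptions: the mean weight and the sum of squares of weight are back-calculated from the rounded JMP output rather than taken from the data, and 2.0106 is assumed as the 97.5th percentile of the t distribution with 48 degrees of freedom. The function name `intervals_at` is hypothetical.

```python
import math

# Values reported in the JMP output for problem 1.
b0, b1 = 0.0331795, 0.000225
rmse = 0.013979
se_slope = 0.000071
n = 50
y_bar = 0.0774
t_crit = 2.0106  # assumed t quantile, 48 degrees of freedom

# Back-calculated quantities (approximate, because the inputs are rounded).
x_bar = (y_bar - b0) / b1          # the fitted line passes through (x_bar, y_bar)
sxx = (rmse / se_slope) ** 2       # from SE(slope) = RMSE / sqrt(Sxx)

def intervals_at(x0):
    """Fitted value, CI for the mean response, and prediction interval at x0."""
    fit = b0 + b1 * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = rmse * math.sqrt(leverage)
    se_pred = rmse * math.sqrt(1 + leverage)
    conf_int = (fit - t_crit * se_mean, fit + t_crit * se_mean)
    pred_int = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return fit, conf_int, pred_int

fit, conf_int, pred_int = intervals_at(160)
print("mean response CI:   ", tuple(round(v, 3) for v in conf_int))
print("prediction interval:", tuple(round(v, 3) for v in pred_int))
```

The prediction interval is much wider than the confidence interval because it includes the person-to-person variation around the mean (the extra 1 under the square root), and that is the interval relevant to an individual such as Steve.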
(c) The police want to establish guidelines for whether it is safe for a 160 pound man to drive one hour after drinking three beers. What would you advise the police based on the regression analysis?
Solution: I would expect the police to be conservative and to advise that it is safe for someone to drive only if it is unlikely that the person will have a blood alcohol level above 0.08. The 95% prediction interval for the blood alcohol level of a 160 pound man one hour after drinking three beers is (0.040, 0.098). Because this interval contains 0.08, a 160 pound man could plausibly have a blood alcohol level above 0.08 one hour after drinking three beers. I would advise the police to recommend that it is not safe for a 160 pound man to drive one hour after drinking three beers.
3. The data in wineheart.JMP are the average wine consumption rates (in liters per person) and number of ischemic heart disease deaths (per 1,000 men aged 55 to 64 years old) for 18 industrialized countries (Data from A.S. St Leger et al., “Factors Associated with Cardiac Mortality in Developed Countries with Particular Reference to the Consumption of Wine”, Lancet, 1979).
(a) Fit a simple linear regression to predict mortality from heart disease based on wine consumption. Construct a residual plot. What is the most obvious problem you see with the residual plot compared to what you would expect to see if the ideal simple linear regression model holds?
Solution:
Bivariate Fit of Heart Disease Mortality By Wine Consumption
Linear Fit
Heart Disease Mortality = 7.6865549 - 0.0760809 Wine Consumption
Summary of Fit
RSquare / 0.555872
RSquare Adj / 0.528114
Root Mean Square Error / 1.618923
Mean of Response / 6.433333
Observations (or Sum Wgts) / 18
Analysis of Variance
Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 52.485428 / 52.4854 / 20.0256 / 0.0004
Error / 16 / 41.934572 / 2.6209
C. Total / 17 / 94.420000
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 7.6865549 / 0.473322 / 16.24 / <.0001
Wine Consumption / -0.076081 / 0.017001 / -4.48 / 0.0004
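The mean of the residuals follows a U-shaped pattern across the range of wine consumption, which suggests that the relationship is curved rather than linear. If the ideal simple linear regression model held, the residual plot would show no pattern.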
(b) Using Tukey’s Bulging rule, try three appropriate transformations to try to achieve a better fit. Use the transformation of x to log(x) and y to log(y) as one of your transformations. Report the transformations you tried. Which achieves the best fit (explain the reason for your answer)?
Solution: I tried the following three transformations.
1. Transform X to log X and Y to log Y. The root mean square error measured on the original scale is: 1.6116274.
2. Transform X to and Y to . The root mean square error is: 1.475877.
3. Transform X to 1/X and Y to 1/Y. The root mean square error is: 3.0843388.
So, the transformation of X to and Y to has the smallest RMSE. It achieves the best fit.
For the remaining part of the problem, we use the transformation of x to log (x) and y to log(y).
Bivariate Fit of Heart Disease Mortality By Wine Consumption
Transformed Fit Log to Log
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
Summary of Fit
RSquare / 0.738433
RSquare Adj / 0.722085
Root Mean Square Error / 0.228537
Mean of Response / 1.78335
Observations (or Sum Wgts) / 18
Analysis of Variance
Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 2.3591756 / 2.35918 / 45.1698 / <.0001
Error / 16 / 0.8356647 / 0.05223
C. Total / 17 / 3.1948403
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 2.5555519 / 0.126897 / 20.14 / <.0001
Log(Wine Consumption) / -0.355596 / 0.052909 / -6.72 / <.0001
Fit Measured on Original Scale
Sum of Squared Error / 41.557487
Root Mean Square Error / 1.6116274
RSquare / 0.5598656
Sum of Residuals / 2.3201106
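The "Fit Measured on Original Scale" figures can be reproduced from the sums of squares. This is a sketch under two assumptions that the reported numbers are consistent with: the RMSE divides the sum of squared errors by n - 2 = 16, and the total sum of squares is the C. Total from the untransformed fit (94.42).

```python
import math

sse_original = 41.557487   # Sum of Squared Error on the original scale
sst_original = 94.420000   # C. Total from the untransformed linear fit
n = 18                     # number of countries

# RMSE on the original scale, using n - 2 degrees of freedom.
rmse_original = math.sqrt(sse_original / (n - 2))

# R-squared on the original scale: share of total variation explained.
r2_original = 1 - sse_original / sst_original

print(round(rmse_original, 4), round(r2_original, 4))
```

Comparing this RMSE with the 1.618923 of the untransformed fit puts the two models on the same scale, which is why the original-scale value is the one used to compare transformations in part (b).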
(c) Using the transformation of x to log (x) and y to log (y), which country’s heart disease mortality rate is most surprisingly high given its wine consumption rate? Which country’s heart disease mortality rate is most surprisingly low given its wine consumption rate? Using the rule of thumb that a point with a residual that is more than three root mean square errors away from zero is an outlier in the direction of the scatterplot, would you consider either of these two countries outliers in the direction of the scatterplot?
Solution:
From saving the residuals, we find that the country whose mortality rate is most surprisingly high given its wine consumption is Australia (residual = 3.03) and the country whose mortality rate is most surprisingly low given its wine consumption is Norway (residual = -2.73). The root mean square error of the fit measured on the original scale is 1.61. Neither of these countries has a residual that is more than three root mean square errors away from zero, so neither would be considered an outlier in the direction of the scatterplot using the rule of thumb.
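The rule-of-thumb check is a one-line comparison for each country. A minimal sketch using the residuals and RMSE quoted in the solution (the variable names are illustrative):

```python
rmse = 1.61                      # RMSE of the fit measured on the original scale
threshold = 3 * rmse             # rule of thumb: |residual| > 3 * RMSE

residuals = {"Australia": 3.03, "Norway": -2.73}

# Flag a country only if its residual is more than three RMSEs from zero.
flags = {country: abs(r) > threshold for country, r in residuals.items()}
for country, is_outlier in flags.items():
    print(country, residuals[country], "outlier" if is_outlier else "not an outlier")
```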
(d) Using the transformation of x to log (x) and y to log (y), predict the heart disease mortality rate for a country with a wine consumption of 6 liters per person.
Solution:
From the regression:
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
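So, the estimated heart disease mortality rate for a country with a wine consumption of 6 liters per person is exp(2.5555519 - 0.3555959 × log(6)) = exp(1.9184) ≈ 6.81 deaths per 1,000 men aged 55 to 64, where log is the natural logarithm.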
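As a sketch of how this prediction is computed, the fitted equation is evaluated on the log scale and then back-transformed with the exponential. This assumes that Log in the JMP output is the natural logarithm; the function name `predict_mortality` is hypothetical.

```python
import math

def predict_mortality(wine_liters):
    """Back-transformed prediction from the fitted log-log regression."""
    log_prediction = 2.5555519 - 0.3555959 * math.log(wine_liters)
    return math.exp(log_prediction)

prediction = predict_mortality(6)
print(round(prediction, 2))  # deaths per 1,000 men aged 55 to 64
```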
4. Problem 3 continued.
(a) Is there strong evidence that wine consumption is associated with heart disease mortality? State hypotheses, give a p-value and state your conclusion. If you found that there is strong evidence that wine consumption is associated with heart disease mortality, what is the direction of the association?
Solution:
Assuming the simple linear regression model holds for Y = log(heart disease mortality) and X = log(wine consumption), E(Y|X) = β0 + β1 X, we can test whether heart disease mortality is associated with wine consumption by testing whether the slope β1 is zero. The null hypothesis is H0: β1 = 0 and the alternative hypothesis is H1: β1 ≠ 0. The t statistic is -6.72 and the p-value is <0.0001, so we reject H0. There is strong evidence that wine consumption is associated with heart disease mortality. The estimated slope is negative, so higher wine consumption is associated with lower heart disease mortality.
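The t statistic quoted here is the slope estimate divided by its standard error, and in simple linear regression its square equals the F ratio in the Analysis of Variance table. A brief sketch using the rounded values from the JMP output (so agreement is only approximate):

```python
# Slope estimate and standard error for Log(Wine Consumption).
slope_estimate = -0.355596
slope_std_error = 0.052909

t_ratio = slope_estimate / slope_std_error
f_from_t = t_ratio ** 2            # should be close to the reported F Ratio

print(round(t_ratio, 2), round(f_from_t, 2))
```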
(b) Based on your regression analysis, your friend decides to drink more wine. Perhaps your friend is just using your regression analysis as an excuse, but anyhow, comment on whether your regression analysis justifies your friend’s decision to drink more wine. Discuss some additional data you would be interested in collecting to better understand the causal relationship between wine drinking and heart disease (see Section 2.5 of Moore and McCabe on Establishing Causation).
Solution: The regression analysis establishes a negative association between wine consumption and heart disease mortality, but it does not establish that more wine consumption causes lower heart disease mortality. An important lurking variable is diet. For example, countries that consume less wine might consume more red meat. It would be useful to collect additional data on diet in the different countries and to see whether there is still an association between heart disease mortality and wine consumption when diet is held fixed. It would also be good to check whether the association is consistent by studying the association between wine consumption and heart disease mortality in different regions and in individuals rather than in countries or regions.