4/26/02 252y0232 ECO252 QBA2 Name KEY
(Page layout view!) THIRD HOUR EXAM Hour of Class Registered (Circle)
April 18, 2002 MWF TR 10 12 12:30 2:00
I. (10+ points) Do all the following;
1. Hand in your computer printouts for problems 2 and 3.(5 points – 3 point penalty for not handing in). remember that the ANOVA printout must be completed, using a 5% significance level, for full credit. I should be able to tell what is tested and what are the conclusions.
2. a. In particular, is the interaction between car and driver significant? Which numbers made you think that? (2)
b. Create two confidence intervals for the difference between the means for drivers 2 and 3, one that is valid alone, and one that is valid simultaneously with other similar intervals. Do these intervals show a significant difference between these two means? Why? (4)
Solution: The only parts of the solution to computer problem 2 that you need are:
Tabulated Statistics
ROWS: car COLUMNS: driver
1 2 3 ALL
1 42.000 25.000 12.667 26.556
2 32.000 28.000 29.333 29.778
3 30.667 45.000 28.333 34.667
4 31.333 24.667 54.667 36.889
ALL 34.000 30.667 31.250 31.972
CELL CONTENTS -- mpg:MEAN
MTB > twoway 'mpg''car''driver'
Two-way Analysis of Variance
Analysis of Variance for mpg
Source DF SS MS
car 3 590.3 196.8
driver 2 76.1 38.0
Interaction 6 3227.9 538.0
Error 24 336.7 14.0
Total 35 4231.0
To complete the printout, divide through the MS column by and place the results in the in the column. Then look up the corresponding values of in 5% lines on the F table.
Source DF SS MS
car 3 590.3 196.8 14.057s Car means identical
driver 2 76.1 38.0 2.714ns Driver means identical
Interaction 6 3227.9 538.0 38.428s No interaction
Error 24 336.7 14.0
Total 35 4231.0
The first and the third null hypotheses are rejected. a) Since 38.428 is larger than 2.51, we reject the hypothesis that there is no interaction and say that there is significant interaction.
b) Drivers 2 and 4 are in the columns. There are rows, columns and measurements per cell. Of course the number of degrees of freedom for 'within' or 'error.'
From the outline, we have for Bonferroni confidence intervals for column means
. This becomes, for
This indicates no significant difference.
4/18/02 252y0232
For Scheffe intervals for column means use . So . This indicates no significant difference.
c. In your income and education regression,
(i) Explain what coefficients are significant and why? (2)
(ii) What income would you predict for someone with 3 years of education? (1)
(iii) Make a confidence interval for the income of someone with 3 years of education using some of the information generated by Minitab below. (2)
Descriptive Statistics
Variable N Mean Median TrMean StDev SEMean
Educ 32 12.000 12.000 12.071 4.363 0.771
Variable Min Max Q1 Q3
Educ 4.000 20.000 8.000 16.000
Column Sum of Squares
Sum of squares (uncorrected) of Educ = 5198.0
Solution: The relevant output is:
Regression Analysis
The regression equation is
Income = 5078 + 732 Educ
Predictor Coef Stdev t-ratio p
Constant 5078 1498 3.39 0.002
Educ 732.4 117.5 6.23 0.000
s = 2855 R-sq = 56.4% R-sq(adj) = 55.0%
i) So we can state that, since the p-values are both below .05, that both coefficients are significant at the 5% level.
ii) The regression can be written as or . So or .
iii) From the outline The Confidence Interval is , where and . If we use , we get .
Please note the following from the 252 home page:
The rule on p-value:
If the p-value is less than the significance level (alpha) reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.
Significance
This is a topic that was covered under hypothesis tests. Probably the first reference I made to this was even earlier when I said that a parameter is significant if it is not zero. I later said that a null hypothesis often says that a parameter or a difference between parameters is insignificant. If a result is significant we reject the null hypothesis.
To put this more generally, a result is (statistically) significant if it is larger or smaller than would be expected by chance alone. Thus in the case of a regression coefficient the measure of significance could be the p-value, which tells us the probability of getting our actual result or something more extreme if we assume that the population value of the coefficient is zero. If the p-value is small (below our significance level), then it is unlikely that our assumption about the coefficient is correct and we say that the coefficient is significant (or significantly different from zero). Of course, the various hypothesis tests that we have discussed here are also often ways of proving significance.
4/18/02 252y0232
II. Do at least 4 of the following 5 Problems (at least 10 each) (or do sections adding to at least 40 points - Anything extra you do helps, and grades wrap around) . Show your work! State and where applicable. Never say 'yes' or 'no' without a statistical test.
1. On the following pages there are printouts from two computer problems.
a. The One-way ANOVA Problem ( Albright, Winston, Zappe - abbreviated): An automobile parts producer
has instituted an employee empowerment program in five plants. Random samples of employees in each plant are asked to rate the success of the program on a 1 to 10 scale. 10 being the highest rating. They want to know if the program is being implemented with equal success at each plant and are thus looking to see if there is a significant difference between mean ratings at each plant. They are assuming that the results are distributed according to Normal distributions with similar variances.
(i) Indicate what hypothesis was tested, what the p-value was and whether, using the p-value, you would reject the null if (a) the significance level was 5% and (b) the significance level was 1%. Explain why. Does this mean that the success was equal in all plants? (3)
(ii) Do a 'normal' and a Scheffe confidence interval for the difference between the means in the two plants that were the least successful. Do these intervals indicate a difference in the success of the program between these two plants? Why? (4.5).
(iii) The printout gives 95% confidence intervals for the means for each plant. Find the numbers for the confidence interval for 'Midwest.' Why is this interval smaller than the others? (2.5)
(iv) I would question whether ANOVA was appropriate for this problem because there is no evidence that the underlying populations are Normally distributed. What method would I prefer for this problem? (1)
One-way ANOVA problem
Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\2X0232-1.MTW'.
Retrieving worksheet from file: C:\MINITAB\2X0232-1.MTW
Worksheet was saved on 4/ 9/2002
MTB > print c1-c5
Data Display
Row south midwest n-east s-west west
1 7 7 7 6 6
2 1 6 5 4 6
3 8 10 5 7 6
4 7 3 5 10 6
5 2 9 4 7 3
6 9 10 3 6 4
7 3 8 4 6 8
8 8 4 5 7 6
9 5 3 5 4 2
10 7 2 3 3 4
11 4 7 3 7 5
12 7 3 8 6
13 5 5 9 4
14 10 5 10 7
15 10 4 4
16 6 10 3
17 3 4 5
18 5 6 4
19 2 7
20 6 6
21 4 4
22 5
23 2
24 7
25 8
26 7
4/18/02 252y0232
MTB > AOVOneway c1 c2 c3 c4 c5.
One-Way Analysis of Variance
Analysis of Variance
Source DF SS MS F p
Factor 4 46.24 11.56 2.50 0.049
Error 85 393.55 4.63
Total 89 439.79
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ---+------+------+------+---
south 11 5.545 2.697 (------*------)
midwest 26 6.000 2.623 (------*------)
n-east 14 4.429 1.158 (------*------)
s-west 18 6.556 2.229 (------*------)
west 21 5.048 1.532 (------*------)
---+------+------+------+---
Pooled StDev = 2.152 3.6 4.8 6.0 7.2
Solution: a) (i) All one-way ANOVAs test for equality of the means of the populations represented by the columns, so is . The p-value is 4.9%, so we reject the null hypothesis at the 5% significance level, but not the 1% level. If we reject the null hypothesis we say that the success level was not the same at all the plants.
(ii) The Northeast and the West plants were the least successful. From the outline if we desire a single interval and we want the difference between means of column 1 and column 2. , where . This becomes
If we desire intervals that will simultaneously be valid for a given confidence level for all possible intervals between column means, use , which becomes since both these intervals include zero, there is no significant difference.
(iii) If we use the 'normal' formula for the difference between two means, we get
. It is the smallest interval because we divide the pooled standard deviation by the square root of , which is the largest of all the sample sizes.
b. The Regression Problem: This relates the number of shares in thousands to the age of board members of a corporation.
(i) Looking at significance tests and the value of R-squared, how successful is this regression? Why? Why shouldn't this surprise you? (3)
(ii) Note that c1 contains 'shares' and that c4 contains predicted values of 'shares.' Add a regression line to the graph. (1)
(ii) What equation relates the number of shares owned to the age of the board member? How many shares does it say that we should expect a 83-year old board member to own? Would you take this seriously? Why? (2)
4/18/02 252y0232
Regression Problem
Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\2X0232-5.MTW'.
Retrieving worksheet from file: C:\MINITAB\2X0232-5.MTW
Worksheet was saved on 4/12/2002
MTB > echo
MTB > Execute 'C:\MINITAB\252SOLS3.MTB' 1.
Executing from file: C:\MINITAB\252SOLS3.MTB
MTB > #252sols3
MTB > print c1 c2
Data Display
Row shares age
1 7.9 53
2 66.4 60
3 29.7 69
4 60.5 49
5 10.4 67
6 28.7 68
7 86.9 46
8 121.1 62
9 35.3 63
10 2.8 55
11 74.4 57
12 13.1 71
13 9.1 66
14 19.1 70
15 18.8 66
16 3.1 57
17 96.5 54
18 47.0 64
19 31.1 56
MTB > plot c1*c2 (plot omitted)
MTB > regress c1 on 1 c2 c3 c4
Regression Analysis
The regression equation is
shares = 153 - 1.86 age
Predictor Coef Stdev t-ratio p
Constant 152.95 64.82 2.36 0.031
age -1.860 1.061 -1.75 0.098
s = 33.01 R-sq = 15.3% R-sq(adj) = 10.3%
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 3348 3348 3.07 0.098
Error 17 18522 1090
Total 18 21870
Unusual Observations
Obs. age shares Fit Stdev.Fit Residual St.Resid
8 62.0 121.10 37.65 7.70 83.45 2.60R
R denotes an obs. with a large st. resid.
MTB > plot c4*c2 (plot omitted)
MTB > plot c4*c2 c1*c2;
SUBC> symbol;
SUBC> type 3 1;
SUBC> color 8 9;
SUBC> overlay.
MTB > end
4/18/02 252y0232
Solution: b ) (i) This is a very unsuccessful regression - surely the author could have found a better predictor of the number of shares owned than age! is very small on a zero to one scale and the p-value for the slope is above 5%. The regression seems to say that the number of shares owned declines as the board member gets older. I see no reason why this should be true.
(ii) To add a regression line, just connect the x's.
(iii) The regression equation says shares = 153 - 1.86 age. If a board member is 83 Of course, you can't own negative shares, and the fact that the oldest board member is 71 might lead us to feel that we have exceeded our competence. Basically the low leaves us unsure whether we should take any of its results seriously.
4/18/02 252y0232
2. A researcher believes that the data below has a Normal distribution with a mean of 80 and a standard deviation of 5. For your convenience the values of are computed for you.
a. Use a chi-squared test to find out if the distribution is correct. (9)
b. Is there a better way to do this problem than chi-squared? Why? Do it. (5)
c. Assume that, instead of using population means given above, we actually checked the data and found that and How would this change what we did in a)? (1)
d. Assume that, instead of using population means given above, we actually checked the data and found that and How would this change what we did in b)? (1)
Observed
x interval z interval Frequency
below 74 below -1.2 23
74-78 -1.2 to -0.4 53
78-82 -0.4 to 0.4 52
82-86 0.4 to 1.2 46
86-90 1.2 to 2.0 24
above 90 above 2.0 2
200
Solution: a) We find the cumulative distribution of , , and use it to find the frequency .We then find , where is the cumulative probability. In the first column and . is the difference between successive values of . For example,