4/26/02 252y0232 ECO252 QBA2 Name KEY

(Page layout view!) THIRD HOUR EXAM Hour of Class Registered (Circle)

April 18, 2002 MWF TR 10 12 12:30 2:00

I. (10+ points) Do all of the following:

1. Hand in your computer printouts for problems 2 and 3. (5 points – 3 point penalty for not handing in.) Remember that the ANOVA printout must be completed, using a 5% significance level, for full credit. I should be able to tell what is tested and what the conclusions are.

2. a. In particular, is the interaction between car and driver significant? Which numbers made you think that? (2)

b. Create two confidence intervals for the difference between the means for drivers 2 and 3, one that is valid alone, and one that is valid simultaneously with other similar intervals. Do these intervals show a significant difference between these two means? Why? (4)

Solution: The only parts of the solution to computer problem 2 that you need are:

Tabulated Statistics

ROWS: car    COLUMNS: driver

             1        2        3      ALL
  1     42.000   25.000   12.667   26.556
  2     32.000   28.000   29.333   29.778
  3     30.667   45.000   28.333   34.667
  4     31.333   24.667   54.667   36.889
ALL     34.000   30.667   31.250   31.972

CELL CONTENTS -- mpg:MEAN

MTB > twoway 'mpg''car''driver'

Two-way Analysis of Variance

Analysis of Variance for mpg

Source DF SS MS

car 3 590.3 196.8

driver 2 76.1 38.0

Interaction 6 3227.9 538.0

Error 24 336.7 14.0

Total 35 4231.0

To complete the printout, divide each entry in the MS column by the error mean square, $MSW = 14.0$, and place the results in the $F$ column. Then look up the corresponding critical values of $F$ on the 5% lines of the F table.

Source         DF       SS      MS        F       F.05   H0
car             3    590.3   196.8   14.057 s     3.01   Car means identical
driver          2     76.1    38.0    2.714 ns    3.40   Driver means identical
Interaction     6   3227.9   538.0   38.428 s     2.51   No interaction
Error          24    336.7    14.0
Total          35   4231.0

(s = significant at the 5% level, ns = not significant.)

The first and the third null hypotheses are rejected. a) Since 38.428 is larger than 2.51, we reject the hypothesis that there is no interaction and say that there is significant interaction.
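The completion of the printout can be checked with a short sketch. It assumes the mean squares from the Minitab table and takes the 5% critical values (3.01, 3.40 and 2.51) from a standard F table rather than computing them; the variable names are illustrative only.

```python
# Mean squares copied from the Minitab two-way ANOVA printout.
mean_squares = {"car": 196.8, "driver": 38.0, "Interaction": 538.0}
ms_error = 14.0  # error (within) mean square, 24 degrees of freedom

# 5% critical values read from an F table (assumed, not computed here):
# F(3,24), F(2,24) and F(6,24).
f_critical = {"car": 3.01, "driver": 3.40, "Interaction": 2.51}

results = {}
for source, ms in mean_squares.items():
    f_ratio = ms / ms_error
    significant = f_ratio > f_critical[source]
    results[source] = (f_ratio, significant)
    verdict = "reject H0" if significant else "do not reject H0"
    print(f"{source:12s} F = {f_ratio:7.3f}  F.05 = {f_critical[source]:.2f}  {verdict}")
```

Only the driver effect fails to exceed its critical value, which matches the 's' and 'ns' marks in the completed table.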

b) Drivers 2 and 3 are in the columns. There are $R = 4$ rows, $C = 3$ columns and $P = 3$ measurements per cell. Of course $RC(P-1) = 4(3)(2) = 24$ is the number of degrees of freedom for 'within' or 'error.'

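From the outline, we have for Bonferroni confidence intervals for column means

$\mu_{\cdot 2} - \mu_{\cdot 3} = (\bar{x}_{\cdot 2} - \bar{x}_{\cdot 3}) \pm t_{\alpha/(2m)}^{(RC(P-1))} \sqrt{\frac{2MSW}{PR}}$. This becomes, for $m = 1$ (a single interval), $\alpha = .05$ and $t_{.025}^{(24)} = 2.064$,

$\mu_{\cdot 2} - \mu_{\cdot 3} = (30.667 - 31.250) \pm 2.064\sqrt{\frac{2(14.0)}{3(4)}} = -0.583 \pm 3.153$.

The interval contains zero, which indicates no significant difference.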

4/18/02 252y0232

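For Scheffe intervals for column means use $\mu_{\cdot 2} - \mu_{\cdot 3} = (\bar{x}_{\cdot 2} - \bar{x}_{\cdot 3}) \pm \sqrt{(C-1)F_{.05}^{(C-1,\,RC(P-1))}}\sqrt{\frac{2MSW}{PR}}$. So, with $C = 3$, $R = 4$, $P = 3$, $MSW = 14.0$ and $F_{.05}^{(2,24)} = 3.40$, $\mu_{\cdot 2} - \mu_{\cdot 3} = -0.583 \pm \sqrt{2(3.40)}\sqrt{\frac{2(14.0)}{3(4)}} = -0.583 \pm 3.983$. This interval contains zero, which indicates no significant difference.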
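Both driver intervals can be reproduced in a few lines. This is only a sketch: it assumes the usual two-way ANOVA forms for a difference between column means, and the table values $t_{.025}^{(24)} = 2.064$ and $F_{.05}^{(2,24)} = 3.40$ are typed in by hand rather than computed. The names are illustrative.

```python
import math

R, C, P = 4, 3, 3            # rows (cars), columns (drivers), measurements per cell
ms_within = 14.0             # error mean square from the printout
mean_driver2, mean_driver3 = 30.667, 31.250

t_crit = 2.064               # t(.025, 24 df), from a t table (assumed)
f_crit = 3.40                # F(.05; 2, 24), from an F table (assumed)

diff = mean_driver2 - mean_driver3
std_err = math.sqrt(2 * ms_within / (P * R))

half_single = t_crit * std_err                         # interval valid alone
half_scheffe = math.sqrt((C - 1) * f_crit) * std_err   # valid simultaneously

for label, half in (("individual", half_single), ("Scheffe", half_scheffe)):
    low, high = diff - half, diff + half
    print(f"{label:10s}: {diff:.3f} +/- {half:.3f} -> ({low:.3f}, {high:.3f}), "
          f"contains zero: {low < 0 < high}")
```

The Scheffe half-width is the larger of the two, as expected for an interval that must hold simultaneously with all other column contrasts.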

c. In your income and education regression,

(i) Explain which coefficients are significant and why. (2)

(ii) What income would you predict for someone with 3 years of education? (1)

(iii) Make a confidence interval for the income of someone with 3 years of education using some of the information generated by Minitab below. (2)

Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean

Educ 32 12.000 12.000 12.071 4.363 0.771

Variable Min Max Q1 Q3

Educ 4.000 20.000 8.000 16.000

Column Sum of Squares

Sum of squares (uncorrected) of Educ = 5198.0

Solution: The relevant output is:

Regression Analysis

The regression equation is

Income = 5078 + 732 Educ

Predictor Coef Stdev t-ratio p

Constant 5078 1498 3.39 0.002

Educ 732.4 117.5 6.23 0.000

s = 2855 R-sq = 56.4% R-sq(adj) = 55.0%

i) Since the p-values are both below .05, we can state that both coefficients are significant at the 5% level.

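ii) The regression can be written as $\hat{Y} = 5078 + 732X$ or $\hat{Y} = 5078 + 732.4X$. So $\hat{Y} = 5078 + 732(3) = 7274$ or $\hat{Y} = 5078 + 732.4(3) = 7275.2$.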

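iii) From the outline, the confidence interval is $\mu_{Y_0} = \hat{Y}_0 \pm t_{\alpha/2}^{(n-2)} s_{\hat{Y}}$, where $s_{\hat{Y}}^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x}\right]$ and $SS_x = \sum X^2 - n\bar{X}^2 = 5198 - 32(12)^2 = 590$. If we use $t_{.025}^{(30)} = 2.042$, we get $s_{\hat{Y}} = 2855\sqrt{\frac{1}{32} + \frac{(3-12)^2}{590}} = 1172.1$ and $\mu_{Y_0} = 7275 \pm 2.042(1172.1) = 7275 \pm 2393$.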
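The arithmetic for parts ii) and iii) can be sketched as follows. The sketch assumes the interval wanted is the confidence interval for the mean income at 3 years of education (a prediction interval for one person would add 1 inside the square root), and it takes $t_{.025}^{(30)} = 2.042$ from a t table. All variable names are illustrative.

```python
import math

n = 32
mean_educ = 12.0
sum_sq_educ = 5198.0      # uncorrected sum of squares of Educ from Minitab
b0, b1 = 5078.0, 732.4    # fitted intercept and slope
s_e = 2855.0              # standard error of the regression (s)
x0 = 3.0                  # years of education of interest
t_crit = 2.042            # t(.025, 30 df), from a t table (assumed)

ss_x = sum_sq_educ - n * mean_educ ** 2          # corrected sum of squares
y_hat = b0 + b1 * x0                             # point prediction
s_fit = s_e * math.sqrt(1 / n + (x0 - mean_educ) ** 2 / ss_x)
half_width = t_crit * s_fit

print(f"SSx = {ss_x:.1f}, predicted income = {y_hat:.1f}")
print(f"interval for the mean: {y_hat:.0f} +/- {half_width:.0f}")
```

Because 3 years lies well below the minimum observed education of 4 years, the term $(X_0 - \bar{X})^2/SS_x$ dominates the width of the interval.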

Please note the following from the 252 home page:

The rule on p-value:

If the p-value is less than the significance level (alpha) reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.
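As a small illustration of this rule, the sketch below applies it to the p-values printed in the income regression. The function name is hypothetical and not part of any package.

```python
def reject_null(p_value: float, alpha: float = 0.05) -> bool:
    """Reject H0 only when the p-value is strictly below alpha."""
    return p_value < alpha


# p-values copied from the income and education regression printout.
for label, p in (("Constant", 0.002), ("Educ", 0.000)):
    decision = "significant" if reject_null(p) else "not significant"
    print(f"{label}: p = {p:.3f} -> {decision} at the 5% level")
```

Note that a p-value exactly equal to the significance level does not lead to rejection.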

Significance

This is a topic that was covered under hypothesis tests. Probably the first reference I made to this was even earlier when I said that a parameter is significant if it is not zero. I later said that a null hypothesis often says that a parameter or a difference between parameters is insignificant. If a result is significant we reject the null hypothesis.

To put this more generally, a result is (statistically) significant if it is larger or smaller than would be expected by chance alone. Thus in the case of a regression coefficient the measure of significance could be the p-value, which tells us the probability of getting our actual result or something more extreme if we assume that the population value of the coefficient is zero. If the p-value is small (below our significance level), then it is unlikely that our assumption about the coefficient is correct and we say that the coefficient is significant (or significantly different from zero). Of course, the various hypothesis tests that we have discussed here are also often ways of proving significance.



II. Do at least 4 of the following 5 problems (at least 10 points each), or do sections adding to at least 40 points. Anything extra you do helps, and grades wrap around. Show your work! State $H_0$ and $H_1$ where applicable. Never say 'yes' or 'no' without a statistical test.

1. On the following pages there are printouts from two computer problems.

a. The One-way ANOVA Problem (Albright, Winston, Zappe - abbreviated): An automobile parts producer has instituted an employee empowerment program in five plants. Random samples of employees in each plant are asked to rate the success of the program on a 1 to 10 scale, 10 being the highest rating. They want to know if the program is being implemented with equal success at each plant and are thus looking to see if there is a significant difference between mean ratings at each plant. They are assuming that the results are distributed according to Normal distributions with similar variances.

(i) Indicate what hypothesis was tested, what the p-value was and whether, using the p-value, you would reject the null if (a) the significance level was 5% and (b) the significance level was 1%. Explain why. Does this mean that the success was equal in all plants? (3)

(ii) Do a 'normal' and a Scheffe confidence interval for the difference between the means in the two plants that were the least successful. Do these intervals indicate a difference in the success of the program between these two plants? Why? (4.5).

(iii) The printout gives 95% confidence intervals for the means for each plant. Find the numbers for the confidence interval for 'Midwest.' Why is this interval smaller than the others? (2.5)

(iv) I would question whether ANOVA was appropriate for this problem because there is no evidence that the underlying populations are Normally distributed. What method would I prefer for this problem? (1)

One-way ANOVA problem

Worksheet size: 100000 cells

MTB > RETR 'C:\MINITAB\2X0232-1.MTW'.

Retrieving worksheet from file: C:\MINITAB\2X0232-1.MTW

Worksheet was saved on 4/ 9/2002

MTB > print c1-c5

Data Display

Row    south  midwest   n-east   s-west     west
  1        7        7        7        6        6
  2        1        6        5        4        6
  3        8       10        5        7        6
  4        7        3        5       10        6
  5        2        9        4        7        3
  6        9       10        3        6        4
  7        3        8        4        6        8
  8        8        4        5        7        6
  9        5        3        5        4        2
 10        7        2        3        3        4
 11        4        7        3        7        5
 12                 7        3        8        6
 13                 5        5        9        4
 14                10        5       10        7
 15                10                 4        4
 16                 6                10        3
 17                 3                 4        5
 18                 5                 6        4
 19                 2                          7
 20                 6                          6
 21                 4                          4
 22                 5
 23                 2
 24                 7
 25                 8
 26                 7


MTB > AOVOneway c1 c2 c3 c4 c5.

One-Way Analysis of Variance

Analysis of Variance

Source DF SS MS F p

Factor 4 46.24 11.56 2.50 0.049

Error 85 393.55 4.63

Total 89 439.79

Individual 95% CIs For Mean

Based on Pooled StDev

Level N Mean StDev ---+------+------+------+---

south 11 5.545 2.697 (------*------)

midwest 26 6.000 2.623 (------*------)

n-east 14 4.429 1.158 (------*------)

s-west 18 6.556 2.229 (------*------)

west 21 5.048 1.532 (------*------)

---+------+------+------+---

Pooled StDev = 2.152 3.6 4.8 6.0 7.2

Solution: a) (i) All one-way ANOVAs test for equality of the means of the populations represented by the columns, so $H_0$ is $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$. The p-value is 4.9%, so we reject the null hypothesis at the 5% significance level, but not at the 1% level. If we reject the null hypothesis we say that the success level was not the same at all the plants.
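The ANOVA table can be recomputed directly from the ratings. In this sketch the assignment of the ragged rows of the data display to plants is inferred from the sample sizes N in the printout (11, 26, 14, 18 and 21), which is an assumption; the names are illustrative.

```python
# Ratings by plant, read from the Minitab data display (column assignment
# for the short rows is inferred from the reported sample sizes).
ratings = {
    "south":   [7, 1, 8, 7, 2, 9, 3, 8, 5, 7, 4],
    "midwest": [7, 6, 10, 3, 9, 10, 8, 4, 3, 2, 7, 7, 5, 10, 10, 6, 3, 5,
                2, 6, 4, 5, 2, 7, 8, 7],
    "n-east":  [7, 5, 5, 5, 4, 3, 4, 5, 5, 3, 3, 3, 5, 5],
    "s-west":  [6, 4, 7, 10, 7, 6, 6, 7, 4, 3, 7, 8, 9, 10, 4, 10, 4, 6],
    "west":    [6, 6, 6, 6, 3, 4, 8, 6, 2, 4, 5, 6, 4, 7, 4, 3, 5, 4, 7, 6, 4],
}

all_values = [x for group in ratings.values() for x in group]
n_total, m = len(all_values), len(ratings)
grand_mean = sum(all_values) / n_total
group_means = {name: sum(g) / len(g) for name, g in ratings.items()}

# Between-plant (Factor) and within-plant (Error) sums of squares.
ss_between = sum(len(g) * (group_means[name] - grand_mean) ** 2
                 for name, g in ratings.items())
ss_within = sum((x - group_means[name]) ** 2
                for name, g in ratings.items() for x in g)

ms_between = ss_between / (m - 1)
ms_within = ss_within / (n_total - m)
f_ratio = ms_between / ms_within

print(f"Factor: DF = {m - 1}, SS = {ss_between:.2f}, MS = {ms_between:.2f}")
print(f"Error:  DF = {n_total - m}, SS = {ss_within:.2f}, MS = {ms_within:.2f}")
print(f"F = {f_ratio:.2f}")
```

The sums of squares and the F ratio agree with the printout to two decimals; the p-value itself needs an F distribution function or a table and is not computed here.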

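(ii) The Northeast and the West plants were the least successful. From the outline, if we desire a single interval for the difference between the means of two columns, say columns 1 and 2, we use $\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}^{(n-m)} s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, where $s = \sqrt{MSW} = 2.152$. With $t_{.025}^{(85)} \approx 1.988$ this becomes $\mu_3 - \mu_5 = (4.429 - 5.048) \pm 1.988(2.152)\sqrt{\frac{1}{14} + \frac{1}{21}} = -0.619 \pm 1.476$.

If we desire intervals that will simultaneously be valid for a given confidence level for all possible intervals between column means, use $\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{(m-1)F_{\alpha}^{(m-1,\,n-m)}}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, which, with $F_{.05}^{(4,85)} \approx 2.48$, becomes $\mu_3 - \mu_5 = -0.619 \pm \sqrt{4(2.48)}(2.152)\sqrt{\frac{1}{14} + \frac{1}{21}} = -0.619 \pm 2.339$. Since both these intervals include zero, there is no significant difference.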
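A minimal sketch of the two intervals for the Northeast and West plants follows. It uses the means, sample sizes and pooled standard deviation from the printout; the critical values $t_{.025}^{(85)} \approx 1.988$ and $F_{.05}^{(4,85)} \approx 2.48$ are approximate table values entered by hand, so the last digit of each half-width is an assumption.

```python
import math

mean_ne, n_ne = 4.429, 14     # n-east
mean_w, n_w = 5.048, 21       # west
pooled_sd = 2.152             # square root of the error mean square
m = 5                         # number of plants (columns)

t_crit = 1.988                # approx. t(.025, 85 df) (assumed table value)
f_crit = 2.48                 # approx. F(.05; 4, 85) (assumed table value)

diff = mean_ne - mean_w
std_err = pooled_sd * math.sqrt(1 / n_ne + 1 / n_w)

half_single = t_crit * std_err
half_scheffe = math.sqrt((m - 1) * f_crit) * std_err

for label, half in (("single", half_single), ("Scheffe", half_scheffe)):
    low, high = diff - half, diff + half
    print(f"{label:8s}: {diff:.3f} +/- {half:.3f} -> ({low:.3f}, {high:.3f}), "
          f"contains zero: {low < 0 < high}")
```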

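(iii) If we use the 'normal' formula for a confidence interval for a single mean, with the pooled standard deviation and $t_{.025}^{(85)} \approx 1.988$, we get $\mu_2 = \bar{x}_2 \pm t_{.025}^{(85)}\frac{s}{\sqrt{n_2}} = 6.000 \pm 1.988\frac{2.152}{\sqrt{26}} = 6.000 \pm 0.839$. It is the smallest interval because we divide the pooled standard deviation by the square root of $n_2 = 26$, which is the largest of all the sample sizes.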
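To see why the Midwest interval is the narrowest, the half-widths for all five plants can be compared. This sketch assumes the intervals in the printout use the pooled standard deviation with 85 degrees of freedom and an approximate table value $t_{.025}^{(85)} \approx 1.988$; names are illustrative.

```python
import math

pooled_sd = 2.152
t_crit = 1.988    # approx. t(.025, 85 df) (assumed table value)

# (mean, sample size) for each plant, from the printout.
plants = {
    "south": (5.545, 11),
    "midwest": (6.000, 26),
    "n-east": (4.429, 14),
    "s-west": (6.556, 18),
    "west": (5.048, 21),
}

half_widths = {name: t_crit * pooled_sd / math.sqrt(n)
               for name, (mean, n) in plants.items()}

for name, (mean, n) in plants.items():
    h = half_widths[name]
    print(f"{name:8s} n = {n:2d}: {mean:.3f} +/- {h:.3f} -> "
          f"({mean - h:.3f}, {mean + h:.3f})")
```

Since every interval uses the same pooled standard deviation, the half-width depends only on the sample size, so the plant with the largest sample gets the shortest interval.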

b. The Regression Problem: This relates the number of shares in thousands to the age of board members of a corporation.

(i) Looking at significance tests and the value of R-squared, how successful is this regression? Why? Why shouldn't this surprise you? (3)

(ii) Note that c1 contains 'shares' and that c4 contains predicted values of 'shares.' Add a regression line to the graph. (1)

(iii) What equation relates the number of shares owned to the age of the board member? How many shares does it say that we should expect an 83-year-old board member to own? Would you take this seriously? Why? (2)



Regression Problem

Worksheet size: 100000 cells

MTB > RETR 'C:\MINITAB\2X0232-5.MTW'.

Retrieving worksheet from file: C:\MINITAB\2X0232-5.MTW

Worksheet was saved on 4/12/2002

MTB > echo

MTB > Execute 'C:\MINITAB\252SOLS3.MTB' 1.

Executing from file: C:\MINITAB\252SOLS3.MTB

MTB > #252sols3

MTB > print c1 c2

Data Display

Row shares age

1 7.9 53

2 66.4 60

3 29.7 69

4 60.5 49

5 10.4 67

6 28.7 68

7 86.9 46

8 121.1 62

9 35.3 63

10 2.8 55

11 74.4 57

12 13.1 71

13 9.1 66

14 19.1 70

15 18.8 66

16 3.1 57

17 96.5 54

18 47.0 64

19 31.1 56

MTB > plot c1*c2 (plot omitted)

MTB > regress c1 on 1 c2 c3 c4

Regression Analysis

The regression equation is

shares = 153 - 1.86 age

Predictor Coef Stdev t-ratio p

Constant 152.95 64.82 2.36 0.031

age -1.860 1.061 -1.75 0.098

s = 33.01 R-sq = 15.3% R-sq(adj) = 10.3%

Analysis of Variance

SOURCE DF SS MS F p

Regression 1 3348 3348 3.07 0.098

Error 17 18522 1090

Total 18 21870

Unusual Observations

Obs. age shares Fit Stdev.Fit Residual St.Resid

8 62.0 121.10 37.65 7.70 83.45 2.60R

R denotes an obs. with a large st. resid.

MTB > plot c4*c2 (plot omitted)

MTB > plot c4*c2 c1*c2;

SUBC> symbol;

SUBC> type 3 1;

SUBC> color 8 9;

SUBC> overlay.

MTB > end



Solution: b) (i) This is a very unsuccessful regression; surely the author could have found a better predictor of the number of shares owned than age! $R^2 = 0.153$ is very small on a zero to one scale and the p-value for the slope (0.098) is above 5%, so the slope is not significant at the 5% level. The regression seems to say that the number of shares owned declines as the board member gets older. I see no reason why this should be true.

(ii) To add a regression line, just connect the x's.

(iii) The regression equation says shares = 153 - 1.86 age. If a board member is 83, shares = 153 - 1.86(83) = -1.38 (thousands). Of course, you can't own negative shares, and the fact that the oldest board member in the sample is 71 might lead us to feel that we have exceeded our competence. Basically the low $R^2$ leaves us unsure whether we should take any of the regression's results seriously.
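The fitted line and the extrapolation to age 83 can be checked with an ordinary least-squares sketch on the 19 data points from the data display. The variable names are illustrative, and the code fits only the simple regression of shares on age.

```python
# Shares (thousands) and age of the 19 board members, from the data display.
shares = [7.9, 66.4, 29.7, 60.5, 10.4, 28.7, 86.9, 121.1, 35.3, 2.8,
          74.4, 13.1, 9.1, 19.1, 18.8, 3.1, 96.5, 47.0, 31.1]
age = [53, 60, 69, 49, 67, 68, 46, 62, 63, 55,
       57, 71, 66, 70, 66, 57, 54, 64, 56]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(shares) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, shares))
s_xx = sum((x - mean_x) ** 2 for x in age)
ss_total = sum((y - mean_y) ** 2 for y in shares)

b1 = s_xy / s_xx                    # slope
b0 = mean_y - b1 * mean_x           # intercept
r_squared = b1 * s_xy / ss_total    # regression SS / total SS

predicted_83 = b0 + b1 * 83         # extrapolation beyond the oldest age (71)

print(f"shares = {b0:.2f} + ({b1:.3f}) age, R-sq = {100 * r_squared:.1f}%")
print(f"predicted shares at age 83: {predicted_83:.2f} thousand")
```

The negative prediction arises because 83 is far outside the observed age range of 46 to 71.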



2. A researcher believes that the data below have a Normal distribution with a mean of 80 and a standard deviation of 5. For your convenience the values of $z$ are computed for you.

a. Use a chi-squared test to find out if the distribution is correct. (9)

b. Is there a better way to do this problem than chi-squared? Why? Do it. (5)

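c. Assume that, instead of using the population mean and standard deviation given above, we actually checked the data and estimated them by the sample mean $\bar{x}$ and the sample standard deviation $s$. How would this change what we did in a)? (1)

d. Assume that, instead of using the population mean and standard deviation given above, we actually checked the data and estimated them by the sample mean $\bar{x}$ and the sample standard deviation $s$. How would this change what we did in b)? (1)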

                                  Observed
x interval     z interval         Frequency
below 74       below -1.2             23
74-78          -1.2 to -0.4           53
78-82          -0.4 to 0.4            52
82-86          0.4 to 1.2             46
86-90          1.2 to 2.0             24
above 90       above 2.0               2
Total                                200

Solution: a) We find the cumulative distribution of $z$, $F_e$, and use it to find the expected frequency $E$. We then find $E = n f_e$, where $F_e$ is the cumulative probability. In the first column $z = \frac{x - \mu}{\sigma}$ with $\mu = 80$ and $\sigma = 5$. $f_e$ is the difference between successive values of $F_e$. For example,