5/01/03 252y0342 Introduction

This is long, but that’s because I gave a relatively thorough explanation of everything that I did. When I worked it out in the classroom, it was much, much shorter.

The easiest sections were probably 1a, 2a, 2b, 2c, 7b, 7c, and 7d in Part I. There are some very easy sections in Part I if you just follow suggestions and look at p-values and R-squared.

After doing the above, I might have computed the spare parts in 3b in Part II, and not recomputed them every time I needed them in Problems 4 and 5! Only then would I have tried the multiple regression.

Many of you seem to have no idea what a statistical test is. We have been doing them every day. The most common examples of this were in part II.

Problem 1b: Many of you computed the two sample proportions and said, in effect, "Hey look! They’re different! Whoopee!" And you think that you will get credit for this? Whether these two proportions come from the same population or not, chances are they will be somewhat different. You need one of the three statistical tests shown in the solution to show that they are significantly different.

Problem 2c: Many of you started by computing a sample variance. This is fine, and you got some credit for knowing how to compute a sample variance, though you probably had already computed the numerator somewhere else in this exam. But then you told me that this wasn’t 200. Where was your test?

5/06/03 252y0342 ECO252 QBA2 Name: KEY

FINAL EXAM Hour of Class Registered (Circle) May 7, 2003

I. (18 points) Do all the following. Note that answers without reasons receive no credit.

A researcher wishes to explain the selling price of a house (in thousands) on the basis of its assessed valuation, whether it was new, and the time period. New is 1 if the house is new construction, zero otherwise. The researcher assembles the following data for a random sample of 30 home sales. Use $\alpha = .10$ in this problem. That’s why p-values above .10 mean that the null hypothesis of insignificance is not rejected.

---------- 4/25/2003 9:58:00 PM ----------

Welcome to Minitab, press F1 for help.

MTB > Retrieve "C:\Documents and Settings\RBOVE\My Documents\Drive D\MINITAB\2x0342-1.MTW".

Retrieving worksheet from file: C:\Documents and Settings\RBOVE\My Documents\Drive D\MINITAB\2x0342-1.MTW

# Worksheet was saved on Fri Apr 25 2003

Results for: 2x0342-1.MTW

MTB > print c1 - c4

Data Display

Row Price Value New Time

1 69.00 66.28 0 1

2 115.50 86.31 0 2

3 100.80 84.78 1 2

4 96.90 79.74 1 3

5 72.00 65.54 0 4

6 61.90 59.93 0 4

7 97.00 79.98 1 4

8 87.50 75.22 0 5

9 96.90 81.88 1 5

10 81.50 72.94 0 5

11 69.34 60.80 0 6

12 97.90 81.61 1 6

13 96.00 79.11 0 7

14 92.00 77.96 0 9

15 94.10 78.17 1 10

16 101.90 80.24 1 10

17 109.50 85.88 1 10

18 88.65 74.03 0 11

19 93.00 75.27 0 11

20 83.00 74.31 0 11

21 106.70 84.36 0 12

22 97.90 77.90 1 12

23 97.30 79.85 1 12

24 90.50 74.92 0 12

25 95.90 79.07 1 12

26 113.90 85.61 0 13

27 94.50 76.50 1 14

28 86.50 72.78 0 14

29 91.50 72.43 0 17

30 93.75 76.64 0 17

1. Looking for a place to start, the researcher does individual regressions of price against the individual independent variables.

a. Explain why the researcher concludes from the regressions that valuation (‘value’) is the most important independent variable. Consider the values of $R^2$ and the significance tests on the slope of the equation. (2)

b. What kind of variable is ‘new’? Explain why the regression of ‘price’ against ‘new’ is equivalent to a test of the equality of 2 sample means, and what the conclusion would be. (2)


Solution: a) Note that the p-values for the first two regressions are below 10%, indicating a significant slope. The regression against ‘Time,’ however, shows a p-value above 10% for the slope, an indication that time is a poor explanatory variable. If we compare the regression against ‘Value’ with the regression against ‘New,’ we find a much higher R-sq for ‘Value,’ which indicates that this independent variable does a much better job of explaining ‘Price’ than ‘New.’

b) ‘New’ is a dummy variable: it is one when something is true and zero when it is false. The equation Price = 88.5 + 9.93 New only gives us two values for ‘Price.’ If the house is not new, the predicted price is 88.5 (thousand), and if the house is new, using the values from the ‘Coef’ column, the predicted price is 88.458 + 9.926 = 98.4 (thousand). Remember that a regression predicts the average value of a dependent variable for a given value of the independent variable, so these are predicted means for old and new houses respectively, and the fact that the p-value is .031, below 10%, indicates that the difference is significant.
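The link between the dummy-variable regression and the two sample means can be checked directly. The following is a minimal sketch in plain Python (the variable names are my own, not Minitab's): it fits Price on ‘New’ by ordinary least squares using the 30 prices from the data display and compares the coefficients with the group means.

```python
# Prices (in thousands) from the data display, split by the 'New' dummy.
old = [69.00, 115.50, 72.00, 61.90, 87.50, 81.50, 69.34, 96.00, 92.00,
       88.65, 93.00, 83.00, 106.70, 90.50, 113.90, 86.50, 91.50, 93.75]
new = [100.80, 96.90, 97.00, 96.90, 97.90, 94.10, 101.90, 109.50,
       97.90, 97.30, 95.90, 94.50]


def mean(values):
    return sum(values) / len(values)


# Ordinary least squares of Price on the 0/1 dummy.
x = [0] * len(old) + [1] * len(new)
y = old + new
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# The intercept is the mean price of old houses, and the slope is the
# difference between the mean price of new and old houses.
print(round(intercept, 3), round(mean(old), 3))
print(round(slope, 3), round(mean(new) - mean(old), 3))
```

The intercept and slope agree with the ‘Coef’ column of the Price versus New printout, which is why the t test on the slope is a test of the difference between the two means.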

MTB > regress c1 1 c2

Regression Analysis: Price versus Value

The regression equation is

Price = - 44.2 + 1.78 Value

Predictor Coef SE Coef T P

Constant -44.172 7.346 -6.01 0.000

Value 1.78171 0.09546 18.66 0.000

The p-value below .10 shows us that the slope is significant.

S = 3.475 R-Sq = 92.6% R-Sq(adj) = 92.3%

Analysis of Variance

Source DF SS MS F P

Regression 1 4206.7 4206.7 348.37 0.000

Residual Error 28 338.1 12.1

Total 29 4544.8

Unusual Observations

Obs Value Price Fit SE Fit Residual St Resid

6 59.9 61.900 62.606 1.719 -0.706 -0.23 X

11 60.8 69.340 64.156 1.642 5.184 1.69 X

X denotes an observation whose X value gives it large influence.

MTB > regress c1 1 c3

Regression Analysis: Price versus New

The regression equation is

Price = 88.5 + 9.93 New

Predictor Coef SE Coef T P

Constant 88.458 2.759 32.07 0.000

New 9.926 4.362 2.28 0.031

S = 11.70 R-Sq = 15.6% R-Sq(adj) = 12.6%

Analysis of Variance

Source DF SS MS F P

Regression 1 709.3 709.3 5.18 0.031

Residual Error 28 3835.5 137.0

Total 29 4544.8

Unusual Observations

Obs New Price Fit SE Fit Residual St Resid

2 0.00 115.50 88.46 2.76 27.04 2.38R

6 0.00 61.90 88.46 2.76 -26.56 -2.33R

26 0.00 113.90 88.46 2.76 25.44 2.24R

R denotes an observation with a large standardized residual


MTB > regress c1 1 c4

Regression Analysis: Price versus Time

The regression equation is

Price = 86.4 + 0.698 Time

Predictor Coef SE Coef T P

Constant 86.355 4.942 17.47 0.000

Time 0.6980 0.5057 1.38 0.178

The p-value above .10 shows us that the slope is insignificant.

S = 12.33 R-Sq = 6.4% R-Sq(adj) = 3.0%

Analysis of Variance

Source DF SS MS F P

Regression 1 289.6 289.6 1.91 0.178

Residual Error 28 4255.2 152.0

Total 29 4544.8

Unusual Observations

Obs Time Price Fit SE Fit Residual St Resid

2 2.0 115.50 87.75 4.07 27.75 2.38R

6 4.0 61.90 89.15 3.27 -27.25 -2.29R

R denotes an observation with a large standardized residual

MTB > regress c1 2 c2 c4;

SUBC> dw;

SUBC> vif.

2. The researcher now adds time. Compare this regression with the regression with Value alone. Are the coefficients significant? Does this explain the variation in price better than the regression with value alone? What would the predicted selling price be for an old house with a valuation of 80 in time 17? (3)

Solution: a) This is working out beautifully. R-sq rose, as it almost always does. R-sq adjusted went up, which is better news. The low p-value associated with the ANOVA indicates a generally successful regression. The low p-value on the coefficient of ‘Time’ (.008 is less than 1%) indicates that the coefficient is highly significant. Both other coefficients are even more significant, as shown by their lower p-values.

b) The regression equation is Price = - 45.0 + 1.75 Value + 0.368 Time, so, for the values given above Price = - 45.0 + 1.75(80) + 0.368(17)= -45.0 + 140 + 6.3 = 101.3 (thousand).
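The substitution above is easy to mistype on a calculator, so here is a minimal sketch of the same arithmetic in Python. The function name `predict_price` is my own; the coefficients are the rounded ones from the regression equation.

```python
def predict_price(value, time):
    """Fitted Price (in thousands) from the rounded Value-and-Time equation."""
    return -45.0 + 1.75 * value + 0.368 * time


# An old house with a valuation of 80 in time period 17.
price = predict_price(80, 17)
print(round(price, 1))
```

Using the unrounded coefficients from the ‘Coef’ column would move the answer slightly, which is worth remembering when comparing predictions from different equations.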

Regression Analysis: Price versus Value, Time

The regression equation is

Price = - 45.0 + 1.75 Value + 0.368 Time

Predictor Coef SE Coef T P VIF

Constant -44.988 6.553 -6.87 0.000

Value 1.75060 0.08576 20.41 0.000 1.0

Time 0.3680 0.1281 2.87 0.008 1.0

S = 3.097 R-Sq = 94.3% R-Sq(adj) = 93.9%

Analysis of Variance

Source DF SS MS F P

Regression 2 4285.8 2142.9 223.46 0.000

Residual Error 27 258.9 9.6

Total 29 4544.8

Source DF Seq SS

Value 1 4206.7

Time 1 79.2


Unusual Observations

Obs Value Price Fit SE Fit Residual St Resid

2 86.3 115.500 106.842 1.385 8.658 3.13R

11 60.8 69.340 63.656 1.474 5.684 2.09R

20 74.3 83.000 89.146 0.680 -6.146 -2.03R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 2.73

3. The researcher now adds the variable ‘new’. Remember that there is nothing wrong with a negative coefficient unless there is some reason why it should not be negative.

a. What two reasons would I find to doubt that this regression is an improvement on the regression with just value and time by just looking at the t tests and the sign of the coefficients? What does the change in adjusted $R^2$ tell me about this regression? (3)

b. We have done 5 ANOVA’s so far. What was the null hypothesis in these ANOVA’s and what does the one where the null hypothesis was accepted tell us? (2)

c. What selling price does this equation predict for an old home with a valuation of 80 in time 17? What percentage difference is this from the selling price predicted in the regression with just time and value? (2)

d. The last two regressions have a Durbin-Watson statistic computed. What did this test for, what should our conclusion be, and why is it important? (3)

e. The column marked VIF (variance inflation factor) is a test for (multi)collinearity. The rule of thumb is that if any of these exceeds 5, we have a multicollinearity problem. None does. What is multicollinearity and why am I worried about it? (2)

f. Do an F test to show whether the regression with ‘value’, ‘time’ and ‘new’ is an improvement over the regression with ‘value’ alone. (3)

Solution: a) Note that the p-value of the coefficient of ‘New’ is well above 10%, indicating that the coefficient is not significant. The negative sign on ‘New’ may also give us problems, since we usually expect new construction to be more expensive. It is possible, however, that the assessors are systematically giving higher valuations to newer housing. R-sq did go up, but it almost always goes up. R-sq adjusted fell, indicating that the explanation is not really better.
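To see why R-sq can rise while R-sq adjusted falls, it may help to recompute the adjusted figure from the ANOVA tables in the printouts. This is a sketch using the usual formula, adjusted R-sq = 1 - [SSE/(n-k-1)]/[SST/(n-1)]; the function name is my own.

```python
def adjusted_r_sq(sse, sst, n, k):
    """Adjusted R-squared from error SS, total SS, sample size n and k predictors."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Residual Error SS and Total SS as printed in the two ANOVA tables.
adj_value_time = adjusted_r_sq(258.9, 4544.8, n=30, k=2)
adj_value_time_new = adjusted_r_sq(250.7, 4544.8, n=30, k=3)

print(round(100 * adj_value_time, 1), round(100 * adj_value_time_new, 1))
```

The error sum of squares drops only from 258.9 to 250.7, which is not enough to pay for the degree of freedom that the extra variable uses up.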

b) All but one of the five ANOVAs gave us low p-values, indicating that there is a linear relation between the independent variables and the dependent variable. The high p-value on the regression against ‘Time’ alone indicates that this variable alone is no better than the average value of ‘Price’ in predicting ‘Price.’

c) Remember that we previously predicted that a home with a valuation of 80 (thousand) in time 17 would have Price = 101.3 (thousand). Our new equation gives Price = - 47.7 + 1.79 Value + 0.351 Time - 1.22 New = - 47.7 + 1.79(80) + 0.351(17) - 1.22(0) = -47.7 + 143.2 + 6.0 = 101.5 (thousand), a change in price of less than 0.2%.
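The percentage comparison can be sketched as follows, assuming (as the solution does) that both predictions are made from the rounded regression equations and rounded to one decimal place before they are compared. The variable names are my own.

```python
# Predictions for an old house (New = 0), valuation 80, time 17.
two_variable = round(-45.0 + 1.75 * 80 + 0.368 * 17, 1)
three_variable = round(-47.7 + 1.79 * 80 + 0.351 * 17 - 1.22 * 0, 1)

# Percentage change relative to the two-variable prediction.
pct_change = 100 * (three_variable - two_variable) / two_variable
print(two_variable, three_variable, round(pct_change, 2))
```

The result depends on how much rounding is done first; the unrounded ‘Coef’ values give a somewhat larger difference, though still well under 1%.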

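d) The Durbin-Watson statistic tests for a pattern (first-order autocorrelation) in the residuals. If there is autocorrelation, we could make a better prediction by including time patterns of price variation. In this case $n = 30$ and $k = 3$, giving us, on the 5% table in the text, $d_L = 1.21$ and $d_U = 1.65$, while the regression printout says “Durbin-Watson statistic = 2.60”. The diagram in the outline reads

0        d_L       d_U        2       4-d_U     4-d_L       4
+---------+---------+---------+---------+---------+---------+
 positive      ?        none      none       ?     negative

Since the value given by Minitab is between $d_L = 1.21$ and $4 - d_L = 2.79$, it is in neither rejection zone and we do not have significant autocorrelation. (Strictly, 2.60 lies in the uncertain zone between $4 - d_U = 2.35$ and $4 - d_L = 2.79$.) In the previous regression, the D-W statistic was 2.73, with $n = 30$ and $k = 2$ giving us, on the 5% table in the text, $d_L = 1.28$ and $d_U = 1.57$. Here $4 - d_L = 2.72$, so 2.73 sits right at the edge of the negative-autocorrelation zone; this is borderline at best and not clear evidence of autocorrelation.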
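Reading the Durbin-Watson diagram is mechanical once the two table values are known, so it can be sketched as a small function. This is an illustration only: the function name and zone labels are my own, and the lower and upper limits are the 5% table values quoted above, typed in by hand.

```python
def durbin_watson_zone(d, d_lower, d_upper):
    """Place a Durbin-Watson statistic on the 0-to-4 diagram."""
    if d < d_lower:
        return "positive autocorrelation"
    if d <= d_upper:
        return "inconclusive"
    if d < 4 - d_upper:
        return "no significant autocorrelation"
    if d <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"


# Three-variable regression: table limits 1.21 and 1.65.
zone_three_vars = durbin_watson_zone(2.60, 1.21, 1.65)
# Two-variable regression: table limits 1.28 and 1.57.
zone_two_vars = durbin_watson_zone(2.73, 1.28, 1.57)
print(zone_three_vars, "|", zone_two_vars)
```

With these limits 2.60 lands in the upper uncertain band, and 2.73 exceeds 4 - 1.28 = 2.72 by only 0.01, so the second case is a borderline call rather than strong evidence of negative autocorrelation.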

e) Multicollinearity is close correlation between the independent variables and makes accurate values of the coefficients hard to get.


f) The current regression has the ANOVA:

Source DF SS MS F P

Regression 3 4294.0 1431.3 148.42 0.000

Residual Error 26 250.7 9.6

Total 29 4544.8

Source DF Seq SS

Value 1 4206.7

Time 1 79.2

New 1 8.2

The regression against ‘Value’ alone gave:

Source DF SS MS F P

Regression 1 4206.7 4206.7 348.37 0.000

Residual Error 28 338.1 12.1

Total 29 4544.8

We can itemize the regression sum of squares in the current regression by using either the sequential sum of squares in the current regression or looking at the regression sum of squares in the ‘Value’ alone regression. The two added variables contribute 79.2 + 8.2 = 87.4 with 2 degrees of freedom.

Source              DF       SS        MS        F     F(.05)
Value                1    4206.7    4206.7    438.20
2 more variables     2      87.4      43.7      4.55    3.37
Residual Error      26     250.7       9.6
Total               29    4544.8

The appropriate F test tests 4.55 against $F_{.05}^{(2,26)} = 3.37$ from the F table. Since our computed F is larger than the table F, we reject the null hypothesis that the two new independent variables have no explanatory value. (The conclusion is the same at the 10% level, where the table value is 2.52.)
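The F ratio for the added variables can be recomputed straight from the two ANOVA printouts. The sketch below assumes 2 numerator degrees of freedom (one for each added variable) and uses 3.37 as the 5% table value for (2, 26) degrees of freedom; both the function name and that hand-entered table value are my own additions.

```python
def partial_f(ssr_full, ssr_reduced, sse_full, extra_vars, df_error):
    """F ratio for the variables added to the reduced regression."""
    ms_extra = (ssr_full - ssr_reduced) / extra_vars
    ms_error = sse_full / df_error
    return ms_extra / ms_error


# Regression SS with Value, Time, New; Regression SS with Value alone;
# Residual Error SS and DF from the three-variable regression.
f_stat = partial_f(4294.0, 4206.7, 250.7, extra_vars=2, df_error=26)

F_TABLE_05 = 3.37  # assumed 5% table value for (2, 26) degrees of freedom
reject_null = f_stat > F_TABLE_05
print(round(f_stat, 2), reject_null)
```

The computed ratio comes out near 4.5 whichever rounding of the sums of squares is used, so the added variables taken together do have explanatory value.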

MTB > regress c1 3 c2 c4 c3;

SUBC> dw;

SUBC> vif.

Regression Analysis: Price versus Value, Time, New

The regression equation is

Price = - 47.7 + 1.79 Value + 0.351 Time - 1.22 New

Predictor Coef SE Coef T P VIF

Constant -47.675 7.190 -6.63 0.000

Value 1.79394 0.09804 18.30 0.000 1.3

Time 0.3508 0.1298 2.70 0.012 1.0

New -1.218 1.322 -0.92 0.366 1.3

S = 3.105 R-Sq = 94.5% R-Sq(adj) = 93.8%

Analysis of Variance

Source DF SS MS F P

Regression 3 4294.0 1431.3 148.42 0.000

Residual Error 26 250.7 9.6

Total 29 4544.8

Source DF Seq SS

Value 1 4206.7

Time 1 79.2

New 1 8.2

Unusual Observations

Obs Value Price Fit SE Fit Residual St Resid

2 86.3 115.500 107.862 1.777 7.638 3.00R

11 60.8 69.340 63.502 1.487 5.838 2.14R

20 74.3 83.000 89.492 0.778 -6.492 -2.16R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 2.60


II. Do at least 4 of the following 7 Problems (at least 15 each) (or do sections adding to at least 60 points - Anything extra you do helps, and grades wrap around). Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests. Remember: 1) Data must be in order for Lilliefors. 2) Make sure that you do not cross up x and y in regressions.

1. (Berenson et al. 1220) A firm believes that less than 15% of people remember their ads. A survey is taken to see what recall occurs with the following results (In these problems calculating proportions won’t help you unless you do a statistical test):

                    Medium
              Mag     TV   Radio   Total
Remembered     25     10       8      43
Forgot         73     93     107     273
Total          98    103     115     316

a. Test the hypothesis that the recall rate is less than 15% by using proportions calculated from the ‘Total’ column. Find a p-value for this result. (5)

b. Test the hypothesis that the proportion recalling was lower for Radio than TV. (4)

c. Test to see if there is a significant difference in the proportion that remembered according to the medium. (6)

d. The Marascuilo procedure says that if (i) equality is rejected in c) and (ii) $|\bar p_2 - \bar p_3| > \sqrt{\chi^2}\, s_{\Delta p}$ (with $\bar p_2$ the TV proportion and $\bar p_3$ the Radio proportion), where the chi-squared is what you used in c) and the standard deviation is what you would use in a confidence interval solution to b), you can say that you have a significant difference between TV and Radio. Try it! (5)

Solution: I have never seen so many people lose their common sense as did so on this problem. Many of you seemed to think that the answer to c) was the answer to a) in spite of the fact that .15 appeared nowhere in your answer. An A student tried at one point to compare the total fraction that forgot with the total fraction that remembered, even though the method she used was intended to compare fractions of two different groups and the two fractions she compared could only have been the same if they were both .5, since they had to add to one.

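a) From the formula table:

Interval for:         Proportion
Confidence Interval:  $p = \bar p \pm z_{\alpha/2}\, s_p$, where $s_p = \sqrt{\bar p \bar q / n}$ and $\bar q = 1 - \bar p$
Hypotheses:           $H_0: p = p_0$, $H_1: p \ne p_0$
Test Ratio:           $z = (\bar p - p_0)/\sigma_p$
Critical Value:       $p_{cv} = p_0 \pm z_{\alpha/2}\, \sigma_p$, where $\sigma_p = \sqrt{p_0 q_0 / n}$ and $q_0 = 1 - p_0$

The firm's belief is $H_1: p < .15$. It is an alternate hypothesis because it does not contain an equality. The null hypothesis is thus $H_0: p \ge .15$. Initially, assume $\alpha = .05$ and note that $n = 316$ and $x = 43$, so that $\bar p = 43/316 = .1361$. This is a one-sided test and $z_{.05} = 1.645$. Also, $\sigma_p = \sqrt{.15(.85)/316} = .0201$. This problem can be done in one of three ways.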


(i) The test ratio is $z = (\bar p - p_0)/\sigma_p = (.1361 - .15)/.0201 = -0.69$. Make a diagram of a normal curve with a mean at zero and a reject zone below $-z_{.05} = -1.645$. Since $z = -0.69$ is not in the 'reject' zone, do not reject $H_0$. We cannot say that the proportion who recall is significantly below 15%. We can use this to get a p-value. Since our alternate hypothesis is $H_1: p < .15$, we want a down-side value, i.e. $P(\bar p \le .1361) = P(z \le -0.69) = .5 - .2549 = .2451$. Since the p-value is above the significance level, do not reject $H_0$. Make a diagram. Draw a Normal curve with a mean at .15 and represent the p-value by the area below .1361, or draw a Normal curve with a mean at zero and represent the p-value by the area below -0.69.

(ii) Since the alternative hypothesis says $H_1: p < .15$, we need a critical value that is below .15. We use $p_{cv} = p_0 - z_\alpha \sigma_p$. Make a diagram of a normal curve with a mean at .15 and a ‘reject’ zone below .1168. Since $\bar p = .1361$ is not in the 'reject' zone, do not reject $H_0$. We cannot say that the proportion is significantly below 15%.

(iii) To do a confidence interval we need $s_p = \sqrt{\bar p \bar q / n} = \sqrt{.1361(.8639)/316} = .0193$. To make the 2-sided confidence interval, $p = \bar p \pm z_{\alpha/2}\, s_p$, into a 1-sided interval, go in the same direction as $H_1: p < .15$. We get $p \le \bar p + z_\alpha s_p$. Thus the interval is $p \le .1361 + 1.645(.0193) = .1678$. $p \le .1678$ does not contradict the null hypothesis.
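The test ratio, p-value and critical value for part a) can be checked with a few lines of standard-library Python. This is a sketch, not part of the exam: `normal_cdf` is my own helper built on `math.erf`, so the p-value differs slightly from the one read off a two-decimal Normal table.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


n, x, p0 = 316, 43, 0.15
p_bar = x / n
sigma_p = math.sqrt(p0 * (1 - p0) / n)   # standard error under the null

z = (p_bar - p0) / sigma_p               # test ratio
p_value = normal_cdf(z)                  # lower tail, since H1 is p < .15
p_cv = p0 - 1.645 * sigma_p              # one-sided 5% critical value

print(round(z, 2), round(p_value, 3), round(p_cv, 3))
```

All three views agree: the test ratio is above -1.645, the p-value is well above .05, and the sample proportion is above the critical value, so the null hypothesis is not rejected.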

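b) We are comparing $p_2$ (TV) and $p_3$ (Radio).

Interval for:         Difference between proportions, $\Delta p = p_3 - p_2$
Confidence Interval:  $\Delta p = \Delta\bar p \pm z_{\alpha/2}\, s_{\Delta p}$, where $s_{\Delta p} = \sqrt{\bar p_2 \bar q_2/n_2 + \bar p_3 \bar q_3/n_3}$
Hypotheses:           $H_0: \Delta p = \Delta p_0$, $H_1: \Delta p \ne \Delta p_0$
Test Ratio:           $z = (\Delta\bar p - \Delta p_0)/\sigma_{\Delta p}$; if $\Delta p_0 = 0$, $\sigma_{\Delta p} = \sqrt{\bar p_0 \bar q_0 (1/n_2 + 1/n_3)}$ with $\bar p_0 = (n_2 \bar p_2 + n_3 \bar p_3)/(n_2 + n_3)$. Or use $s_{\Delta p}$.
Critical Value:       $\Delta p_{cv} = \Delta p_0 \pm z_{\alpha/2}\, \sigma_{\Delta p}$

$\bar p_2 = 10/103 = .0971$, $\bar p_3 = 8/115 = .0696$, $\Delta\bar p = \bar p_3 - \bar p_2 = -.0275$, $\bar p_0 = 18/218 = .0826$ and $\sigma_{\Delta p} = \sqrt{.0826(.9174)(1/103 + 1/115)} = .0373$. Note that $\bar q = 1 - \bar p$ and that $\bar p$ and $\bar q$ are between 0 and 1.

Our hypotheses are $H_0: p_3 \ge p_2$, $H_1: p_3 < p_2$ or $H_0: p_3 - p_2 \ge 0$, $H_1: p_3 - p_2 < 0$ or $H_0: \Delta p \ge 0$, $H_1: \Delta p < 0$.

There are three ways to do this problem. Only one is needed.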


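(i) Test Ratio: With $\Delta p$ defined as the Radio proportion minus the TV proportion, $z = (\Delta\bar p - \Delta p_0)/\sigma_{\Delta p} = (-.027522 - 0)/.037338 = -0.7371$. Make a diagram showing a 'reject' region below -1.645. Since -0.7371 is above this value, do not reject $H_0$.

(ii) Critical Value: $\Delta p_{cv} = \Delta p_0 \pm z_{\alpha/2}\, \sigma_{\Delta p}$ becomes $\Delta p_{cv} = \Delta p_0 - z_\alpha \sigma_{\Delta p} = 0 - 1.645(.037338) = -0.06142$. Make a diagram showing a 'reject' region below -0.06142. Since $\Delta\bar p = -.0275$ is not below this value, do not reject $H_0$.

(iii) Confidence Interval: $\Delta p = \Delta\bar p \pm z_{\alpha/2}\, s_{\Delta p}$ becomes $\Delta p \le \Delta\bar p + z_\alpha s_{\Delta p} = -.0275 + 1.645(.0376) = .0343$. Since $\Delta p \le .0343$ does not contradict $H_0: \Delta p \ge 0$, do not reject $H_0$.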
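The test ratio of -0.7371 and the critical value of -0.06142 can be reproduced from the counts in the table. The sketch below assumes the difference is taken as Radio minus TV and uses the pooled proportion for the standard error, which is the usual choice when the null hypothesis is a zero difference; the variable names are my own.

```python
import math

x_tv, n_tv = 10, 103          # TV: remembered, total
x_radio, n_radio = 8, 115     # Radio: remembered, total

diff = x_radio / n_radio - x_tv / n_tv            # Radio minus TV
p_pooled = (x_tv + x_radio) / (n_tv + n_radio)    # pooled proportion under H0
sigma_diff = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_tv + 1 / n_radio))

z = diff / sigma_diff                  # test ratio
critical_diff = -1.645 * sigma_diff    # one-sided 5% critical value

print(round(z, 4), round(critical_diff, 5))
```

Because the test ratio is above -1.645 (equivalently, the observed difference is above the critical difference), the Radio proportion is not significantly lower than the TV proportion.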

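c) $H_0$: the proportion that remembered is the same for all three media ($p_1 = p_2 = p_3$). $H_1$: not all the proportions are equal.

The proportions in rows, $p_r$ (43/316 = .1361 and 273/316 = .8639), are used with column totals to get the items in $E$. Note that row and column sums in $E$ are the same as in $O$. (Note that $\chi^2$ is computed two different ways here - only one way is needed.)

Row      O        E         E-O      (E-O)^2   (E-O)^2/E     O^2/E
1       25     13.3358   -11.6642    136.053    10.2020      46.866
2       73     84.6642    11.6642    136.053     1.6070      62.943
3       10     14.0162     4.0162     16.130     1.1508       7.135
4       93     88.9838    -4.0162     16.130     0.1813      97.198
5        8     15.6492     7.6492     58.510     3.7389       4.090
6      107     99.3508    -7.6492     58.510     0.5889     115.238
Total  316    316.000      0.0000               17.4689     333.469

(Rows 1-2 are Mag, 3-4 are TV and 5-6 are Radio; odd rows are ‘Remembered’ and even rows are ‘Forgot.’)

Either $\chi^2 = \sum (E-O)^2/E = 17.4689$ or $\chi^2 = \sum O^2/E - n = 333.469 - 316 = 17.469$. The degrees of freedom are $(r-1)(c-1) = (1)(2) = 2$ and $\chi^2_{.05}(2) = 5.9915$. Since the computed $\chi^2$ here is greater than the $\chi^2$ from the table, we reject $H_0$.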
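The chi-squared computation can be sketched in a few lines. This version builds each expected count as (row total)(column total)/(grand total) without rounding the row proportions first, so it differs from the hand computation in the last decimal places; 5.9915 is the 5% table value for 2 degrees of freedom, entered by hand.

```python
# Observed counts: rows are Remembered / Forgot, columns are Mag / TV / Radio.
observed = [[25, 10, 8],
            [73, 93, 107]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # expected count
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
CHI_SQ_TABLE_05 = 5.9915  # 5% table value for 2 degrees of freedom
print(round(chi_sq, 3), df, chi_sq > CHI_SQ_TABLE_05)
```

Most of the total comes from the Mag column, where 25 remembered against an expected count of about 13.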

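d) The Marascuilo procedure says that if (i) equality is rejected in c) and (ii) $|\bar p_2 - \bar p_3| > \sqrt{\chi^2}\, s_{\Delta p}$ (with $\bar p_2$ the TV proportion and $\bar p_3$ the Radio proportion), where the chi-squared is what you used in c) and the standard deviation is what you would use in a confidence interval solution to b), you can say that you have a significant difference between TV and Radio.

OK - We already have $|\bar p_2 - \bar p_3| = |.0971 - .0696| = .0275$ and $s_{\Delta p} = \sqrt{.0971(.9029)/103 + .0696(.9304)/115} = .0376$. I guess we really should use $\chi^2_{.05}(2) = 5.9915$ and $\sqrt{5.9915}\,(.0376) = .0920$. Since .0275 is obviously smaller than this, we do not have a significant difference in these 2 proportions.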
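The Marascuilo comparison for TV against Radio can be sketched as below. It assumes the table value of chi-squared (5.9915 for 2 degrees of freedom at 5%) is the one to use, together with the unpooled standard deviation from the confidence interval in b); the variable names are my own.

```python
import math

p_tv, n_tv = 10 / 103, 103
p_radio, n_radio = 8 / 115, 115

# Standard deviation used in the confidence interval for the difference.
s_diff = math.sqrt(p_tv * (1 - p_tv) / n_tv
                   + p_radio * (1 - p_radio) / n_radio)

CHI_SQ_TABLE_05 = 5.9915  # assumed 5% table value, 2 degrees of freedom
critical_range = math.sqrt(CHI_SQ_TABLE_05) * s_diff
abs_diff = abs(p_tv - p_radio)

significant = abs_diff > critical_range
print(round(abs_diff, 4), round(critical_range, 4), significant)
```

The observed difference is less than a third of the critical range, so the rejection in c) cannot be attributed to the TV and Radio pair.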


2. (Berenson et al. 1142) A manager is inspecting a new type of battery. These are subjected to 4 different pressure levels and their time to failure is recorded. The manager knows from experience that such data is not normally distributed. Ranks are provided.

PRESSURE

Use low rank normal rank high rank whee! rank

1 8.2 11 7.9 9 6.2 4 5.3 1

2 8.3 12 8.4 13 6.5 5 5.8 2

3 9.4 15 10.0 17 7.3 7 6.1 3

4 9.6 16 11.1 18 7.8 8 6.9 6

5 11.9 19 12.5 20 9.1 14 8.0 10

a. At the 5% level analyze the data on the assumption that each column represents a random sample. Do the column medians differ? (5)

b. Rerank the data appropriately and repeat a) on the assumption that the data is non-normal but cross classified by use. (5)

c. This time I want to compare high pressure (H) against low - moderate pressure (L). I will write out the numbers 1-20 and label them according to pressure.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

H H H H H H H H L H L L L H L L L L L L

Do a runs test to see if the H’s and L’s appear randomly. This is called a Wald-Wolfowitz test for the equality of means in two nonnormal samples. The null hypothesis is that the sequence is random and the means are equal. What is your conclusion? (5)

Solution: a) This is a Kruskal-Wallis test, equivalent to one-way ANOVA when the underlying distribution is non-normal.

$H_0$: Columns come from same distribution or medians equal. I am basically copying the outline. There are $n = 20$ data items, so rank them from 1 to 20. Let $n_i$ be the number of items in column $i$ and $SR_i$ be the rank sum of column $i$. Here $n_i = 5$ for every column.

Use     low   rank   normal   rank   high   rank   whee!   rank
1       8.2    11      7.9      9     6.2     4     5.3      1
2       8.3    12      8.4     13     6.5     5     5.8      2
3       9.4    15     10.0     17     7.3     7     6.1      3
4       9.6    16     11.1     18     7.8     8     6.9      6
5      11.9    19     12.5     20     9.1    14     8.0     10
SR_i           73              77            38             22

To check the ranking, note that the sum of the four rank sums is 73 + 77 + 38 + 22 = 210, and that the sum of the first $n$ numbers is $n(n+1)/2 = 20(21)/2 = 210$.

Now, compute the Kruskal-Wallis statistic $H = \frac{12}{n(n+1)} \sum_i \frac{SR_i^2}{n_i} - 3(n+1) = \frac{12}{20(21)} \left( \frac{73^2 + 77^2 + 38^2 + 22^2}{5} \right) - 3(21) = 75.349 - 63 = 12.349$.

If the size of the problem is larger than those shown in Table 9, use the $\chi^2$ distribution, with $df = m - 1$, where $m$ is the number of columns. Compare $H = 12.349$ with $\chi^2_{.05}(3) = 7.8147$. Since $H$ is larger than $\chi^2_{.05}(3)$, reject the null hypothesis.
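The ranking and the Kruskal-Wallis statistic can be reproduced from the raw failure times. This is a sketch that relies on there being no tied values in this data set (ties would need average ranks); the names are my own and 7.8147 is the 5% chi-squared table value for 3 degrees of freedom, entered by hand.

```python
# Time to failure under each pressure level, from the problem table.
columns = {
    "low":    [8.2, 8.3, 9.4, 9.6, 11.9],
    "normal": [7.9, 8.4, 10.0, 11.1, 12.5],
    "high":   [6.2, 6.5, 7.3, 7.8, 9.1],
    "whee!":  [5.3, 5.8, 6.1, 6.9, 8.0],
}

# Rank all 20 values together; valid as written only because no values tie.
pooled = sorted(v for col in columns.values() for v in col)
rank_of = {value: position + 1 for position, value in enumerate(pooled)}
rank_sums = {name: sum(rank_of[v] for v in col) for name, col in columns.items()}

n = len(pooled)
h = (12 / (n * (n + 1))
     * sum(rank_sums[name] ** 2 / len(col) for name, col in columns.items())
     - 3 * (n + 1))

CHI_SQ_TABLE_05 = 7.8147  # 5% table value for 3 degrees of freedom
print(rank_sums, round(h, 3), h > CHI_SQ_TABLE_05)
```

The rank sums match the ones given in the table, and the statistic is well above the table value, so the hypothesis of equal medians is rejected.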