Multiple Regression Analysis and Regression Diagnostics


CHAPTER 14

MULTIPLE REGRESSION ANALYSIS AND REGRESSION DIAGNOSTICS

1. a. Multiple regression equation

b. One dependent, four independent

c. A partial regression coefficient

d. 0.002

e. 105.014, found by ŷ = 11.6 + 0.4(6) + 0.286(280) + 0.112(97) + 0.002(35)

3. a. 142000, found by ŷ = 120000 - 21000(3) + 1.0(10000) + 1.5(50000)

b. Holding the values of the other variables constant, one additional competitor, on average, reduces sales by $21,000. The other coefficients are interpreted similarly.

c. Estimated sales revenue = $95,000. Cost of running Pizza Delight = 95000(0.65) = $61,750. Profit = 95000 - 61750 = $33,250, which is much larger than the salary of $10,000. Diane should accept her friend's advice.
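The arithmetic in 3(a) and 3(c) can be checked with a short sketch. The function and argument names below are illustrative labels, not from the problem text.

```python
# Illustrative check of the arithmetic in 3(a) and 3(c).
def predicted_sales(competitors, x2, x3):
    """Fitted equation from 3(a): y-hat = 120000 - 21000*X1 + 1.0*X2 + 1.5*X3."""
    return 120000 - 21000 * competitors + 1.0 * x2 + 1.5 * x3

sales = predicted_sales(3, 10000, 50000)   # 142000.0

# Part (c): profit at an estimated revenue of 95000 with a 65% cost ratio
revenue = 95000
cost = revenue * 0.65                      # 61750.0
profit = revenue - cost                    # 33250.0
```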

5. a.

Source        df      SS       MS       F
Regression     3    7500     2500      18
Error         18    2500      138.89
Total         21   10000

b. Ho: β1 = β2 = β3 = 0; H1: Not all β's are 0. Reject Ho if F > 3.16.

Since F = 18, Reject Ho. Not all partial regression coefficients equal zero.
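The F-ratio in 5(b) follows directly from the ANOVA values above; a minimal sketch (the critical value 3.16 is taken from the text):

```python
# F-ratio from the ANOVA table in 5(a).
ssr, df_reg = 7500, 3      # regression sum of squares and df
sse, df_err = 2500, 18     # error sum of squares and df

msr = ssr / df_reg         # 2500.0
mse = sse / df_err         # ~138.89
f_stat = msr / mse         # 18.0
f_crit = 3.16              # critical F at the 5% level, as given in the text
reject_h0 = f_stat > f_crit   # True: reject Ho
```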

c.
          For X1        For X2        For X3
Ho:       β1 = 0        β2 = 0        β3 = 0
H1:       β1 ≠ 0        β2 ≠ 0        β3 ≠ 0
t-value:  4.00          1.50          3.00

Reject Ho if t > 2.101 or t < -2.101.

Variable X2 may be deleted.
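The two-tailed decision rule in 5(c) can be sketched as a comparison of each |t| with the critical value:

```python
# Two-tailed t decisions in 5(c): reject Ho when |t| > 2.101.
t_crit = 2.101                      # critical t for df = 18, alpha = 0.05
t_values = {"X1": 4.00, "X2": 1.50, "X3": 3.00}

keep = [x for x, t in t_values.items() if abs(t) > t_crit]   # ['X1', 'X3']
drop = [x for x, t in t_values.items() if abs(t) <= t_crit]  # ['X2']
```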

7. a. The value of adjusted R2 = 0.7716 indicates that the independent variables together explain 77.16% of the variation in the dependent variable, after adjusting for the loss of degrees of freedom due to the number of coefficients (four, including the constant term) estimated by the model.

b. Ho: β1 = β2 = β3 = … = βk = 0 and

H1: At least one of β1, β2, β3, …, βk ≠ 0

The sample value of the test statistic is F = 14.648; the critical value is F0.05 = 3.41.

Since F > F0.05, we reject Ho and conclude that the independent variables together are statistically significant.

c. The critical value of t for a one-tailed test of hypothesis at the 5% level of significance (df = 25) is 1.708. Reject Ho if t > 1.708 or t < -1.708.

Ho: 1 = 0Ho: 2 = 0Ho: 3 = 0

H1: 1 > 0H1: 2 < 0H1: 3 < 0

t = 6.25t = -3.16t = 2.04

All variables are statistically significantly at 5% level of significance.

d. Ho: β2 = -1; H1: β2 ≠ -1. We need the standard error of b2 (Sb2) to find the appropriate sample value of the t ratio for this Ho. Since the sample value of t given in question 4 (as is the case with most computer programs) is for Ho: β2 = 0, i.e., t = (b2 - 0)/Sb2, we can recover the standard error as Sb2 = (b2 - 0)/t. In our case, Sb2 = 0.9902/3.16 = 0.313. The sample value of t for Ho: β2 = -1 can then be found as (-0.9902 - (-1))/0.313 = 0.031. This sample value of t = 0.031 is much smaller than the critical value of t (= 2.06). We do not reject Ho and conclude that the sample evidence is consistent with a unitary price elasticity (β2 = -1) at the 5% level of significance.
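The trick in part (d), recovering the standard error from the printed t-ratio and then re-testing against a non-zero value, can be sketched directly:

```python
# Recover Sb2 from the printed t (which tests Ho: beta2 = 0),
# then re-test Ho: beta2 = -1.
b2 = -0.9902
t_vs_zero = -3.16                  # reported t for Ho: beta2 = 0
sb2 = b2 / t_vs_zero               # ~0.313

t_vs_minus_one = (b2 - (-1)) / sb2 # ~0.031
t_crit = 2.06
reject_h0 = abs(t_vs_minus_one) > t_crit   # False: do not reject Ho
```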

9. a. n = 40

b. 4

c. R2 = 750/1250 = 0.60

d. Se = √(SSE/(n - k - 1)) = √(500/35) ≈ 3.78

e. Ho: β1 = β2 = β3 = β4 = 0; H1: Not all β's equal 0.

F0.05 = 2.65; Ho is rejected if F > 2.65.

Sample F = (750/4)/(500/35) = 13.13 > 2.65, so Ho is rejected. At least one βi does not equal zero.
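The sums-of-squares arithmetic for question 9 can be reproduced in a few lines (the critical value 2.65 is taken from the text):

```python
# Sums-of-squares arithmetic for question 9.
ssr, sst = 750, 1250
sse = sst - ssr               # 500
n, k = 40, 4

r2 = ssr / sst                               # 0.60
f_stat = (ssr / k) / (sse / (n - k - 1))     # 187.5 / 14.29 ~= 13.125
reject_h0 = f_stat > 2.65                    # True: reject Ho
```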

11. a. n = 26

b. R2 = 100/140 = 0.7143

c. 1.4142, found by Se = √(SSE/(n - k - 1)) = √(40/20)

d. Ho: β1 = β2 = β3 = β4 = β5 = 0; H1: Not all β's are 0. F0.05 = 2.71;

Reject Ho if F > 2.71. Sample value of F = 20/2 = 10.0. Reject Ho. At least one regression coefficient is not zero.
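The same arithmetic for question 11, including the standard error of estimate in part (c), can be sketched as:

```python
import math

# Sums-of-squares arithmetic for question 11.
ssr, sst = 100, 140
sse = sst - ssr                     # 40
n, k = 26, 5

r2 = ssr / sst                      # ~0.7143
se = math.sqrt(sse / (n - k - 1))   # sqrt(2) ~= 1.4142
f_stat = (ssr / k) / (sse / (n - k - 1))   # 20 / 2 = 10.0
```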

e. Ho is rejected in each case if t < -2.086 or t > 2.086. X1 and X5 may be dropped.

13. a. (i) Model #1 has all three variables, but Model #2 and Model #3 have only two variables. However, Model #2 has the largest value of the adjusted R2, indicating that one of the variables (Km) does not contribute at all to the explanation in Model #1. However, when Price is regressed on Km alone (without Age, Model #3), Km does provide a good explanation of the variation in price, though not as good as the Age variable does (Model #2).

(ii) The sign of the Km variable is not consistent with its expected sign (negative). We have a high R2 but a very low t ratio for the coefficient of Km. Although the VIF is not very high, the simple correlation between the Age and Km variables is about 0.8. Since Km does have the expected sign in Model #3, we can conclude that a high degree of multicollinearity between the Age and Km variables is responsible for the wrong sign of the coefficient of Km in Model #1.

(iii) Based on the explanations given in (i) and (ii) above, the researcher is justified in dropping Km from Model #1.

b. (i) Model #2: The variables Age and Kingcab together explain nearly 88.5% of the variation in price. The coefficient of Age indicates that as trucks get older by a year, they lose, on average, $1,606 in price, holding the Kingcab variable constant. Similarly, holding Age constant, Kingcab-style trucks command, on average, a higher price by $3,801.60.

(ii) The test for the model is based on the F statistic. The value of F is 88.17 with an associated P-value of 0. Thus, the Ho for the model (coefficients of both variables equal zero, against the H1 that at least one of the coefficients is non-zero) must be rejected at any positive level of significance. The critical value of the t-statistic for a one-sided alternative at 1% and df = 23 is 2.5. The t-value for the Age coefficient satisfies |-8.72| > 2.5, and the t-value for the Kingcab coefficient satisfies 5.62 > 2.5. Both coefficients are statistically significant at the 1% level.

c. (i) All observations are within the 95% prediction interval, indicating the absence of any major outliers. A linear regression line seems to provide a good fit.

(ii) In both charts 4 and 5, the spread of observations seems larger in the approximate mid-price range. However, there is not sufficient evidence of heteroscedasticity. An LM test for heteroscedasticity fails to reject the null hypothesis of homoscedasticity. See the Notes on testing for homoscedasticity above.

(iii) The chart does not show sufficient evidence of autocorrelation. From the printout given in your question, the sample value of the d-statistic is 2.59; since this is larger than 2, we suspect negative autocorrelation. The critical values of the d-statistic (for negative autocorrelation) for k = 2 and n = 26 are (4 - dU) = (4 - 1.312) = 2.688 and (4 - dL) = (4 - 1.001) = 2.999. Since the sample value of the d-statistic is smaller than 2.688, we do not reject the null hypothesis of zero autocorrelation. We should note that, in cross-section data, examining autocorrelation based on the value of the Durbin-Watson statistic is not very meaningful unless the data have been ordered in some natural manner. Based on the data set ordered by price in ascending order, the value of the d-statistic is 2.24, which supports the same conclusion. Try this as an exercise!

(iv) For normality of the error term, we expect chart 1 to be close to a straight line (except for the segments at the end-points) and chart 2 to be close to a symmetric bell-shaped histogram. Chart 1 does not quite resemble a straight line, nor does chart 2 resemble a bell-shaped histogram. Visually, there does seem to be some departure from normality; it does not, however, appear to be serious.

(v) Considering all charts together, there does not seem to be any serious problem with estimation or inference. However, we should note one caveat. The linear form of the model implies that truck prices decrease, on average, by a constant amount per year. One may find this unsatisfactory, as newer trucks may fall in price faster than older trucks. This type of behaviour may be captured by an additional dummy variable for the first two or three years (1 for the first two or three years and 0 for other years) for the Age variable. Alternatively, use a semi-log function: regress the log of the price variable against the Age and Kingcab variables. In this case, the coefficient of Age would give us the average percentage change in price in response to a unit increase (one year) in Age (holding Kingcab constant), and the coefficient of Kingcab would give us the average percentage change in price (holding Age constant). When we tried this formulation, the results were: Log(Price) = 9.89 - 0.124 Age + 0.281 Kingcab. Thus, a $20,000 truck, after one year, would on average decrease in price by $2,480, and a $10,000 truck by $1,240. An added bonus of the log formulation is that it helps to reduce heteroscedasticity, if any, by compressing the variance of the error term, nearly 10 times. If you love challenges, try this yourself and replicate your answers to this exercise with this new functional form.
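The semi-log interpretation above can be checked with a line or two of arithmetic. The Age coefficient of -0.124 is taken from the fit reported in the text; reading it as an approximate 12.4% price drop per year of age gives the quoted dollar figures.

```python
# Approximate first-year depreciation implied by the reported semi-log fit
# Log(Price) = 9.89 - 0.124 Age + 0.281 Kingcab.
age_coef = -0.124

def first_year_drop(price):
    """Approximate dollar drop in the first year for a truck at `price`."""
    return price * abs(age_coef)

drop_20k = first_year_drop(20000)   # 2480.0
drop_10k = first_year_drop(10000)   # 1240.0
```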

15. a. (i) Model #1 has all three variables, but Model #2 does not include the dummy variable for recession. However, Model #2 has a higher value of the adjusted R2 (59.2% compared to 56.5%), indicating that the recession variable does not contribute significantly to the explanation in Model #1.

(ii) The coefficients of both PDI and Price have the expected signs. The positive sign of the coefficient of the Recession variable indicates that people tend to drink more during a recession, possibly due to longer hours at home (fewer jobs during a recession)! Except for the Recession variable, the t-ratios are reasonably high. Neither the simple correlations among the independent variables nor the VIFs provide any indication of multicollinearity among the independent variables.

(iii) Yes. There is no a priori theoretical reason to include the Recession dummy in the model. Since the dummy variable is not significant, the researcher is justified in choosing Model #2.

b. (i) Model #2: The variables PDI and Price together explain nearly 59.1% of the variation in Volume. The coefficient of PDI indicates that a $1000 increase in PDI results, on average, in a 3.34-liter increase in beer consumption (per adult, per year), holding the price of beer constant. Similarly, holding PDI constant, a one-dollar increase in the price of beer results, on average, in a decrease in beer consumption of 42.8 liters (per adult, per year).

(ii) The test for the model is based on the F statistic. The value of F is 14.05 with an associated P-value near zero. The Ho for the model (coefficients of both variables equal zero, against the H1 that at least one of the coefficients is non-zero) is rejected at any positive level of significance. The critical value of the t-statistic for a one-sided alternative at 1% and df = 16 is 2.583. The t-value for the PDI coefficient satisfies 3.06 > 2.583, and the t-value for the Price coefficient satisfies -5.27 < -2.583. Both coefficients are statistically significant at the 1% level. This is also indicated by the P-values of 0.007 and 0.000 for these coefficients.

c. (i) No. All observations are within the 95% prediction interval, indicating the absence of any major outliers. A linear regression model seems to provide a good fit.

(ii) In both charts 4 and 5, the spread of observations seems larger in the lower to mid-volume range. However, there is not sufficient evidence of heteroscedasticity. For a quantitative test, see the Notes on testing for heteroscedasticity above.

(iii) Both charts 2 and 3 show the presence of autocorrelation. Further, the sample value of the d-statistic (= 0.97), being smaller than 2, indicates positive autocorrelation. The critical values of the d-statistic (for positive autocorrelation) for k = 2 and n = 19 are dU = 1.536 and dL = 1.074. Since the sample value of the d-statistic is smaller than 1.074, we reject the null hypothesis of zero autocorrelation.
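The Durbin-Watson statistic used in (iii) can be sketched in a few lines. A value of d near 2 suggests no autocorrelation; well below 2, positive autocorrelation; well above 2, negative autocorrelation.

```python
# Minimal sketch of the Durbin-Watson statistic.
def durbin_watson(residuals):
    """d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Identical consecutive residuals give d = 0 (extreme positive autocorrelation);
# strictly alternating residuals push d toward 4 (negative autocorrelation).
d_pos = durbin_watson([1.0, 1.0, 1.0, 1.0])         # 0.0
d_neg = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0])  # 3.2
```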

(iv) For normality of the error term, we expect chart 1 to be close to a straight line (except the segments at the end-points) and chart 2 to be close to a symmetric bell-shaped histogram. Neither Chart 1 resembles a straight line, nor chart 2 resembles a bell-shaped histogram. Visually, there does seem to be some departure form normality. It does not appear to be serious. Quantitative tests of normality are discussed in detail in Chapter 16. If you are anxious, you may try the simple JB test explained in solution to question 13 above.

(v) Except for the problem of autocorrelation, there does not seem to be any serious problem with estimation. Although the estimators are unbiased, the presence of autocorrelation would lead to wrong inferences. Adding a Trend variable as one of the independent variables removed the autocorrelation. See the results, and a note on other alternatives for testing and removing autocorrelation, below.

Detailed Results of including the Trend Variable:

The values of the Trend variable in the data set are set as numbers 1, 2, 3, 4,…, n, where n is the total number of observations in the sample for all other variables.

Inclusion of Trend in the regression gave the following results. The Durbin-Watson statistic is quite high (= 1.65). The critical values of the d statistic for k = 4, n = 19, and alpha = 0.05 are dL = 0.86 and dU = 1.85. Thus, though the value of d is quite high, it falls in the inconclusive region. To make sure we do not have autocorrelation, we used the LM test; the output is given below. LM = (n - 1)R2 = 18 × 0.013 = 0.234, which is less than the Chi-Square value of 3.84 (for df = 1 and alpha = 0.05). Thus, we have been successful in removing the autocorrelation by introducing a trend variable. This is also evident from the very small t value (= 0.39) of the coefficient of the lagged error term (Resi-Lag). Note that this also implies that the values of the coefficients, and the inferences based on them, from the earlier regression are not reliable. Recall that the omission of a relevant variable has serious consequences both in terms of bias and efficiency, and leads to misleading inferences. Now the Recession variable is also significant, and we have a much higher R2. Verify this yourself!

Re-estimated regression with an addition of Trend variable:

The regression equation is

Volume = 91.2 + 0.00147 PDI - 6.77 Price - 5.24 Recession - 1.79 Trend

Predictor Coef SE Coef T P VIF

Constant 91.21 27.84 3.28 0.006

PDI 0.001474 0.001508 0.98 0.345 2.2

Price -6.765 4.946 -1.37 0.193 3.2

Recessio -5.237 1.651 -3.17 0.007 1.6

Trend -1.7876 0.1867 -9.58 0.000 2.8

S = 2.673 R-Sq = 95.2% R-Sq(adj) = 93.8%

Durbin-Watson statistic = 1.65

Regression Analysis for Testing Autocorrelation (LM Test): RESI22 versus PDI, Price, Recession, Trend, Resi-Lag

The regression equation is

RESI = 3.6 - 0.00014 PDI - 0.32 Price + 0.20 Recession + 0.017 Trend

+ 0.146 Resi22L

18 cases used; 1 case contains missing values

Predictor Coef SE Coef T P VIF

Constant 3.59 31.34 0.11 0.911

PDI -0.000142 0.001718 -0.08 0.935 2.3

Price -0.322 6.003 -0.05 0.958 2.7

Recessio 0.198 1.852 0.11 0.917 1.5

Trend 0.0167 0.2049 0.08 0.936 2.5

Resi-Lag 0.1456 0.3697 0.39 0.701 1.3

S = 2.866 R-Sq = 1.3% R-Sq(adj) = 0.0%

Durbin-Watson statistic = 1.78

17. Results of the regression analysis on the data set are given below:

1. The estimated regression equation is:

Price = -1580 + 0.44 Lot + 10148 Bedroom - 613.7 Age + 66.2 Footage + 19509 Garage + 20266 A1 + 21620 A2

(t-values: Lot 2.13; Bedroom 1.31; Age -1.03; Footage 3.40; Garage 2.15; A1 1.81; A2 1.79)

Interpretation: An increase of one square foot in lot size, on average, increases the price of a house by 44 cents, assuming all other independent variables (bedroom, age, …) are held constant. An additional bedroom, on average, increases the price of a house by $10,148, assuming the rest of the independent variables in the equation are held constant. The other coefficients are interpreted in a similar manner.

2. The ANOVA table shows an F-value of 5.58 with a P-value of 0.0048.

3. The value of the Standard Error of Estimate (Se) = 16964 indicates that, on average, the estimated prices of houses deviate from the actual prices by nearly $16,964. Using the empirical rule of thumb for symmetric distributions, we expect the actual prices of about 95% of all houses to lie within the interval ŷ ± 2Se = ŷ ± $33,928.

4. The value of R2 is 0.765 and the value of Adj. R2 is 0.628. The value of R2 indicates that 76.5% of the sample variation in the dependent variable is explained by the independent variables in the model (based on the OLS technique). The value of Adj. R2 differs from the value of R2 because the Adj. R2 takes into account the loss of degrees of freedom due to the number of coefficients estimated by the OLS technique, and thus cautions the researcher against placing too much emphasis on the R2. Note that, theoretically, this does not imply that the Adj. R2 from the sample data is a better estimator (of the coefficient of determination in the population) than the sample value of R2!
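The relationship between R2 and Adj. R2 described above can be checked against the reported values for this regression with the standard formula:

```python
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# checked against the reported values R^2 = 0.765, n = 20, k = 7.
r2, n, k = 0.765, 20, 7
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # ~= 0.628, as reported
```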

5. Ho: β1 = β2 = … = β7 = 0 (the regression coefficients of all independent variables are simultaneously equal to zero). H1: At least one of the βi is not equal to zero. The test for the entire model is conducted through the F statistic for the model. As shown in the ANOVA table, the P-value for F of 0.0048 (= 0.48%) is smaller than the 5% level of significance. We reject Ho in favour of H1 and conclude that the coefficient of at least one of the independent variables is significantly different from zero.

6. We first determine the expected influence of each independent variable on the dependent variable. Our expectations are that, except for the Age variable, all the independent variables should affect price positively. We also assume that localities A1 and A2 are better compared to the omitted locality, so the coefficients of the A1 and A2 dummies are expected to be positive. We therefore write our null and alternative hypotheses as below:

Ho: β1 = 0, H1: β1 > 0 (Lot); Ho: β2 = 0, H1: β2 > 0 (Bedroom); Ho: β3 = 0, H1: β3 < 0 (Age); Ho: β4 = 0, H1: β4 > 0 (Footage); Ho: β5 = 0, H1: β5 > 0 (Garage); Ho: β6 = 0, H1: β6 > 0 (A1); Ho: β7 = 0, H1: β7 > 0 (A2).

In this regression, the degrees of freedom are n - k - 1 = 20 - 7 - 1 = 12. The (absolute) critical value of the t-statistic for a one-tailed test of hypothesis with 12 degrees of freedom and a 5% level of significance is 1.782. Thus, using this critical value, we would reject a null hypothesis if the absolute sample value of t is greater than 1.782. Except for the regression coefficients of the Bedroom and Age variables, all the regression coefficients are statistically significant. However, the absolute t-values of the regression coefficients of Age and Bedroom are larger than one, a rule of thumb for keeping a variable in the model.

Alternatively, we can use one-half of the reported P-values (for a one-tailed test) and reject the null hypothesis if the P-value is smaller than the level of significance. Doing so, we reach the same conclusion.

7. Given the R2 value and the information in the other parts of this question, we do not have sufficient reason to suspect any problem with model specification.

8. Examining the plot of residuals, all observations seem to lie within two standard errors. One observation looks like an outlier; however, it is still on the borderline of the upper two-standard-error limit.

9. Based on the simple correlations, we do not see sufficient evidence of multicollinearity. The largest simple correlation coefficient is 0.487, between the Bedroom and Lot variables.

10. The graph of residuals does not seem to show any autocorrelation. The Durbin-Watson (d) statistic for the regression is 2.13, which leads us to suspect negative autocorrelation. The null hypothesis of no autocorrelation would be rejected if the sample value of d were larger than 4 - dL = 4 - 0.595 = 3.405. Since the sample value of d (= 2.13) is smaller than the critical value (3.405), we do not reject the null hypothesis of no (negative) autocorrelation.

11. The graph of the squared error term against the predicted price does suggest that the variance increases for higher-priced houses. To be sure, we conduct a quantitative test by regressing the squared error term against all the independent variables. The P-value for the F-statistic in this equation is 0.12, which is much larger than the normally used 5% level of significance. Based on this quantitative test, we do not reject the null hypothesis of homoscedasticity.

12. A histogram of residuals does not indicate any serious departure from normality.

13. Based on all the findings in 1 to 12, we think that the estimated values of the parameters, and the inferences, are fairly reliable.

14. Write a descriptive report by combining all the findings from 1 to 13.

Computer Results of the Regression Analysis for Problem #17, using the Excel (MegaStat) package.

Regression Analysis (using MegaStat)

R² = 0.765        Adjusted R² = 0.628        R = 0.875
std. error of estimate = 16964
observations = 20    predictor variables = 7    dependent variable: Price

variables    coefficients           std. error      t (df=12)   p-value
intercept    B0 =  -1,580.0522
Lot          B1 =       0.4354          0.2048         2.13      .0550
Bedroom      B2 =  10,148.1768      7,743.1585         1.31      .2145
Age          B3 =    -613.7069        594.2509        -1.03      .3221
Footage      B4 =      66.2366         19.5044         3.40      .0053
Garage       B5 =  19,509.7741      9,055.6910         2.15      .0522
A1           B6 =  20,265.6274     11,173.3705         1.81      .0948
A2           B7 =  21,620.2104     12,087.0209         1.79      .0989

Durbin-Watson = 2.13
ANOVA table

Source        SS                      df    MS                      F       p-value
Regression    11,245,744,465.9007     7    1,606,534,923.7001      5.58     .0048
Residual       3,453,365,034.0993    12      287,780,419.5083
Total         14,699,109,500.0000    19
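The summary measures reported for this regression can be cross-checked against the ANOVA table: R2 = SSR/SST and F = MSR/MSE.

```python
# Cross-check of the ANOVA table: R^2 = SSR/SST and F = MSR/MSE.
ssr = 11_245_744_465.9007
sse = 3_453_365_034.0993
df_reg, df_err = 7, 12

sst = ssr + sse                            # 14,699,109,500.0000
r2 = ssr / sst                             # ~= 0.765
f_stat = (ssr / df_reg) / (sse / df_err)   # ~= 5.58
```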

Checking Multicollinearity:

Simple Correlations:

           Price      Lot        Bedroom    Age        Footage    Garage     A1         A2
Price       1
Lot         0.402041   1
Bedroom     0.050958   0.487017   1
Age        -0.3695    -0.07926    0.032287   1
Footage     0.549996  -0.10442   -0.41749   -0.24265    1
Garage      0.576723   0.242987  -0.22942   -0.04086    0.335739   1
A1         -0.05477   -0.3091    -0.24851    0.298395  -0.0372     0.104828   1
A2         -0.01155   -0.28121    0.016688  -0.33881   -0.10052   -0.21822   -0.48038   1

Checking Heteroscedasticity: