Use the High School and Beyond (HSB) Data Set

Homework 6

Covers Chapters 9 and 10

The data is explained in the HSB Read Me file.

USE MATH AS THE RESPONSE

1. Some researchers feel an interaction exists between Gender and Writing ability. Create a Gender*Writing interaction term. Based on the 14 predictor variables, how many models are possible (assume an intercept for all models)?

2^14 - 1 = 16,383
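As a quick check of this count, the following Python sketch computes the number of models directly and, for a small set of three predictors, enumerates the subsets to confirm the formula. The function names are illustrative only and are not part of the assignment or of Minitab.

```python
from itertools import combinations

def count_models(k):
    """Models using at least one of k predictors; the intercept is always included."""
    return 2 ** k - 1

def enumerate_models(predictors):
    """List every non-empty subset of predictors (practical only for small k)."""
    return [
        subset
        for size in range(1, len(predictors) + 1)
        for subset in combinations(predictors, size)
    ]

# Small illustration with three predictors: 2^3 - 1 = 7 models.
small = enumerate_models(["RDG", "SCI", "CIV"])
print(len(small), count_models(3))

# The homework case with 14 predictors.
print(count_models(14))
```

Enumerating all subsets is only practical for a handful of predictors, which is why automated procedures such as stepwise selection are used for 14.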

2. Using Math as the response, analyze the data using Minitab Backward, Forward, and Stepwise Regression (keep default settings). Specify the “best” regression equation identified by these three methods. How many steps did it take for each method? Do they agree?

Backward: NOTE: because of the interaction term, Writing was centered (cWRTG). Gender was not centered, since it is categorical and categorical variables are not centered.

MATH = 24.0 - 1.42 SEX + 0.835 SES + 0.256 RDG + 0.222 SCI + 0.0686 CIV + 0.281 cWRTG

Steps: 9
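The centering described in the Backward note, and the interaction term built from it, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the scores are made-up values rather than HSB data, the 0/1 coding of SEX is assumed, and the helper names are hypothetical.

```python
from statistics import mean

def center(values):
    """Subtract the sample mean from every value."""
    m = mean(values)
    return [v - m for v in values]

def interaction(indicator, centered):
    """Elementwise product of a 0/1 indicator and a centered variable."""
    return [g * c for g, c in zip(indicator, centered)]

# Made-up example values (NOT taken from the HSB data set).
sex = [0, 1, 1, 0, 1]          # assumed 0/1 coding; left uncentered
wrtg = [52, 59, 33, 44, 62]    # writing scores with mean 50

c_wrtg = center(wrtg)                    # centered writing, cWRTG
sex_x_cwrtg = interaction(sex, c_wrtg)   # Gender*Writing interaction term

print(c_wrtg)
print(sex_x_cwrtg)
```

Centering the continuous variable before forming the product reduces the correlation between the interaction term and its component, which is the usual motivation for this step.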

Forward:

MATH = 23.5 + 0.254 RDG + 0.281 cWRTG + 0.220 SCI - 1.41 SEX + 0.779 SES + 0.0691 CIV + 0.0931 CAR

Steps: 7

Stepwise:

MATH = 23.5 + 0.254 RDG + 0.281 cWRTG + 0.220 SCI - 1.41 SEX + 0.779 SES + 0.0691 CIV + 0.0931 CAR

Steps: 7

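Agree? No, they do not all agree. Forward and Stepwise select the same seven-variable model, while Backward ends with a six-variable model that omits CAR.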

3. In the Backward elimination analysis, which variable was removed first and why?

The variable School Type (SCTYP) is removed first, as it has the highest p-value (that is, the weakest relationship with Math after accounting for the other predictors) when Math is regressed on the full model.

4. In the Forward and Stepwise analyses which variable entered first and why?

The variable Reading enters first as it has the lowest p-value (or highest correlation to Math) when computing all of the simple linear regression models.
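The first step of forward selection can be sketched as picking the candidate with the largest absolute correlation with the response, since in a simple linear regression the slope's p-value decreases as the absolute correlation increases. The sketch below uses made-up scores, not HSB data, and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def first_forward_entry(response, candidates):
    """Name of the candidate with the largest |r| against the response."""
    return max(candidates, key=lambda name: abs(pearson_r(candidates[name], response)))

# Made-up scores for illustration only.
math_scores = [40, 45, 50, 55, 60]
candidates = {
    "RDG": [38, 47, 49, 56, 61],   # tracks math_scores closely
    "CIV": [50, 42, 55, 47, 52],   # only weakly related
}

print(first_forward_entry(math_scores, candidates))
```

Later steps of forward selection condition on the variables already in the model, so this shortcut applies to the first entry only.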

5. Based on your output from the Forward Selection analysis, if you had specified 0.15 as the alpha level for entry, would any variable(s) have been left out of the model obtained with the default settings? If so, which variable(s), and what information helps you answer this question?

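No, the result would have been the same. The p-values reported at each step of the Forward Selection output are the information that answers this: no variable is left out because each entered variable's p-value is below 0.15.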

6. In the Backward Elimination analysis, how much of a change in R-squared is there between the model in Step 4 and the final model?

In Step 4 the R-squared is 58.35%, and for the final model it is 58.03%, a decrease of 0.32 percentage points.

7. Now regress Math on all of the predictors and use Best Subsets in Minitab to determine the variables that comprise the best model using Cp closest to p, lowest Cp, R-squared, and adjusted R-squared. What are the variables and criterion values, and are the models the same? [Remember that the goal is to reduce the number of variables from the full model.]

Cp closest to p: Gender, SES, RDG, SCI, cWRTG

Value: Cp of 5.6

Lowest Cp: Gender, SES, CAR, RDG, SCI, CIV, cWRTG

Value: Cp of 3.1

R-squared: 12 variables total – includes all but Race and SCTYP

Value: R-squared of 58.4%

Adj. R-Squared: Gender, SES, CAR, RDG, SCI, CIV, cWRTG

Value: adj. R-squared of 57.7%

All models the same? No, although the lowest Cp and adjusted R-squared do select the same variables.
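For reference, the two adjusted criteria used above can be written as small Python functions. This is a sketch assuming the usual textbook definitions, with p counting the number of parameters including the intercept. The function names and the numbers in the example are hypothetical and are not taken from the Minitab output.

```python
def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' Cp for a subset model with p parameters (intercept included).

    sse_subset: error sum of squares of the subset model
    mse_full:   mean squared error of the full model
    n:          number of observations
    """
    return sse_subset / mse_full - (n - 2 * p)

def adjusted_r2(r2, n, p):
    """Adjusted R-squared (as a proportion) for a model with p parameters."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Hypothetical check: for the full model itself, SSE = MSE * (n - p), so Cp equals p.
n, p_full, mse_full = 600, 13, 2.0
sse_full = mse_full * (n - p_full)
print(mallows_cp(sse_full, mse_full, n, p_full))
```

Because Cp equals p for the full model by construction, the "Cp closest to p" rule is only informative for models smaller than the full model, which matches the reminder in the question.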

8. Regress Math on Reading, Writing, and Civics. Click Storage and select Cook’s Distance (Di). Determine whether any observation is influential by checking whether its Di value exceeds the 0.5 quantile of the F-distribution with p and n-p degrees of freedom. That is, find the cumulative F probability for each value in this column of Di values. If any cumulative probability exceeds 0.5, then that observation would be considered an outlier. Also, in the output under Unusual Observations, any observation marked with an “X” indicates an influential outlier. Do any exist in this regression analysis?

DF: 4 and 596

Number of Di values whose cumulative F probability exceeds 0.5: None. No cumulative F probabilities exceed 0.5.

Observations that are considered influential outliers: 45, 247, 284, 324, and 592
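To show how the stored Di values arise, the sketch below computes Cook's distance for a simple one-predictor regression using only the Python standard library. It is a simplified illustration, not the three-predictor model from this question: the data are made up, the function name is hypothetical, and p = 2 here (intercept and slope). The cumulative F probability step is omitted because it needs an F-distribution routine that is not in the standard library.

```python
def cooks_distances(x, y):
    """Cook's distance for each observation in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each observation in simple linear regression.
    leverages = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

    # D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
    return [
        (e * e) / (p * mse) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]

# Made-up data: the last point is far from the others in x and off the trend.
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
d = cooks_distances(x, y)
print(d.index(max(d)))  # position of the most influential observation
```

Each Di combines the size of the residual with the leverage of the observation, so a point can have a large Di either by fitting poorly, by sitting far from the other x values, or both.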