Class Handout #7 Homework Name ______

Exercise #1

The data stored in the SPSS data file paint is to be used in a study concerning the prediction of the drying time (hours) for an outdoor house paint. Temperature (degrees Fahrenheit), humidity (percent), wind velocity (miles per hour), and barometric pressure are to be considered as possible predictors, and the 22 house painting jobs selected for the data set are a random sample.

(a) Does the data appear to be observational or experimental?

Since the temperature, humidity, wind velocity, and barometric pressure are all random, the data is observational.

(b) In the document titled "Using SPSS Version 19.0", use SPSS with the section titled "Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions" to do each of the following:

Follow the instructions in the first six steps to graph the least squares line on a scatter plot for the dependent variable with each quantitative independent variable; then decide whether or not the linearity assumption appears to be satisfied.

For each of the quantitative predictors, the relationship looks reasonably linear, since the data points appear randomly distributed around the least squares line.

Continue to follow the instructions beginning with the 8th step (notice that step 7 is not necessary here) down to the 15th step to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, decide whether or not each of these assumptions appears to be satisfied.

There appears to be much variation, but it looks reasonably uniform.

The histogram of standardized residuals does not look too different from a normal, bell-shaped distribution, and the points on the normal probability plot do not seem to depart too much from the diagonal line.

(c) Based on the results in part (b), explain why we feel it is appropriate to proceed with the regression analysis.

The results in part (b) suggest that the linearity assumption, the uniform variance (homoscedasticity) assumption, and the normality assumption all appear to be satisfied.

(d) From the Correlations table of the SPSS output, comment on the possibility of multicollinearity in the multiple regression.

Since the correlation matrix does not contain any correlation greater than 0.8 for any pair of independent variables, there is no indication that multicollinearity will be a problem.

(e) With a 0.05 significance level, summarize the results of the f-test in the ANOVA table.

Since f_{4, 17} = 9.881 exceeds the critical value f_{4, 17; 0.05} = 2.96, we have sufficient evidence to reject H0 at the 0.05 level. We conclude that the linear regression to predict paint drying time from temperature, humidity, wind velocity, and barometric pressure is significant (p < 0.001).

(f) Use the SPSS output to find the least squares regression equation.

drying time = 556.576 - 0.265(temperature) - 0.113(humidity) - 0.641(wind velocity) - 0.639(barometric pressure)

(g) In the document titled "Using SPSS Version 19.0", use SPSS with the five instructions at the end of the section titled "Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions" to obtain the output for a stepwise regression.

(h) From the Collinearity Statistics section of the Coefficients table of the SPSS output, add to the comment on the possibility of multicollinearity in the multiple regression.

We see that tolerance > 0.10 (i.e., VIF < 10) for each independent variable, which is a further indication that multicollinearity will not be a problem.
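The tolerance and VIF cutoffs quoted above are two forms of the same rule, since VIF is the reciprocal of tolerance. A minimal Python sketch of that check follows; the function names and the tolerance values are hypothetical illustrations, not figures from the SPSS output.

```python
def vif_from_tolerance(tolerance: float) -> float:
    """Return the variance inflation factor, VIF = 1 / tolerance."""
    if not 0.0 < tolerance <= 1.0:
        raise ValueError("tolerance must lie in (0, 1]")
    return 1.0 / tolerance


def passes_rule_of_thumb(tolerance: float, cutoff: float = 0.10) -> bool:
    """True when tolerance > cutoff, which is the same as VIF < 1 / cutoff."""
    return tolerance > cutoff


# Hypothetical tolerances for the four predictors (NOT taken from the SPSS output).
example_tolerances = {
    "temperature": 0.82,
    "humidity": 0.74,
    "wind velocity": 0.91,
    "barometric pressure": 0.67,
}

for name, tol in example_tolerances.items():
    print(f"{name}: tolerance={tol:.2f}, VIF={vif_from_tolerance(tol):.2f}, "
          f"ok={passes_rule_of_thumb(tol)}")
```

With the default cutoff of 0.10, a tolerance of exactly 0.10 corresponds to a VIF of 10, which is why the two thresholds are stated together.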

(i) From the Variables Entered/Removed table of the SPSS output, find the default values of the significance level to enter an independent variable into the model and the significance level to remove an independent variable from the model.

These are 0.05 and 0.10, respectively.

(j) From the Variables Entered/Removed table of the SPSS output, find the number of steps in the stepwise multiple regression, and list the independent variables selected and removed at each step.

There were two steps in the stepwise multiple regression; the variable temperature was entered in the first step, and the variable wind velocity was entered in the second step. No variables were removed at either step.

(k) From the Correlations table of the SPSS output, find the ordinary correlation between the dependent variable paint drying time and the first independent variable entered into the model.

The correlation between paint drying time and temperature is -0.588.

(l) From the Excluded Variables table of the SPSS output, find the partial correlation between the dependent variable paint drying time and the second independent variable entered into the model given the first independent variable entered into the model; compare this to the ordinary correlation between the dependent variable paint drying time and the second independent variable entered into the model, which can be found from the Correlations table of the SPSS output.

The partial correlation between paint drying time and wind velocity given temperature is -0.655.

The ordinary correlation between paint drying time and wind velocity is 0.184. The partial correlation is much larger in magnitude than the ordinary correlation.

(m) From the Model Summary table of the SPSS output, find and interpret the change(s) in R^2 from the model at one step to the next step.

From the model at Step 1, we see that temperature accounts for 34.6% of the variance in paint drying time.

From the model at Step 2, we see that temperature and wind velocity together account for 62.6% of the variance in paint drying time. With temperature already in the model, wind velocity accounts for an additional 28.0% of the variance in paint drying time.
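The figures in parts (k) through (m) are tied together by two identities: with one predictor, R^2 is the square of the ordinary correlation, and the squared partial correlation of the second predictor equals the R^2 change divided by the variance left unexplained at Step 1. The sketch below checks this using only the magnitudes reported in the handout; the variable names are illustrative, and small differences are due to rounding in the reported values.

```python
import math

# Values reported in the handout (magnitudes only).
r_temperature = 0.588   # ordinary correlation, drying time vs. temperature
r2_step1 = 0.346        # R^2 with temperature only
r2_step2 = 0.626        # R^2 with temperature and wind velocity

# With a single predictor, R^2 equals the squared correlation.
print(f"r^2 = {r_temperature ** 2:.3f} (reported R^2 at Step 1: {r2_step1})")

# Change in R^2 when wind velocity is added.
r2_change = r2_step2 - r2_step1
print(f"R^2 change = {r2_change:.3f}")

# Squared partial correlation = R^2 change / (1 - R^2 at the previous step).
partial_r_magnitude = math.sqrt(r2_change / (1.0 - r2_step1))
print(f"|partial r| = {partial_r_magnitude:.3f} (reported: 0.655)")
```

This also explains why the partial correlation can be large even when the ordinary correlation with wind velocity is small: the partial correlation is measured against the variance that temperature leaves unexplained.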

(n) From the Coefficients table of the SPSS output, write the estimated regression equation for each step.

Step 1: drying time = 61.410 - 0.340(temperature)

Step 2: drying time = 74.441 - 0.397(temperature) - 0.628(wind velocity)

(o) Use the estimated regression equation from the final step of the stepwise multiple regression to predict the paint drying time with a temperature of 75 degrees Fahrenheit, a relative humidity of 55%, a wind velocity of 15 miles per hour, and a barometric pressure of 759.78.

drying time = 74.441 - 0.397(75) - 0.628(15) = 35.246 hours
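The prediction in part (o) uses only the two predictors retained by the stepwise procedure, so the humidity and barometric pressure values given in the question are not needed. A short sketch of the arithmetic follows; the function name is hypothetical, and both slopes are taken as negative, which is consistent with the stated result of 35.246 hours and with the "decreases" interpretations in part (p).

```python
def predict_drying_time(temperature_f: float, wind_mph: float) -> float:
    """Final stepwise model: intercept 74.441, slopes -0.397 and -0.628."""
    return 74.441 - 0.397 * temperature_f - 0.628 * wind_mph


prediction = predict_drying_time(temperature_f=75, wind_mph=15)
print(f"Predicted drying time: {prediction:.3f} hours")
```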

(p) For each of the estimated regression coefficients in the estimated regression equation from the final step of the stepwise multiple regression, write a one sentence interpretation describing what the coefficient estimates.

For each increase of one degree Fahrenheit in temperature, the drying time decreases on average by about 0.397 hours, with wind velocity held constant.

For each increase of one mile per hour in wind velocity, the drying time decreases on average by about 0.628 hours, with temperature held constant.

Exercise #2

The data stored in the SPSS data file wheat_yield is to be used in a study concerning the prediction of the wheat yield (bushels per acre) with possible predictors total rainfall (inches), average temperature (degrees Fahrenheit), and type of soil; there are three types of soil labeled A, B, and C. The data are from randomly selected observations over several seasons.

(a) List the independent variables, and indicate whether each is quantitative or qualitative.

total rainfall: quantitative

average temperature: quantitative

soil type: qualitative

(b) Define dummy variables to represent each qualitative independent variable; make the number of dummy variables defined equal to the number of categories, but then comment on the number of these dummy variables that would be sufficient.

X1 = 1 for soil type A, 0 otherwise

X2 = 1 for soil type B, 0 otherwise

X3 = 1 for soil type C, 0 otherwise

Any two of these dummy variables are sufficient to represent the qualitative independent variable soil type, since the third is determined by the other two.
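The reason two dummy variables suffice is that the three indicators always sum to 1, so any one of them can be computed from the other two. The sketch below illustrates this coding in Python; the function name is hypothetical and is not part of the SPSS recoding procedure used in part (c).

```python
def soil_dummies(soil_type: str) -> dict:
    """Code soil type A, B, or C as three 0/1 indicator variables."""
    if soil_type not in ("A", "B", "C"):
        raise ValueError("soil_type must be 'A', 'B', or 'C'")
    return {
        "X1": int(soil_type == "A"),
        "X2": int(soil_type == "B"),
        "X3": int(soil_type == "C"),
    }


for soil in ("A", "B", "C"):
    d = soil_dummies(soil)
    # The indicators sum to 1, so X1 = 1 - X2 - X3 is redundant given X2 and X3.
    print(soil, d, "X1 recovered from X2, X3:", 1 - d["X2"] - d["X3"])
```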

(c) In the document titled "Using SPSS Version 19.0", use SPSS with the section titled "Creating new variables by recoding existing variables" to recode the variable soil type into the first dummy variable in part (b); then repeat this for the other dummy variable(s) in part (b).

(d) In the document titled "Using SPSS Version 19.0", use SPSS with the section titled "Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions" to do each of the following:

For each of the quantitative predictors, the relationship looks reasonably linear, since the data points appear randomly distributed around the least squares line.

Continue to follow the instructions beginning with the 8th step (notice that step 7 was already done in part (c)) down to the 15th step to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, decide whether or not each of these assumptions appears to be satisfied.

The variation looks reasonably uniform.

The histogram of standardized residuals looks somewhat skewed, and the points on the normal probability plot seem to depart somewhat from the diagonal line.

(e) Based on the histogram and normal probability plot for the standardized residuals in part (d), explain why we might want to look at the skewness coefficient, the kurtosis coefficient, and the results of the Shapiro-Wilk test. Then use SPSS with the section titled "Data Diagnostics" to make a statement about whether or not non-normality needs to be a concern.

Since there appears to be some possible evidence of non-normality in part (d), we want to know if non-normality needs to be a concern.

Since the skewness and kurtosis coefficients are each within three standard errors of zero, and the Shapiro-Wilk p-value of 0.128 is not less than 0.001, non-normality need not be a concern in the regression.
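The decision rule used in this answer can be written as a small check. In the sketch below, only the Shapiro-Wilk p-value of 0.128 comes from the handout; the skewness and kurtosis figures and the function name are hypothetical, and treating any single failed criterion as a concern is an assumption about how the rule is applied.

```python
def nonnormality_is_a_concern(skewness: float, se_skewness: float,
                              kurtosis: float, se_kurtosis: float,
                              shapiro_wilk_p: float,
                              n_se: float = 3.0, alpha: float = 0.001) -> bool:
    """Flag a concern if a coefficient is more than n_se standard errors
    from zero, or if the Shapiro-Wilk p-value falls below alpha."""
    skew_extreme = abs(skewness) > n_se * se_skewness
    kurt_extreme = abs(kurtosis) > n_se * se_kurtosis
    test_rejects = shapiro_wilk_p < alpha
    return skew_extreme or kurt_extreme or test_rejects


# Hypothetical skewness/kurtosis values; p = 0.128 is the value in the handout.
concern = nonnormality_is_a_concern(skewness=0.60, se_skewness=0.38,
                                    kurtosis=-0.40, se_kurtosis=0.74,
                                    shapiro_wilk_p=0.128)
print("Non-normality a concern?", concern)
```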

(f) From the Correlations table of the SPSS output, comment on the possibility of multicollinearity in the multiple regression.

The correlation matrix contains one correlation between independent variables a little greater than 0.8, suggesting a slight possibility that multicollinearity could be a problem.

(g) With a 0.05 significance level, summarize the results of the f-test in the ANOVA table.

Since f_{4, 34} = 39.149 exceeds the critical value f_{4, 34; 0.05} = 2.65, we have sufficient evidence to reject H0 at the 0.05 level. We conclude that the linear regression to predict wheat yield from total rainfall, average temperature, and type of soil is significant (p < 0.001).

(h) Use the SPSS output to find the least squares regression equation; then explain why all the dummy variables were not included in the model.

yield = -32.967 + 0.699(rain) + 0.156(temp) + 19.312(X2) + 16.182(X3)

The qualitative independent variable soil type has 3 categories, and only 2 dummy variables are needed to represent a qualitative variable with 3 categories.

(i) Indicate how the least squares regression equation in part (h) describes a separate regression equation for each category of the qualitative independent variable.

The 2 dummy variables from part (b) that were included in the model are X2 and X3.

For soil type A, X2 = X3 = 0 so that the least squares regression equation is

yield = -32.967 + 0.699(rain) + 0.156(temp)

For soil type B, X2 = 1 and X3 = 0 so that the least squares regression equation is

yield = -13.655 + 0.699(rain) + 0.156(temp)

For soil type C, X2 = 0 and X3 = 1 so that the least squares regression equation is

yield = -16.785 + 0.699(rain) + 0.156(temp)
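The three equations in part (i) differ only in their intercepts: each is the base intercept plus the coefficient of whichever dummy variable equals 1. The sketch below reproduces that arithmetic, assuming the intercept of the full model is negative (-32.967), which is the only sign consistent with the soil type B and C intercepts of 13.655 and 16.785 in magnitude. The names are illustrative.

```python
# Coefficients from the least squares equation in part (h).
INTERCEPT = -32.967
B_X2 = 19.312   # dummy for soil type B
B_X3 = 16.182   # dummy for soil type C


def intercept_for_soil(soil_type: str) -> float:
    """Intercept of the rain/temperature equation for one soil type."""
    x2 = 1 if soil_type == "B" else 0
    x3 = 1 if soil_type == "C" else 0
    return INTERCEPT + B_X2 * x2 + B_X3 * x3


for soil in ("A", "B", "C"):
    print(f"Soil type {soil}: intercept = {intercept_for_soil(soil):.3f}")
```

Soil type A is the baseline category, so the coefficients of X2 and X3 are the estimated differences in yield for soil types B and C relative to A at the same rainfall and temperature.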

(j) In the document titled "Using SPSS Version 19.0", use SPSS with the five instructions at the end of the section titled "Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions" to obtain the output for a stepwise regression.

(k) From the Collinearity Statistics section of the Coefficients table of the SPSS output, add to the comment on the possibility of multicollinearity in the multiple regression.

We see that tolerance > 0.10 (i.e., VIF < 10) for each independent variable, which is an indication that multicollinearity will not be a problem.

(l) From the Variables Entered/Removed table of the SPSS output, find the default values of the significance level to enter an independent variable into the model and the significance level to remove an independent variable from the model.

These are 0.05 and 0.10, respectively.

(m) From the Variables Entered/Removed table of the SPSS output, find the number of steps in the stepwise multiple regression, and list the independent variables selected and removed at each step.

There were three steps in the stepwise multiple regression; the variable rainfall was entered in the first step, the dummy variable X1 was entered in the second step, and the variable temperature was entered in the third step. No variables were removed at any step.

(n) From the Model Summary table of the SPSS output, find and interpret the change(s) in R^2 from the model at one step to the next step.

From the model at Step 1, we see that rainfall accounts for 41.1% of the variance in wheat yield.

From the model at Step 2, we see that rainfall and the indicator variable for soil type A together account for 78.5% of the variance in wheat yield. With rainfall already in the model, the indicator variable for soil type A accounts for an additional 37.4% of the variance in wheat yield.

From the model at Step 3, we see that rainfall, the indicator variable for soil type A, and temperature together account for 80.8% of the variance in wheat yield. With rainfall and the indicator variable for soil type A already in the model, temperature accounts for an additional 2.3% of the variance in wheat yield.
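The "additional" percentages in part (n) are simply differences between the cumulative R^2 values at consecutive steps. A brief sketch of that bookkeeping, using the percentages reported above (the list layout and names are illustrative):

```python
# (variable entered, cumulative R^2 in percent) for each stepwise step.
steps = [
    ("rainfall", 41.1),
    ("soil type A indicator (X1)", 78.5),
    ("temperature", 80.8),
]

increments = []
previous = 0.0
for variable, cumulative in steps:
    gain = round(cumulative - previous, 1)  # R^2 change at this step
    increments.append(gain)
    print(f"{variable}: cumulative {cumulative}%, additional {gain}%")
    previous = cumulative
```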

(o) From the Coefficients table of the SPSS output, write the estimated regression equation for each step.

Step 1

yield = 9.897 + 0.313(rain)

Step 2

yield = 9.503 + 0.762(rain) - 20.065(X1)

Step 3

yield = -16.571 + 0.691(rain) + 0.185(temp) - 17.352(X1)

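(p) For each of the estimated regression coefficients in the estimated regression equation from the final step of the stepwise multiple regression, write a one sentence interpretation describing what the coefficient estimates.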

For each increase of one inch in total rainfall, wheat yield increases on average by about 0.691 bushels per acre.

For each increase of one degree Fahrenheit in temperature, wheat yield increases on average by about 0.185 bushels per acre.

On average, wheat yield is about 17.352 bushels per acre lower with soil type A than with soil type B or C, for the same rainfall and temperature.

(q) Indicate how the estimated regression equation from the final step of the stepwise multiple regression to predict wheat yield describes separate regression equations for different groups.

For soil type A, X1 = 1 so that the least squares regression equation is

yield = -33.923 + 0.691(rain) + 0.185(temp)

For soil types B or C, X1 = 0 so that the least squares regression equation is

yield = -16.571 + 0.691(rain) + 0.185(temp)

(r) Use the estimated regression equation from the final step of the stepwise multiple regression to predict wheat yield in each of the following scenarios:

Total rainfall is 60 inches, average temperature is 65 degrees Fahrenheit, and soil type A is used.

Total rainfall is 60 inches, average temperature is 65 degrees Fahrenheit, and soil type B or C is used.
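The two scenarios can be worked by substituting into the final stepwise equation. The sketch below is one way to carry out that arithmetic; it assumes the intercept is -16.571 and the X1 coefficient is -17.352, which are the signs implied by the two equations in part (q). The function name is hypothetical.

```python
def predict_wheat_yield(rain_in: float, temp_f: float, soil_type: str) -> float:
    """Final stepwise model; X1 = 1 for soil type A and 0 for B or C."""
    x1 = 1 if soil_type == "A" else 0
    return -16.571 + 0.691 * rain_in + 0.185 * temp_f - 17.352 * x1


yield_soil_a = predict_wheat_yield(60, 65, "A")
yield_soil_b_or_c = predict_wheat_yield(60, 65, "B")

print(f"Soil type A:      {yield_soil_a:.3f} bushels per acre")
print(f"Soil type B or C: {yield_soil_b_or_c:.3f} bushels per acre")
```

Under these assumptions the two predictions differ by exactly the X1 coefficient, 17.352 bushels per acre.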

Exercise #3

The data stored in the SPSS data file aerobic is taken on randomly selected subjects in a study to predict IgG (milligrams of immunoglobulin in blood), which is an indicator of long-term immunity, from maximal oxygen uptake (milliliters per kilogram), which is a measure of aerobic fitness level.

(a) Identify the dependent (response) variable and the independent (explanatory) variable for a regression analysis.

The dependent (response) variable is IgG, and the independent (explanatory) variable is maximal oxygen uptake.

(b) Does the data appear to be observational or experimental?

The researcher has no control over the independent (explanatory) variable maximal oxygen uptake, making this observational data.

(c) In the document titled "Using SPSS Version 19.0", use SPSS with the section titled "Performing a simple linear regression with bivariate data, with checks of linearity, homoscedasticity, and normality assumptions" to do the following:

Follow the instructions in the first five steps to graph the least squares line on a scatter plot; then state why it might appear that the linearity assumption is not satisfied.

The data points do not appear to be randomly distributed around the least squares line. As maximal oxygen uptake increases, the IgG appears to increase at a slower rate.

(d) In the document titled "Using SPSS Version 19.0", use SPSS with the section titled "Creating new variables with transformation of existing variables" to create a new variable named maxoxysq which is equal to the square of the variable maximal oxygen uptake (named maxoxy in the data file).
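The transformation in part (d) simply squares each value of maxoxy so that the regression can include a curved (quadratic) term. For readers working outside SPSS, a minimal Python sketch of the same transformation is shown below; the maxoxy values are hypothetical and are not taken from the aerobic data file.

```python
# Hypothetical maximal oxygen uptake values (NOT from the aerobic data file).
maxoxy = [30.0, 40.0, 50.0]

# New variable: the square of each maxoxy value.
maxoxysq = [value ** 2 for value in maxoxy]

for x, x_squared in zip(maxoxy, maxoxysq):
    print(f"maxoxy = {x:5.1f}  ->  maxoxysq = {x_squared:7.1f}")
```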

(e) Both maxoxy and maxoxysq will be included in the regression model. In the document titled "Using SPSS Version 19.0", use SPSS with steps 8 to 15 in the section titled "Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions" to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, state why each of these assumptions appears to be satisfied.

The variation looks reasonably uniform.