Name ______
STAT 5372 – Lab Assignment #3
1. Access the SAS file cardata.sas from your e-mail or from my website http://faculty.smu.edu/waynew/stat5372s05.htm
2. Run PROC MEANS to verify that you have properly accessed the data.
(For example, there are 428 observations on horsepower (HP) and the mean is 215.8855. There are 414 readings on highway miles per gallon (HMPG) and the mean is 26.9058.)
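A minimal sketch of this check, assuming the data set is named cardata (as in step 6) and the variables are stored as hp, hmpg, and cmpg. The N and MEAN options only restrict the output to the two statistics quoted above.

```sas
/* Assumption: cardata.sas has been run and created a data set named cardata. */
PROC MEANS DATA=cardata N MEAN;
   VAR hp hmpg cmpg;   /* N and mean for each variable */
RUN;
```

If the N and mean columns for hp and hmpg match the values quoted above, the data have been read correctly.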
3. Find the correlation coefficients among the variables HP, HMPG, and CMPG (city miles per gallon).
PROC CORR;
VAR list of variables;
RUN;
(see Cody and Smith, 116-118.)
Fill in the blanks:
                 Correlation   p-value
HP vs HMPG       ______        ______
HP vs CMPG       ______        ______
HMPG vs CMPG     ______        ______
On the basis of these results, are these correlations significantly different from zero? Why or why not?
4. Use PROC GLM to find the regression line for predicting HMPG from HP using the commands
PROC GLM;
MODEL hmpg=hp;
OUTPUT out=new r=resid;
RUN;
Write the equation of the regression line below.
Equation:
Is there a significant relationship between HMPG and HP?
t-value = ______    p-value = ______
5. Use GPLOT to produce a scatter plot of HMPG vs HP along with a plot of RESID vs HP.
(See PROC PLOT, page 42- in Cody and Smith and last week’s class notes.)
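One way the two plots might be requested is sketched below. It assumes the data set new written by the OUTPUT statement in step 4, which holds the original variables plus the residuals resid. The SYMBOL1 statement and the VREF=0 reference line are illustrative choices, not requirements of the assignment.

```sas
/* Assumption: NEW is the output data set from the PROC GLM step in item 4. */
SYMBOL1 VALUE=dot;            /* draw each car as a dot */
PROC GPLOT DATA=new;
   PLOT hmpg*hp;              /* scatter plot: HMPG (vertical) vs HP (horizontal) */
   PLOT resid*hp / VREF=0;    /* residual plot with a reference line at zero */
RUN;
QUIT;
```

In a PLOT request the variable before the asterisk goes on the vertical axis and the one after it on the horizontal axis.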
Sketch the plots in the right-hand margin.
[Sketch axes: hmpg (vertical) vs hp (horizontal)]
On the basis of these plots:
Does the relationship appear to be linear? Explain.
[Sketch axes: resid (vertical) vs hp (horizontal)]
Do there appear to be any outliers? Explain.
6. You may notice that 3 cars have HMPG greater than 50. These cars are hybrid electric models. Since including such cars in an analysis of the relationship between HP and HMPG might not be appropriate, let’s consider the data set containing only those HMPG readings less than 50. Letting cardata denote the original data set (the one used in the GLM step above), you can create a “reduced” data set cardata2 using the commands: (CS 319 - )
DATA cardata2;
SET cardata;
if hmpg < 50;
RUN;
On this reduced data set run these analyses:
1. Use GLM to predict HMPG from HP.
2. Plot scatter plots of HMPG vs HP and of resid vs HP and sketch them here.
[Sketch axes: hmpg (vertical) vs hp (horizontal)]
[Sketch axes: resid (vertical) vs hp (horizontal)]
Is there evidence that the linear relationship has been strengthened? Explain.
7. R² is used as a measure of the strength of a relationship in regression analysis. R² is given by SS(Model)/SS(Total) and measures the proportion of the variability in Y that is explained by the regression. Clearly, the higher R² is, the better X does in explaining the variability of Y and thus the stronger the linear relationship. (In simple linear regression, R² is simply the square of the correlation coefficient.) From the GLM output, ______% of the variability in HMPG is explained by its regression with HP.
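The sums-of-squares form of this measure can be written out as follows. The symbols SSR, SSE, and SST are shorthand introduced here, not taken from the assignment; in the PROC GLM analysis-of-variance table they correspond to the Model, Error, and Corrected Total sums of squares, and r denotes the sample correlation between X and Y.

```latex
\[
\text{SST} = \text{SSR} + \text{SSE}, \qquad
R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad
R^2 = r^2 \ \text{(simple linear regression only)}
\]
```

Multiplying the R-Square value in the GLM output by 100 gives the percentage asked for in the blank.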
8. There is still some evidence of nonlinearity in the relationship between HP and HMPG. In order to improve the linearity, let’s consider transforming the X variable (HP). That is, even though there is a nonlinear relationship between HP and HMPG, there may be a transformation of HP for which the relationship is more nearly linear.
In order to examine these relationships, consider the following new X variables:
1. HP² (hp2)
2. √HP (hpsq)
3. log(HP), the natural logarithm (hplog)
You can use the following commands to set up a new data set (cardata3) containing the new variables starting with the “reduced” data set (cardata2).
DATA cardata3;
SET cardata2;
hp2=hp*hp;
hpsq=sqrt(hp);
hplog=log(hp);
RUN;
9. Find the equations of the 3 separate regression lines (using GLM) for predicting HMPG from HP2, HPSQ, and HPLOG. Write these regression equations below along with R²:
(a) Using HP²
[Sketch axes: hmpg vs hp2, and resid vs hp2]
R² = ______
Equation:
(b) Using √HP
[Sketch axes: hmpg vs √hp, and resid vs √hp]
R² = ______
Equation:
(c) Using log(HP)
[Sketch axes: hmpg vs log(hp), and resid vs log(hp)]
R² = ______
Equation:
10. Sketch the following plots on the axes in 9:
(a) scatter plots of HMPG vs the new independent variable
(b) residual plots
What can you say about these new independent variables from these plots, i.e. which appear to improve performance and which appear to hurt the linearity of the relationship?
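The fit and plots for one of the three transformed variables might be set up as sketched below, here for hp2. The output names new2 and resid2 are hypothetical, chosen only for illustration. The same pattern would be repeated with hpsq and hplog, using a different output data set name each time so that earlier residuals are not overwritten.

```sas
/* Assumption: cardata3 was created by the DATA step in item 8.
   NEW2 and RESID2 are hypothetical names chosen for this sketch. */
PROC GLM DATA=cardata3;
   MODEL hmpg=hp2;               /* regress HMPG on HP squared */
   OUTPUT out=new2 r=resid2;     /* save residuals for plotting */
RUN;
QUIT;

PROC GPLOT DATA=new2;
   PLOT hmpg*hp2;                /* scatter plot vs the new X variable */
   PLOT resid2*hp2 / VREF=0;     /* residual plot with a zero reference line */
RUN;
QUIT;
```

Naming the input data set with DATA= avoids depending on whichever data set happened to be created most recently.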
11. A special type of multiple regression involves simultaneously using the variables X and X² in the model. (That is, we are able to consider a general parabolic form for the prediction of Y.) This can be accomplished using the GLM commands
PROC GLM;
MODEL hmpg=hp hp2;
OUTPUT out=new22 r=resid22;
RUN;
(a) For this model, what is R²? ______. Compare this with the R² values for the models in 9.
(b) Sketch the residual plots below (vs hp and hp2 separately). How do these plots compare with the residual plots you found in 9 and 10?
[Sketch axes: resid vs hp, and resid vs hp2]