Lecture 17: Multiple Regression I (Chapter 6, set 4)
Diagnostics and Remedial Measures
Most of the diagnostic procedures for simple linear regression that we described in Chapter 3 carry over directly to multiple regression. Here we review these diagnostic procedures. Many more diagnostic and remedial procedures for multiple regression will be discussed in Chapters 10 and 11.
Scatter Plot Matrix and Three-Dimensional Scatter Plots
Example: The data in this example comes from problem 6.5 of your text.
data brand;
infile 'E:\teaching\STAT512\data\CH06PR05.txt';
input Y X1 X2;
run;

proc insight data=brand;
scatter Y X1 X2 * Y X1 X2;
run;

proc g3d data=brand;
scatter X1*X2=Y / noneedle shape='star' size=.3;
run;
Residual Plots
Type of graph / Departure from model (2.1) to be studied

Residual plots --
Plot the residuals e_i against each of the predictor variables, against interaction terms of the predictor variables (e.g. X1X2), or against the fitted values Yhat_i.
Departures studied: (1) the regression function is not linear; (2) the error terms do not have constant variance.

Studentized residual plots --
Plot the studentized residuals r_i = e_i / sqrt(MSE(1 - h_ii)) against each of the predictor variables, against interaction terms of the predictor variables, or against the fitted values Yhat_i.
Departures studied: (1) the regression function is not linear; (2) the error terms do not have constant variance; (3) the model fits all but one or a few outlier observations.

Time-sequence plot --
Plot e_i or r_i against time or some other sequence.
Departure studied: possible correlation between the error terms.

Normal probability plots --
(1) Plot the ordered residuals e_(k) against their expected values under normality, sqrt(MSE)*z[(k - .375)/(n + .25)].
(2) Plot e_(k) against the percentiles z[(k - .375)/(n + .25)].
Departure studied: the error terms are not normally distributed.

Note: z(A) is the inverse of the cumulative distribution function of the standard normal distribution, i.e. the 100A-th percentile.
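As a quick cross-check of the percentile formula (a Python sketch, separate from the SAS analysis; n = 16 matches the example data):

```python
from scipy.stats import norm

n = 16
# Blom-type plotting positions (k - .375)/(n + .25), k = 1..n,
# mapped through the standard normal inverse CDF z(.)
pct = [norm.ppf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]
```

These are the same values the SAS Probit function produces in the data step below; by symmetry of the plotting positions, pct[k] = -pct[n+1-k].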
proc reg data=brand;
model Y=X1 X2;
output out=brand2 R=Residual Student=Stud_Res Stdr=Std_Res;
   /* R= residual, Student= studentized residual, Stdr= std error of residual */
run;

symbol1 c=green v=star width=3;
title 'residual plot against X1';
axis2 label=('Residual');
proc gplot data=brand2;
plot Residual*X1=1 / vref=0 lvref=34 vaxis=axis2;
run;
Statistical tests for studying the appropriateness of the multiple regression model
Correlation Test for Normality and Normality Plot
proc sort data=brand2;
by Residual;
run;

data brand3;
do k=1 to 16;
Percntle=probit((k-.375)/16.25);  /* z[(k-.375)/(n+.25)] with n=16 */
output;
end;
run;

data brand4;
merge brand2 brand3;  /* pairs the kth smallest residual with the kth percentile */
Exp_Res=Std_Res*Percntle;  /* approximate expected value of the ordered residual */
run;

proc corr noprob data=brand4;
var Residual Exp_Res Stud_Res Percntle;
run;
symbol2 i=spline c=green v=none width=1;
symbol3 c=green value=star width=3;
axis2 label=('Studentized' j=r 'Residual');
title 'Normal probability plot';
proc gplot data=brand4;
plot Stud_Res*Percntle=3 Percntle*Percntle=2/overlay vaxis=axis2 ;
run;
Pearson Correlation Coefficients, N = 16
Residual Exp_Res Stud_Res Percntle
Residual 1.00000 0.99249 0.99957 0.99292
Exp_Res 0.99249 1.00000 0.99143 0.99956
Stud_Res 0.99957 0.99143 1.00000 0.99272
Percntle 0.99292 0.99956 0.99272 1.00000
Note: n = 16 in our example. The correlation between the ordered residuals and their expected values under normality is 0.99249; since this is close to 1 (compare with the critical value for the correlation test for normality given in the text), normality of the error terms appears tenable.
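The correlation test can be sketched in Python as well (a hedged illustration using simulated residuals, since the raw data are not reproduced here; MSE ≈ 7.25 is taken from the ANOVA table):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, mse = 16, 7.25                      # n and MSE from the example fit
# ordered residuals (simulated here for illustration)
resid = np.sort(rng.normal(scale=np.sqrt(mse), size=n))
k = np.arange(1, n + 1)
# expected values of the ordered residuals under normality
expected = np.sqrt(mse) * norm.ppf((k - 0.375) / (n + 0.25))
r = np.corrcoef(resid, expected)[0, 1]  # compare with the tabled critical value
```

For genuinely normal errors this correlation is close to 1, as in the SAS output above.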
Brown-Forsythe test for Constancy of Error Variance
This test can be used readily in multiple regression when the error variances increase or decrease with one of the predictor variables. To conduct this test, we divide the data set into 2 groups, where one group consists of cases where the levels of the predictor variable (or the fitted response) are relatively low and the other group consists of cases where the levels of the predictor variable (or the fitted response) are relatively high.
t*_BF = (dbar_1 - dbar_2) / [ s * sqrt(1/n_1 + 1/n_2) ] ~ t(n - 2) approximately, under constancy of error variance.
Where d_i1 = |e_i1 - etilde_1| and d_i2 = |e_i2 - etilde_2| (etilde_1 and etilde_2 are the medians of the residuals in groups 1 and 2), dbar_1 and dbar_2 are the group means of these absolute deviations, and s^2 = [ Sum(d_i1 - dbar_1)^2 + Sum(d_i2 - dbar_2)^2 ] / (n - 2).
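A minimal Python sketch of the two-group Brown-Forsythe statistic (the residual groups here are simulated for illustration, not taken from the example):

```python
import numpy as np
from scipy import stats

def brown_forsythe_t(e1, e2):
    """Two-group Brown-Forsythe t* computed from residual groups e1 and e2."""
    d1 = np.abs(e1 - np.median(e1))   # absolute deviations from group medians
    d2 = np.abs(e2 - np.median(e2))
    n1, n2 = len(d1), len(d2)
    # pooled variance of the absolute deviations
    s2 = (np.sum((d1 - d1.mean()) ** 2) +
          np.sum((d2 - d2.mean()) ** 2)) / (n1 + n2 - 2)
    t = (d1.mean() - d2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
    p = 2 * stats.t.sf(abs(t), n1 + n2 - 2)
    return t, p

rng = np.random.default_rng(0)
e_low = rng.normal(scale=1.0, size=8)   # residuals at low fitted values
e_high = rng.normal(scale=2.0, size=8)  # residuals at high fitted values
t_bf, p_bf = brown_forsythe_t(e_low, e_high)
```

For two groups, t*^2 equals the Brown-Forsythe statistic returned by scipy.stats.levene with center='median'.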
Breusch-Pagan test for Constancy of Error Variance
For the data in our example one possible BP test takes the variance model log(sigma_i^2) = gamma_0 + gamma_1*X_i1 + gamma_2*X_i2 and tests H0: gamma_1 = gamma_2 = 0. The test statistic is
X^2_BP = (SSR*/2) / (SSE/n)^2,
where SSE is the error sum of squares of the full multiple regression model and SSR* is the regression sum of squares obtained when the squared residuals e_i^2 are regressed on X1 and X2.
Under constancy of error variance, X^2_BP follows approximately a chi-square distribution with 2 degrees of freedom.
data bp;
set brand2;
res_sq=residual**2;
run;

proc reg data=bp;
model res_sq = X1 X2;   /* gives SSR* */
run;

proc reg data=brand;
model Y=X1 X2;          /* gives SSE */
run;

data;
SSR_Star=72.407;
SSE=94.3;
N=16;
BP=(SSR_Star/2)/((SSE/N)**2);
PV=1-probchi(BP,2);
keep N SSE SSR_Star BP PV;
run;

proc print;
run;
Analysis of Variance (regression of Y on X1 X2)
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 1872.70000 936.35000 129.08 <.0001
Error 13 94.30000 7.25385
Corrected Total 15 1967.00000

Analysis of Variance (regression of res_sq on X1 X2)
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 72.40700 36.20350 0.95 0.4113
Error 13 494.34723 38.02671
Corrected Total 15 566.75423
Obs N SSE SSR_Star BP PV
1 16 94.3 72.407 1.04224 0.59386
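The printed BP statistic and p-value can be double-checked outside SAS; a quick Python equivalent using the sums of squares shown above:

```python
from scipy.stats import chi2

SSR_star = 72.407      # regression SS from regressing e_i**2 on X1, X2
SSE, n = 94.3, 16      # error SS and sample size of the original fit
BP = (SSR_star / 2) / (SSE / n) ** 2
p_value = chi2.sf(BP, df=2)   # same as 1 - probchi(BP, 2) in SAS
```

The large p-value is consistent with constant error variance.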
F Test for Lack of Fit
Consider the following hypotheses:
H0: E{Y} = beta_0 + beta_1*X_1 + beta_2*X_2 versus Ha: E{Y} != beta_0 + beta_1*X_1 + beta_2*X_2.
The appropriate test statistic is:
F* = [SSLF/(c - p)] / [SSPE/(n - c)] = MSLF/MSPE, where SSLF = SSE(R) - SSE(F) and SSPE = SSE(F).
SSE(R) is the SSE of the reduced model: Y_i = beta_0 + beta_1*X_i1 + beta_2*X_i2 + eps_i, with df = n - p.
SSE(F) is the SSE of the full model: Y_ij = mu_j + eps_ij, with df = n - c.
Here c is the number of groups into which the data are divided; within each group the levels of the X vector are the same or close (the values of each X variable are the same or close).
The formula to calculate SSE(F) is:
SSE(F) = SSPE = Sum_j Sum_i (Y_ij - Ybar_j)^2.
The SAS code to obtain SSE(F) is:
proc glm;
class X1 X2 … X(p-1);
model Y=X1*X2*…*X(p-1);
run;
Under H0, F* ~ F(c - p, n - c).
Decision rule 1: conclude H0 if F* <= F(1 - alpha; c - p, n - c); conclude Ha if F* > F(1 - alpha; c - p, n - c).
Decision rule 2: conclude H0 if the p-value >= alpha; conclude Ha if the p-value < alpha.
In our example, c = 8, p = 3 and n = 16. Hence under H0, F* ~ F(5, 8).
proc reg data=brand;
model Y=X1 X2;
run;

proc glm data=brand;
class X1 X2;
model Y=X1*X2;
run;

data;
SSPE=57;
SSE=94.3;
SSLF=SSE-SSPE;
F=(SSLF/5)/(SSPE/8);
PV=1-probf(F,5,8);
keep SSPE SSLF F PV;
run;

proc print;
run;
The REG Procedure
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 1872.70000 936.35000 129.08 <.0001
Error 13 94.30000 7.25385
Corrected Total 15 1967.00000
Root MSE 2.69330 R-Square 0.9521
Dependent Mean 81.75000 Adj R-Sq 0.9447
Coeff Var 3.29455
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 37.65000 2.99610 12.57 <.0001
X1 1 4.42500 0.30112 14.70 <.0001
X2 1 4.37500 0.67332 6.50 <.0001
The GLM Procedure
Dependent Variable: Y
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 7 1910.000000 272.857143 38.30 <.0001
Error 8 57.000000 7.125000
Corrected Total 15 1967.000000
R-Square Coeff Var Root MSE Y Mean
0.971022 3.265162 2.669270 81.75000
Source DF Type I SS Mean Square F Value Pr > F
X1*X2 7 1910.000000 272.857143 38.30 <.0001
Source DF Type III SS Mean Square F Value Pr > F
X1*X2 7 1910.000000 272.857143 38.30 <.0001
Obs SSPE SSLF F PV
1 57 37.3 1.04702 0.45301
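The lack-of-fit numbers above can likewise be reproduced in Python:

```python
from scipy.stats import f

SSE, SSPE = 94.3, 57.0   # from PROC REG and PROC GLM above
n, c, p = 16, 8, 3       # cases, distinct X groups, regression parameters
SSLF = SSE - SSPE
F_star = (SSLF / (c - p)) / (SSPE / (n - c))
p_value = f.sf(F_star, c - p, n - c)   # same as 1 - probf(F, 5, 8) in SAS
```

Since the p-value is large, there is no evidence of lack of fit of the first-order model.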
Comment: When replications are not available, an approximate lack-of-fit test can be conducted if there are cases that have similar X vectors. These cases are grouped together and treated as pseudoreplicates, and the test for lack of fit is then carried out using these groupings of similar cases.
Remedial Measures
1. The multiple regression model can be expanded to include squared and higher-order terms of the predictor variables to capture curvature effects.
2. The multiple regression model can be expanded to include interaction terms of the predictor variables.
3. Transformations of the response and/or the predictor variables can be made, following the principles discussed in Chapter 3. Transforming the response may be helpful when the error distribution is skewed and the variance of the error terms is not constant. Transforming the predictor variables may be helpful when the effects of these variables are curvilinear.
4. The Box-Cox procedure for determining an appropriate power transformation of Y is also applicable to multiple regression models.
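Point 4 can be sketched with scipy's Box-Cox routine (the data here are simulated and illustrative, not the brand example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# positive, right-skewed response, for which a power transformation is natural
y = rng.lognormal(mean=2.0, sigma=0.5, size=50)
# maximum-likelihood estimate of the Box-Cox power parameter lambda
y_trans, lmbda = stats.boxcox(y)
```

An estimate of lambda near 0 corresponds to a log transformation; once Y is transformed, the multiple regression is refit and the diagnostics above are repeated.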