Lecture 17: Multiple Regression I (Chapter 6, set 4)

Diagnostics and Remedial Measures

Most of the diagnostic procedures for simple linear regression that we described in Chapter 3 carry over directly to multiple regression. We now review these diagnostic procedures. Many more diagnostic and remedial procedures for multiple regression will be discussed in Chapters 10 and 11.

Scatter Plot Matrix and Three-Dimensional Scatter Plots

Example: The data in this example comes from problem 6.5 of your text.

data brand;
  infile 'E:\teaching\STAT512\data\CH06PR05.txt';
  input Y X1 X2;
run;

proc insight data=brand;
  scatter Y X1 X2 * Y X1 X2;
run;

proc g3d data=brand;
  scatter X1*X2=Y / noneedle shape='star' size=.3;
run;

Residual Plots

Type of graph / Departure from model (2.1) to be studied

Residual plots --
Plot the residuals e_i against each of the predictor variables, against interaction terms of the predictor variables, or against the fitted values Y-hat_i. / 1. The regression function is not linear.
2. The error terms do not have constant variance.

Studentized residual plots --
Plot the studentized residuals r_i against each of the predictor variables, against interaction terms of the predictor variables, or against the fitted values Y-hat_i. / 1. The regression function is not linear.
2. The error terms do not have constant variance.
3. The model fits all but one or a few outlier observations.

Time-sequence plot --
Plot e_i or r_i against time or some other sequence. / 1. Possible correlations between the error terms.

Normal probability plots --
1. Plot each ordered residual e_(k) against its expected value under normality, sqrt(MSE) * z[(k - .375)/(n + .25)].
2. Plot each ordered studentized residual r_(k) against z[(k - .375)/(n + .25)]. / 1. The error terms are not normally distributed.

Note: z(A) is the inverse of the cumulative distribution function of the standard normal distribution, i.e., the 100A-th percentile.
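The percentiles in the note can be computed outside SAS as well; in this Python sketch, `statistics.NormalDist().inv_cdf` plays the role of SAS's `probit` function, with n = 16 as in the example below:

```python
from statistics import NormalDist

n = 16
probit = NormalDist().inv_cdf  # inverse standard normal CDF, like SAS probit()

# Expected percentile for the k-th smallest residual, k = 1..n
percentiles = [probit((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]
# Multiplying each entry by sqrt(MSE) gives the expected value of the
# corresponding ordered residual under normality.
```

Plotting the ordered residuals against these percentiles (scaled by sqrt(MSE)) gives the normal probability plot.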
proc reg data=brand;
  model Y=X1 X2;
  output out=brand2 R=Residual Student=Stud_Res Stdr=Std_Res;
run;

symbol1 c=green v=star width=3;
title 'Residual plot against X1';
axis2 label=('Residual');
proc gplot data=brand2;
  plot Residual*X1=1 / vref=0 lvref=34 vaxis=axis2;
run;

Statistical Tests for Studying the Appropriateness of the Multiple Regression Model

Correlation Test for Normality and Normality Plot
proc sort data=brand2;
  by Residual;
run;
data brand3;
  do k=1 to 16;
    Percntle=probit((k-.375)/16.25);
    output;
  end;
run;
data brand4;
  merge brand2 brand3;
  Exp_Res=Std_Res*Percntle;
run;

proc corr noprob data=brand4;
  var Residual Exp_Res Stud_Res Percntle;
run;
symbol2 i=spline c=green v=none width=1;
symbol3 c=green value=star width=3;
axis2 label=('Studentized' j=r 'Residual');
title 'Normal probability plot';
proc gplot data=brand4;
plot Stud_Res*Percntle=3 Percntle*Percntle=2/overlay vaxis=axis2 ;
run;

Pearson Correlation Coefficients, N = 16

           Residual   Exp_Res   Stud_Res   Percntle
Residual    1.00000   0.99249    0.99957    0.99292
Exp_Res     0.99249   1.00000    0.99143    0.99956
Stud_Res    0.99957   0.99143    1.00000    0.99272
Percntle    0.99292   0.99956    0.99272    1.00000

Note: n = 16 in our example. The correlation between the ordered residuals and their expected values under normality is 0.99249; this is compared with the critical value in Table B.6 of the text, and a correlation this close to 1 supports the normality assumption.

Brown-Forsythe test for Constancy of Error Variance

This test can be used readily in multiple regression when the error variance increases or decreases with one of the predictor variables. To conduct this test, we divide the data set into two groups: one group consists of cases where the levels of the predictor variable (or the fitted response) are relatively low, and the other group consists of cases where the levels are relatively high.

t*_BF = (dbar_1 - dbar_2) / ( s * sqrt(1/n_1 + 1/n_2) ) ~ t(n - 2), under constancy of error variance,

where d_i1 = |e_i1 - median{e_i1}|, d_i2 = |e_i2 - median{e_i2}|, and

s^2 = [ sum_i (d_i1 - dbar_1)^2 + sum_i (d_i2 - dbar_2)^2 ] / (n - 2).
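As an illustration, the Brown-Forsythe statistic above can be computed with a short Python sketch (the two residual groups used here are hypothetical, made-up numbers, not the brand data):

```python
from math import sqrt
from statistics import mean, median

def brown_forsythe_t(e1, e2):
    """Brown-Forsythe t* for two groups of residuals.

    d_i = |e_i - group median|; t* = (dbar1 - dbar2) / (s * sqrt(1/n1 + 1/n2)),
    where s^2 pools the squared deviations of the d's with n - 2 df.
    """
    d1 = [abs(e - median(e1)) for e in e1]
    d2 = [abs(e - median(e2)) for e in e2]
    db1, db2 = mean(d1), mean(d2)
    n1, n2 = len(d1), len(d2)
    s2 = (sum((d - db1) ** 2 for d in d1) +
          sum((d - db2) ** 2 for d in d2)) / (n1 + n2 - 2)
    return (db1 - db2) / (sqrt(s2) * sqrt(1 / n1 + 1 / n2))

# Hypothetical residuals from the "low" and "high" halves of a data set:
t_bf = brown_forsythe_t([-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0])
```

Compare |t*| with t(1 - alpha/2; n - 2); a large |t*| indicates nonconstant error variance.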

Breusch-Pagan test for Constancy of Error Variance

For the data in our example one possible BP test statistic is

X^2_BP = (SSR* / 2) / (SSE / n)^2,

where SSE is the error sum of squares for the full multiple regression model of Y on X1 and X2, and SSR* is the regression sum of squares for the model in which the squared residuals e_i^2 are regressed on X1 and X2.

Under constancy of error variance, X^2_BP follows approximately a chi-square distribution with 2 degrees of freedom (the number of predictors in the model for e_i^2).

data bp;
  set brand2;
  res_sq=residual**2;
run;
proc reg data=bp;
  model res_sq = X1 X2;
run;
proc reg data=brand;
  model Y=X1 X2;
run;

data;
  SSR_Star=72.407;
  SSE=94.3;
  N=16;
  BP=(SSR_Star/2)/((SSE/N)**2);
  PV=1-probchi(BP,2);
  keep N SSE SSR_Star BP PV;
run;
proc print;
run;

Analysis of Variance (Y regressed on X1, X2)

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       1872.70000     936.35000    129.08   <.0001
Error             13         94.30000       7.25385
Corrected Total   15       1967.00000

Analysis of Variance (squared residuals regressed on X1, X2)

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         72.40700      36.20350      0.95   0.4113
Error             13        494.34723      38.02671
Corrected Total   15        566.75423

Obs    N    SSE   SSR_Star        BP        PV
  1   16   94.3     72.407   1.04224   0.59386
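The arithmetic in the BP DATA step can be double-checked outside SAS. A minimal Python sketch, using the fact that for 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(-x/2), so no statistics library is needed:

```python
from math import exp

# Quantities read off the two ANOVA tables above
ssr_star = 72.407   # SSR from regressing e^2 on X1, X2
sse = 94.3          # SSE from the full model for Y
n = 16

bp = (ssr_star / 2) / (sse / n) ** 2
# For a chi-square with 2 df, P(X > x) = exp(-x/2)
p_value = exp(-bp / 2)

print(round(bp, 5), round(p_value, 5))  # -> 1.04224 0.59386, matching SAS
```

Since the P-value 0.59386 is large, constancy of the error variance is not contradicted.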

F Test for Lack of Fit

Consider the following testing hypotheses:

H0: E{Y} = beta_0 + beta_1*X_1 + beta_2*X_2 versus Ha: E{Y} != beta_0 + beta_1*X_1 + beta_2*X_2.

The appropriate test statistic is:

F* = [ (SSE(R) - SSE(F)) / (df_R - df_F) ] / [ SSE(F) / df_F ] = (SSLF / (c - p)) / (SSPE / (n - c)).

SSE(R) is the SSE of the reduced model

Y_ij = beta_0 + beta_1*X_ij1 + beta_2*X_ij2 + eps_ij, with df_R = n - p.

SSE(F) is the SSE of the full model

Y_ij = mu_j + eps_ij, with df_F = n - c.

Here c is the number of groups into which the data are divided. In each group, the levels of the X vector are the same or close (the values of each X variable are the same or close).

The formula to calculate SSE(F), the pure error sum of squares SSPE, is:

SSE(F) = SSPE = sum_j sum_i ( Y_ij - Ybar_j )^2.
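For concreteness, the pure error sum of squares can be computed by grouping cases with identical X vectors and summing squared deviations from each group mean; a small Python sketch with made-up data:

```python
from statistics import mean

def sspe(xs, ys):
    """Pure error SS: group cases by identical X vector, then
    sum squared deviations of Y from its group mean."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(tuple(x), []).append(y)
    return sum(sum((y - mean(g)) ** 2 for y in g) for g in groups.values())

# Made-up data: the two replicates at X = (1, 1) contribute (5-6)^2 + (7-6)^2 = 2
x_rows = [(1, 1), (1, 1), (2, 1)]
y_vals = [5.0, 7.0, 9.0]
print(sspe(x_rows, y_vals))  # -> 2.0
```

Groups with a single case contribute zero, which is why SSPE has n - c degrees of freedom.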

The SAS code to obtain SSE(F) is:

proc glm;
  class X1 X2 … ;
  model Y = X1*X2*… ;
run;

Under H0, F* ~ F(c - p, n - c).

Decision rule 1: Accept H0 if F* <= F(1 - alpha; c - p, n - c); reject H0 if F* > F(1 - alpha; c - p, n - c).
Decision rule 2: Accept H0 if P-value >= alpha; reject H0 if P-value < alpha.

In the following example, c = 8, p = 3, and n = 16. Hence, under H0, F* ~ F(5, 8).

proc reg data=brand;
  model Y=X1 X2;
run;
proc glm data=brand;
  class X1 X2;
  model Y=X1*X2;
run;

data;
  SSPE=57; SSE=94.3;
  SSLF=SSE-SSPE;
  F=(SSLF/5)/(SSPE/8);
  PV=1-probf(F,5,8);
  keep SSPE SSLF F PV;
run;
proc print;
run;

The REG Procedure

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       1872.70000     936.35000    129.08   <.0001
Error             13         94.30000       7.25385
Corrected Total   15       1967.00000

Root MSE          2.69330    R-Square    0.9521
Dependent Mean   81.75000    Adj R-Sq    0.9447
Coeff Var         3.29455

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             37.65000          2.99610      12.57     <.0001
X1           1              4.42500          0.30112      14.70     <.0001
X2           1              4.37500          0.67332       6.50     <.0001

The GLM Procedure

Dependent Variable: Y

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              7      1910.000000    272.857143     38.30   <.0001
Error              8        57.000000      7.125000
Corrected Total   15      1967.000000

R-Square    Coeff Var   Root MSE     Y Mean
0.971022     3.265162   2.669270   81.75000

Source   DF     Type I SS   Mean Square   F Value   Pr > F
X1*X2     7   1910.000000    272.857143     38.30   <.0001

Source   DF   Type III SS   Mean Square   F Value   Pr > F
X1*X2     7   1910.000000    272.857143     38.30   <.0001

Obs   SSPE   SSLF         F        PV
  1     57   37.3   1.04702   0.45301
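The lack-of-fit arithmetic can also be checked by hand; a minimal Python sketch (the F p-value is left to SAS's probf function, since it has no simple closed form):

```python
# Numbers read off the REG and GLM output above
sse = 94.3    # SSE(R) from the reduced model, df = n - p = 13
sspe = 57.0   # SSE(F) = SSPE from the full (cell means) model, df = n - c = 8

sslf = sse - sspe                  # lack-of-fit SS, df = c - p = 5
f_star = (sslf / 5) / (sspe / 8)   # F* = MSLF / MSPE

print(round(sslf, 1), round(f_star, 5))  # -> 37.3 1.04702, matching SAS
```

Since the P-value 0.45301 exceeds any usual alpha, the linear regression function is not contradicted by the lack-of-fit test.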

Comment: When replications are not available, an approximate lack of fit test can be conducted if there are cases that have similar X vectors. These cases are grouped together and treated as pseudoreplicates, and the test for lack of fit is then carried out using these groupings of similar cases.

Remedial Measures

1.  The multiple regression model can be expanded to include squared and higher-order terms of the predictor variables to recognize curvature effects.

2.  The multiple regression can be expanded to include the interaction terms of the predictor variables.

3.  Transformations of the response and/or the predictor variables can be made, following the principles discussed in Chapter 3. A transformation of the response may be helpful when the error distribution is skewed and the variance of the error terms is not constant. Transformations of the predictor variables may be helpful when the effects of these variables are curvilinear.

4.  The Box-Cox procedure for determining an appropriate power transformation of Y is also applicable to multiple regression models.
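To make remedies 1 and 2 concrete, here is a hypothetical Python sketch that expands a two-predictor design row with squared and interaction terms (the function name and layout are illustrative, not from the text):

```python
def expand_row(x1, x2):
    """Expand (X1, X2) into [1, X1, X2, X1^2, X2^2, X1*X2]:
    intercept, linear terms, squared terms, and the interaction."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2]

# One case with X1 = 2, X2 = 3:
print(expand_row(2.0, 3.0))  # -> [1.0, 2.0, 3.0, 4.0, 9.0, 6.0]
```

Fitting the regression on these expanded rows corresponds to adding the squared and interaction terms to the model statement.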