Analysis of Covariance (ANCOVA)

Using SAS

(commands = day3_finan_ancova.sas)

The commands in this handout use the dataset CARS.sas7bdat.

/*************************************

ANALYSIS OF COVARIANCE (ANCOVA)

COMMANDS=DAY3_FINAN_ANCOVA.SAS

**************************************/

/*Specify generic output using Options Formchar= */

options formchar="|----|+|---+=|-/\>*";

title;

/* INDICATE LIBRARY CONTAINING PERMANENT SAS DATA SET "CARS" */

libname sasdata2 V9 "C:\temp\sasdata2";

/* FROM REGRESSION ANALYSIS SECTION WE KNOW THAT THE LOG

TRANSFORMED DEPENDENT VARIABLE LOGMPG PERFORMS BETTER

THAN THE ORIGINAL VARIABLE MPG.

THE OUTLIER WITH YEAR=0 IS EXCLUDED FROM THE DATASET. */

data cars;

set sasdata2.cars;

logmpg = log(mpg);

if year ne 0;

run;
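A quick look at the new data set can confirm that the transformation and the exclusion did what was intended. The step below is an optional check and is not part of the original program; it uses only the variables named above.

```sas
/* Optional check (not in the original program):         */
/* counts, missing values, and ranges after the DATA step */
proc means data=cars n nmiss min max;
  var mpg logmpg year;
run;
```

The minimum of YEAR should no longer be 0. Because a missing value satisfies `year ne 0` in SAS, observations with a missing YEAR (if there are any) are kept by the subsetting IF.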

/* VEHICLE WEIGHT IS A SIGNIFICANT PREDICTOR FOR MPG. */

proc reg data=cars;

model logmpg = weight;

run; quit;

The REG Procedure

Model: MODEL1

Dependent Variable: logmpg

Number of Observations Read 405

Number of Observations Used 397

Number of Observations with Missing Values 8

Analysis of Variance

                             Sum of        Mean
Source             DF       Squares      Square   F Value   Pr > F
Model               1      34.37453    34.37453   1280.17   <.0001
Error             395      10.60636     0.02685
Corrected Total   396      44.98090

Root MSE          0.16386    R-Square   0.7642
Dependent Mean    3.10366    Adj R-Sq   0.7636
Coeff Var         5.27971

Parameter Estimates

                                           Parameter     Standard
Variable    Label                   DF      Estimate        Error   t Value   Pr > |t|
Intercept   Intercept                1       4.13994      0.03011    137.50     <.0001
WEIGHT      Vehicle Weight (lbs.)    1   -0.00034939   0.00000977    -35.78     <.0001
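Because the response is a natural log (the SAS LOG function), the slope is easiest to read on the multiplicative scale. The lines below are a worked reading of the estimates printed above; the 1,000 lb increment is chosen only for illustration.

```latex
\[
\widehat{\log(\mathrm{MPG})} = 4.13994 - 0.00034939 \times \mathrm{WEIGHT}
\]
\[
\text{Per additional 1000 lb: } \exp(1000 \times -0.00034939) = \exp(-0.34939) \approx 0.705
\]
\[
R^2 = \frac{SS_{\mathrm{Model}}}{SS_{\mathrm{Total}}} = \frac{34.37453}{44.98090} \approx 0.7642
\]
```

So each additional 1,000 lb of vehicle weight is associated with a predicted MPG about 29.5% lower, and WEIGHT alone accounts for about 76% of the variation in LOGMPG.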

In the ANOVA section we saw that MPG is also significantly related to ORIGIN. When we add ORIGIN to the regression model along with WEIGHT, we are fitting an analysis of covariance model. Analysis of covariance, or ANCOVA, is the name given to a linear model whose predictors include both continuous and categorical variables.

This section shows two ways of fitting an ANCOVA model in SAS: Proc Reg and Proc GLM (General Linear Models).

/* CREATE DUMMY VARIABLES FOR ORIGIN */

data cars;

set cars;

if origin ne . then do;

American = (origin = 1);

European = (origin = 2);

Japanese = (origin = 3);

end;

run;

/* CHECK THE RECODING */

proc freq data=cars;

tables origin American European Japanese;

run;

proc sort data=cars;

by origin;

run;

proc print data=cars;

var logmpg weight origin American European Japanese;

run;
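A more compact way to check the recoding is to cross ORIGIN with the three dummy variables in a single list-format table. This is an optional alternative to the one-way tables and the listing above, not part of the original program.

```sas
/* Optional check (not in the original program):           */
/* one row per combination of ORIGIN and the three dummies */
proc freq data=cars;
  tables origin*American*European*Japanese / list missing;
run;
```

If the recoding is correct, each nonmissing value of ORIGIN appears in exactly one row, with a 1 in the matching dummy and 0 in the other two. The MISSING option also shows any observations where ORIGIN and the dummies are all missing.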

/* SCATTERPLOT OF LOGMPG AGAINST WEIGHT FOR EACH ORIGIN */

goptions reset=all;

symbol1 color=black value = dot line=1 interpol=rl;

symbol2 color=black value = plus line=2 interpol=rl;

symbol3 color=black value = circle line=3 interpol=rl;

proc gplot data=cars;

plot logmpg * weight= origin ;

run; quit;

/* THE SCATTERPLOT SUGGESTS THAT THERE MAY BE AN INTERACTION

BETWEEN ORIGIN AND WEIGHT.*/
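The same picture can be drawn without SYMBOL statements by using Proc SGPLOT. The sketch below is an alternative to the Proc GPLOT step, assuming a SAS release that includes the statistical graphics procedures (SAS 9.2 or later); it is not part of the original program.

```sas
/* Alternative sketch (not in the original program):           */
/* scatterplot with a separate fitted line for each ORIGIN     */
proc sgplot data=cars;
  reg x=weight y=logmpg / group=origin;
run;
```

Fitted lines that are clearly not parallel across the three origins would point to an ORIGIN by WEIGHT interaction.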

/* CREATE INTERACTION TERMS.*/

data cars2;

set cars;

ame_wt = American * weight;

eur_wt = European * weight;

jap_wt = Japanese * weight;

run;

When we fit an ANCOVA model in PROC REG, we need to create dummy variables for the categorical variable ourselves. If a categorical variable has m levels, only m-1 dummy variables are needed in the regression model. The level whose dummy variable is left out serves as the reference level.
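Written out, the model with Japanese as the reference level and with the interaction terms created above takes the following form. The coefficient symbols are notation introduced here for illustration; the variable names are the ones defined in the DATA steps.

```latex
\[
\mathrm{logmpg} = \beta_0 + \beta_1\,\mathrm{American} + \beta_2\,\mathrm{European}
 + \beta_3\,\mathrm{weight} + \beta_4\,\mathrm{ame\_wt} + \beta_5\,\mathrm{eur\_wt} + \varepsilon
\]
\begin{align*}
\text{Japanese:} \quad & E(\mathrm{logmpg}) = \beta_0 + \beta_3\,\mathrm{weight} \\
\text{American:} \quad & E(\mathrm{logmpg}) = (\beta_0 + \beta_1) + (\beta_3 + \beta_4)\,\mathrm{weight} \\
\text{European:} \quad & E(\mathrm{logmpg}) = (\beta_0 + \beta_2) + (\beta_3 + \beta_5)\,\mathrm{weight}
\end{align*}
```

The dummy-variable coefficients shift the intercept relative to Japanese cars, and the interaction coefficients shift the slope. Setting the two interaction coefficients to zero gives three parallel lines, which is the model without interaction.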

/* ANCOVA IN PROC REG: INCLUDES ONE CONTINUOUS VARIABLE, TWO DUMMY VARIABLES FOR ORIGIN, AND THE INTERACTIONS. DUMMY VARIABLE JAPANESE IS LEFT OUT AND IS AUTOMATICALLY SET AS THE REFERENCE LEVEL. */

proc reg data=cars2;

model logmpg = American European weight ame_wt eur_wt ;

plot rstudent. * predicted.;

run; quit;

The REG Procedure

Model: MODEL1

Dependent Variable: logmpg

Number of Observations Read 405

Number of Observations Used 397

Number of Observations with Missing Values 8

Analysis of Variance

                             Sum of        Mean
Source             DF       Squares      Square   F Value   Pr > F
Model               5      34.61181     6.92236    261.03   <.0001
Error             391      10.36909     0.02652
Corrected Total   396      44.98090

Root MSE          0.16285    R-Square   0.7695
Dependent Mean    3.10366    Adj R-Sq   0.7665
Coeff Var         5.24696

Parameter Estimates

                                           Parameter     Standard
Variable    Label                   DF      Estimate        Error   t Value   Pr > |t|
Intercept   Intercept                1       4.22024      0.12910     32.69     <.0001
American                             1      -0.14262      0.13676     -1.04     0.2977
European                             1      -0.24211      0.16262     -1.49     0.1373
WEIGHT      Vehicle Weight (lbs.)    1   -0.00037138   0.00005753     -6.46     <.0001
ame_wt                               1    0.00003697   0.00005900      0.63     0.5314
eur_wt                               1    0.00009178   0.00007007      1.31     0.1910

Studentized residuals are plotted vs. predicted values to check for constant variance.

Neither interaction term is significant, so both are dropped and a simpler model is fitted with WEIGHT and two dummy variables, AMERICAN and EUROPEAN. We also include TEST statements in the SAS code to test the significance of the factor ORIGIN, which enters this model as two dummy variables (AMERICAN and EUROPEAN), and of the individual variable WEIGHT. We will compare the F-tests from these TEST statements in the Proc Reg output with the Type III F-tests from Proc GLM.
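The t-tests in the output above examine each interaction term separately. A joint test of both terms can be requested with a TEST statement in the interaction model. The sketch below is an optional addition, and the label Interact is a name chosen here for illustration.

```sas
/* Optional sketch (not in the original program):            */
/* joint 2-df test that both interaction slopes are zero     */
proc reg data=cars2;
  model logmpg = American European weight ame_wt eur_wt;
  Interact: test ame_wt=0, eur_wt=0;
run; quit;
```

The same F statistic can be worked out by hand from the error sums of squares printed in this handout for the models with and without interactions: ((10.42773 - 10.36909) / 2) / 0.02652 is about 1.11 on 2 and 391 degrees of freedom, a small value that agrees with dropping the interaction terms.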

/* REMOVE INSIGNIFICANT INTERACTIONS FROM THE MODEL */

proc reg data=cars2;

model logmpg = American European weight ;

plot rstudent. * predicted.;

Test1: test American=0, European=0;

Test2: test weight=0;

run; quit;

Analysis of Variance

                             Sum of        Mean
Source             DF       Squares      Square   F Value   Pr > F
Model               3      34.55316    11.51772    434.08   <.0001
Error             393      10.42773     0.02653
Corrected Total   396      44.98090

Root MSE          0.16289    R-Square   0.7682
Dependent Mean    3.10366    Adj R-Sq   0.7664
Coeff Var         5.24837

Parameter Estimates

                                           Parameter     Standard
Variable    Label                   DF      Estimate        Error   t Value   Pr > |t|
Intercept   Intercept                1       4.13055      0.03265    126.53     <.0001
American                             1      -0.06439      0.02517     -2.56     0.0109
European                             1      -0.02786      0.02685     -1.04     0.3001
WEIGHT      Vehicle Weight (lbs.)    1   -0.00033101   0.00001216    -27.21     <.0001

Test Test1 Results for Dependent Variable logmpg

                            Mean
Source           DF       Square   F Value   Pr > F
Numerator         2      0.08931      3.37   0.0355
Denominator     393      0.02653

Test Test2 Results for Dependent Variable logmpg

                            Mean
Source           DF       Square   F Value   Pr > F
Numerator         1     19.65119    740.61   <.0001
Denominator     393      0.02653
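The F values in the two test tables can be traced back to numbers already printed above. The lines below are a worked check using the rounded values from the output, so the last digits agree only approximately.

```latex
\[
F_{\mathrm{Test1}} = \frac{MS_{\mathrm{Numerator}}}{MS_{\mathrm{Denominator}}}
 = \frac{0.08931}{0.02653} \approx 3.37 \quad \text{on } (2,\ 393) \text{ df}
\]
\[
F_{\mathrm{Test2}} = t_{\mathrm{WEIGHT}}^{2} \approx (-27.21)^2 \approx 740.4
 \quad \text{(reported as 740.61; the gap comes from rounding } t\text{)}
\]
```

For a test with one numerator degree of freedom, the F statistic is the square of the corresponding t statistic, which is why Test2 and the t-test for WEIGHT give the same p-value.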

Now let’s fit an ANCOVA model in Proc GLM. In Proc GLM, you need to use a class statement to name the categorical variable(s) before the model statement. By default, the last level is set as the reference level. In our case, the last level of ORIGIN is 3 (=Japanese).
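If a different reference level is wanted, the CLASS statement can be told which level to treat as the reference. The sketch below assumes a SAS/STAT release in which the CLASS statement of Proc GLM accepts the REF= option (it is not available in older releases); it is not part of the original program.

```sas
/* Sketch, assuming REF= is supported on the CLASS statement   */
/* in the installed release: use the first level (origin = 1,  */
/* American) as the reference instead of the last              */
proc glm data=cars;
  class origin (ref=first);
  model logmpg = origin weight / solution;
run; quit;
```

Changing the reference level changes the individual ORIGIN estimates and the intercept, but not the overall fit or the F-tests for ORIGIN and WEIGHT.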

/* FITTING ANCOVA MODEL USING PROC GLM */

proc glm data=cars;

class origin;

model logmpg = origin weight/ solution;

output out=regdat p=predict r=resid rstudent=rstudent;

lsmeans origin;

run; quit;

The GLM Procedure

Class Level Information

Class Levels Values

ORIGIN 3 1 2 3

Number of Observations Read 405

Number of Observations Used 397

Dependent Variable: logmpg

                               Sum of
Source             DF         Squares     Mean Square   F Value   Pr > F
Model               3     34.55316332     11.51772111    434.08   <.0001
Error             393     10.42773232      0.02653367
Corrected Total   396     44.98089564

R-Square     Coeff Var      Root MSE    logmpg Mean
0.768174      5.248368      0.162892       3.103662

Source             DF       Type I SS     Mean Square   F Value   Pr > F
ORIGIN              2     14.90197036      7.45098518    280.81   <.0001
WEIGHT              1     19.65119296     19.65119296    740.61   <.0001

Source             DF     Type III SS     Mean Square   F Value   Pr > F
ORIGIN              2      0.17862963      0.08931481      3.37   0.0355
WEIGHT              1     19.65119296     19.65119296    740.61   <.0001

                                    Standard
Parameter           Estimate           Error   t Value   Pr > |t|
Intercept        4.130553750 B    0.03264613    126.53     <.0001
ORIGIN 1        -0.064389997 B    0.02516936     -2.56     0.0109
ORIGIN 2        -0.027857559 B    0.02685074     -1.04     0.3001
ORIGIN 3         0.000000000 B             .         .          .
WEIGHT          -0.000331005      0.00001216    -27.21     <.0001

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

Least Squares Means

                logmpg
ORIGIN          LSMEAN
1           3.08440706
2           3.12093949
3           3.14879705
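The least squares means are the predicted LOGMPG for each origin with WEIGHT held at its overall mean, so their differences reproduce the ORIGIN estimates in the solution table. The lines below are a worked check from the printed values; the subscripts are notation introduced here.

```latex
\begin{align*}
\mathrm{LSMEAN}_1 - \mathrm{LSMEAN}_3 &= 3.08440706 - 3.14879705 = -0.06438999
  \quad (\text{ORIGIN 1 estimate: } -0.064389997) \\
\mathrm{LSMEAN}_2 - \mathrm{LSMEAN}_3 &= 3.12093949 - 3.14879705 = -0.02785756
  \quad (\text{ORIGIN 2 estimate: } -0.027857559) \\
\exp(-0.06439) &\approx 0.938
\end{align*}
```

At the same weight, American cars are therefore predicted to have MPG about 6% lower than Japanese cars. Working backward from the intercept, the WEIGHT slope, and the LSMEAN for origin 3 suggests that the mean weight used in these adjusted means is roughly 2,966 lb.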

/* MODEL DIAGNOSTICS FOR GLM */

goptions reset=all;

proc gplot data=regdat;

plot rstudent*predict;

run; quit;

proc univariate data=regdat;

var rstudent;

histogram/normal;

qqplot /normal (mu=est sigma=est);

run;