Analysis of Covariance (ANCOVA)
Using SAS
(commands = day3_finan_ancova.sas)
The commands in this handout use the dataset CARS.sas7bdat.
/*************************************
ANALYSIS OF COVARIANCE (ANCOVA)
COMMANDS=DAY3_FINAN_ANCOVA.SAS
**************************************/
/*Specify generic output using Options Formchar= */
options formchar="|----|+|---+=|-/\>*";
title;
/* INDICATE LIBRARY CONTAINING PERMANENT SAS DATA SET "CARS" */
libname sasdata2 V9 "C:\temp\sasdata2";
/* FROM REGRESSION ANALYSIS SECTION WE KNOW THAT THE LOG
TRANSFORMED DEPENDENT VARIABLE LOGMPG PERFORMS BETTER
THAN THE ORIGINAL VARIABLE MPG.
THE OUTLIER WITH YEAR=0 IS EXCLUDED FROM THE DATASET. */
data cars;
set sasdata2.cars;
logmpg = log(mpg);
if year ne 0;
run;
/* VEHICLE WEIGHT IS A SIGNIFICANT PREDICTOR FOR MPG. */
proc reg data=cars;
model logmpg = weight;
run; quit;
The REG Procedure
Model: MODEL1
Dependent Variable: logmpg
Number of Observations Read 405
Number of Observations Used 397
Number of Observations with Missing Values 8
Analysis of Variance
                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                1      34.37453     34.37453    1280.17    <.0001
Error              395      10.60636      0.02685
Corrected Total    396      44.98090
Root MSE          0.16386    R-Square    0.7642
Dependent Mean    3.10366    Adj R-Sq    0.7636
Coeff Var         5.27971
Parameter Estimates
                                          Parameter      Standard
Variable   Label                   DF      Estimate         Error   t Value   Pr > |t|
Intercept  Intercept                1       4.13994       0.03011    137.50     <.0001
WEIGHT     Vehicle Weight (lbs.)    1   -0.00034939    0.00000977    -35.78     <.0001
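Because the response is log(MPG), the WEIGHT coefficient is easier to read after back-transforming it. The short DATA step below is a sketch, not part of the original program: it takes the slope printed above and converts it to a multiplicative effect on MPG per additional 1000 lbs. The data set name weight_effect and the variable names are hypothetical.

```sas
/* Sketch: back-transform the printed WEIGHT slope from the log scale. */
data weight_effect;
  b_weight       = -0.00034939;            /* slope from the output above */
  ratio_per_1000 = exp(1000 * b_weight);   /* multiplicative change in MPG */
  pct_change     = 100 * (ratio_per_1000 - 1);
run;
proc print data=weight_effect noobs;
  var b_weight ratio_per_1000 pct_change;
run;
```

The ratio comes out at about 0.7, so under this model each additional 1000 lbs is associated with roughly 30 percent lower MPG.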
In the ANOVA section we saw that MPG is also significantly related to ORIGIN. When we add ORIGIN to the regression model along with WEIGHT, we are fitting an analysis of covariance model. Analysis of covariance, or ANCOVA, is the name given to a linear model whose predictors include both continuous and categorical variables.
This section shows two ways of fitting an ANCOVA model in SAS: PROC REG and PROC GLM (General Linear Models).
/* CREATE DUMMY VARIABLES FOR ORIGIN */
data cars;
set cars;
if origin ne . then do;
American = (origin = 1);
European = (origin = 2);
Japanese = (origin = 3);
end;
run;
/* CHECK THE RECODING */
proc freq data=cars;
tables origin American European Japanese;
run;
proc sort data=cars;
by origin;
run;
proc print data=cars;
var logmpg weight origin American European Japanese;
run;
/* SCATTERPLOT OF LOGMPG AGAINST WEIGHT FOR EACH ORIGIN */
goptions reset=all;
symbol1 color=black value = dot line=1 interpol=rl;
symbol2 color=black value = plus line=2 interpol=rl;
symbol3 color=black value = circle line=3 interpol=rl;
proc gplot data=cars;
plot logmpg * weight = origin;
run; quit;
/* THE SCATTERPLOT SUGGESTS THAT THERE MAY BE AN INTERACTION
BETWEEN ORIGIN AND WEIGHT.*/
/* CREATE INTERACTION TERMS.*/
data cars2;
set cars;
ame_wt = American * weight;
eur_wt = European * weight;
jap_wt = Japanese * weight;
run;
To fit an ANCOVA model in PROC REG, we must create the dummy variables for the categorical variable ourselves. A categorical variable with m levels needs only m-1 dummy variables in the regression model. The level whose dummy variable is left out becomes the reference level, and each dummy coefficient estimates the difference from that level.
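As an illustration of how the omitted dummy variable determines the reference level, the sketch below leaves out AMERICAN instead of JAPANESE. It is not part of the original program and assumes the data set CARS2 and the dummy variables created earlier in this handout.

```sas
/* Sketch only: AMERICAN is left out, so American cars are the reference. */
/* Assumes CARS2 and the dummies EUROPEAN and JAPANESE already exist.     */
proc reg data=cars2;
  model logmpg = European Japanese weight;
run; quit;
```

The overall fit (ANOVA table, R-Square, WEIGHT slope) does not depend on which level is the reference. Only the intercept and the dummy coefficients change, because they now measure differences from American cars.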
/* ANCOVA IN PROC REG: INCLUDES ONE CONTINUOUS VARIABLE, TWO DUMMY
VARIABLES FOR ORIGIN, AND THE INTERACTIONS. THE DUMMY VARIABLE
JAPANESE IS LEFT OUT, SO JAPANESE IS THE REFERENCE LEVEL. */
proc reg data=cars2;
model logmpg = American European weight ame_wt eur_wt ;
plot rstudent. * predicted.;
run; quit;
The REG Procedure
Model: MODEL1
Dependent Variable: logmpg
Number of Observations Read 405
Number of Observations Used 397
Number of Observations with Missing Values 8
Analysis of Variance
                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                5      34.61181      6.92236     261.03    <.0001
Error              391      10.36909      0.02652
Corrected Total    396      44.98090
Root MSE          0.16285    R-Square    0.7695
Dependent Mean    3.10366    Adj R-Sq    0.7665
Coeff Var         5.24696
Parameter Estimates
                                          Parameter      Standard
Variable   Label                   DF      Estimate         Error   t Value   Pr > |t|
Intercept  Intercept                1       4.22024       0.12910     32.69     <.0001
American                            1      -0.14262       0.13676     -1.04     0.2977
European                            1      -0.24211       0.16262     -1.49     0.1373
WEIGHT     Vehicle Weight (lbs.)    1   -0.00037138    0.00005753     -6.46     <.0001
ame_wt                              1    0.00003697    0.00005900      0.63     0.5314
eur_wt                              1    0.00009178    0.00007007      1.31     0.1910
Studentized residuals are plotted vs. predicted values to check for constant variance.
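The PLOT statement used above is the older, traditional-graphics way to request this display. A possible alternative, assuming a SAS release with ODS Graphics available (SAS 9.2 or later), is to request the same diagnostic through the PLOTS= option of PROC REG. The sketch below is not part of the original program.

```sas
/* Sketch: studentized residuals vs. predicted values via ODS Graphics. */
ods graphics on;
proc reg data=cars2 plots(only)=(rstudentbypredicted qqplot);
  model logmpg = American European weight ame_wt eur_wt;
run; quit;
ods graphics off;
```

A roughly even band of points around zero across the range of predicted values is consistent with constant variance. A funnel shape would suggest otherwise.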
Neither interaction term is individually significant (p = 0.5314 and p = 0.1910), so we drop both and fit a simpler model with WEIGHT and the two dummy variables AMERICAN and EUROPEAN. We also include TEST statements to test the significance of the factor ORIGIN, which enters this model as the two dummy variables AMERICAN and EUROPEAN, and of the individual variable WEIGHT. We will compare the F-tests from these TEST statements in the PROC REG output with the Type III F-tests from PROC GLM.
/* REMOVE INSIGNIFICANT INTERACTIONS FROM THE MODEL */
proc reg data=cars2;
model logmpg = American European weight ;
plot rstudent. * predicted.;
Test1: test American=0, European=0;
Test2: test weight=0;
run; quit;
Analysis of Variance
                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                3      34.55316     11.51772     434.08    <.0001
Error              393      10.42773      0.02653
Corrected Total    396      44.98090
Root MSE          0.16289    R-Square    0.7682
Dependent Mean    3.10366    Adj R-Sq    0.7664
Coeff Var         5.24837
Parameter Estimates
                                          Parameter      Standard
Variable   Label                   DF      Estimate         Error   t Value   Pr > |t|
Intercept  Intercept                1       4.13055       0.03265    126.53     <.0001
American                            1      -0.06439       0.02517     -2.56     0.0109
European                            1      -0.02786       0.02685     -1.04     0.3001
WEIGHT     Vehicle Weight (lbs.)    1   -0.00033101    0.00001216    -27.21     <.0001
Test Test1 Results for Dependent Variable logmpg
                           Mean
Source            DF     Square    F Value    Pr > F
Numerator          2    0.08931       3.37    0.0355
Denominator      393    0.02653
Test Test2 Results for Dependent Variable logmpg
                           Mean
Source            DF     Square    F Value    Pr > F
Numerator          1   19.65119     740.61    <.0001
Denominator      393    0.02653
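The same TEST statement syntax can also check the two interaction terms jointly, which is a firmer basis for dropping them than two separate t-tests. The sketch below is not part of the original program. It refits the interaction model on CARS2 and uses the hypothetical label Interact.

```sas
/* Sketch: joint F-test that both interaction slopes are zero. */
proc reg data=cars2;
  model logmpg = American European weight ame_wt eur_wt;
  Interact: test ame_wt=0, eur_wt=0;
run; quit;
```

The result can be anticipated from the two ANOVA tables already printed. The error sum of squares rises from 10.36909 to 10.42773 when the interactions are removed, so F is approximately ((10.42773 - 10.36909) / 2) / 0.02652, or about 1.11 on 2 and 391 degrees of freedom. That is well below the 5 percent critical value of roughly 3.0, which agrees with the decision to drop the interactions.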
Now let's fit the same ANCOVA model in PROC GLM. PROC GLM requires a CLASS statement that names the categorical variable(s) before the MODEL statement, and it creates the dummy variables internally. By default, the last level in sort order is the reference level. In our case, the last level of ORIGIN is 3 (Japanese).
/* FITTING ANCOVA MODEL USING PROC GLM */
proc glm data=cars;
class origin;
model logmpg = origin weight/ solution;
output out=regdat p=predict r=resid rstudent=rstudent;
lsmeans origin;
run; quit;
The GLM Procedure
Class Level Information
Class Levels Values
ORIGIN 3 1 2 3
Number of Observations Read 405
Number of Observations Used 397
Dependent Variable: logmpg
                                Sum of
Source              DF         Squares     Mean Square    F Value    Pr > F
Model                3     34.55316332     11.51772111     434.08    <.0001
Error              393     10.42773232      0.02653367
Corrected Total    396     44.98089564
R-Square     Coeff Var      Root MSE    logmpg Mean
0.768174      5.248368      0.162892       3.103662
Source              DF       Type I SS     Mean Square    F Value    Pr > F
ORIGIN               2     14.90197036      7.45098518     280.81    <.0001
WEIGHT               1     19.65119296     19.65119296     740.61    <.0001
Source              DF     Type III SS     Mean Square    F Value    Pr > F
ORIGIN               2      0.17862963      0.08931481       3.37    0.0355
WEIGHT               1     19.65119296     19.65119296     740.61    <.0001
                                   Standard
Parameter         Estimate            Error    t Value    Pr > |t|
Intercept      4.130553750 B     0.03264613     126.53      <.0001
ORIGIN 1      -0.064389997 B     0.02516936      -2.56      0.0109
ORIGIN 2      -0.027857559 B     0.02685074      -1.04      0.3001
ORIGIN 3       0.000000000 B      .                .         .
WEIGHT        -0.000331005       0.00001216     -27.21      <.0001
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the
normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely
estimable.
Least Squares Means
               logmpg
ORIGIN         LSMEAN
1          3.08440706
2          3.12093949
3          3.14879705
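The least squares means above are the model's predicted log(MPG) for each ORIGIN with the covariate WEIGHT held at its mean, so they can be rebuilt from the parameter estimates. The DATA step below is a sketch added for illustration. It uses only numbers printed in the GLM output, and the data set name lsmean_check and its variable names are hypothetical.

```sas
/* Sketch: rebuild the LS means from the printed GLM estimates. */
data lsmean_check;
  b0      =  4.130553750;   /* intercept (Japanese is the reference) */
  b_amer  = -0.064389997;   /* ORIGIN 1 */
  b_eur   = -0.027857559;   /* ORIGIN 2 */
  b_wt    = -0.000331005;   /* WEIGHT slope */
  lsm_jap =  3.14879705;    /* printed LS mean for ORIGIN 3 */
  /* mean weight implied by the Japanese LS mean */
  mean_wt  = (lsm_jap - b0) / b_wt;
  /* LS means for the other two levels at that same weight */
  lsm_amer = b0 + b_amer + b_wt * mean_wt;
  lsm_eur  = b0 + b_eur  + b_wt * mean_wt;
run;
proc print data=lsmean_check noobs;
  var mean_wt lsm_amer lsm_eur;
run;
```

The rebuilt values for ORIGIN 1 and 2 should match the printed LS means up to rounding. This shows that differences between LS means are simply the ORIGIN coefficients, that is, differences between origins for cars of the same weight.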
/* MODEL DIAGNOSTICS FOR GLM */
goptions reset=all;
proc gplot data=regdat;
plot rstudent*predict;
run; quit;
proc univariate data=regdat;
var rstudent;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;