CSSS 508: Intro R

3/01/06

Lab 8: Stepwise Regression / ANOVA

We’ve seen how to find a regression equation using lm( ).

We give it a response variable y and a list of predictor variables: x1, x2, x3, …, xp

While the individual variables might all be significant predictors of your response variable on their own, including all variables in the model may change the significance of some of them.

Some of the variables may be highly correlated. That is, the information in one variable might be closely copied in another variable. We won’t need both variables in the model.

(You don’t want to cram your model full of unnecessary variables.)

Let’s generate some data along a parabola:

x<-seq(0,10,length=100)

y<-2*x^2+rnorm(100,0,2)

plot(x,y)

Fitting a line to the data:

fit1<-lm(y~x)

summary(fit1)

abline(fit1,col="green")

Fitting a curve to the data:

x2<-x^2

fit2<-lm(y~x2)

summary(fit2)

points(x,fitted(fit2),col=2,pch=16)

Putting both terms in:

fit.both<-lm(y~x+x2)

summary(fit.both)

Only the x2 term matters; that’s the important part of the model in the fitting. The linear term was significant only in the absence of the parabola term. These two terms are very highly (though not perfectly) correlated; nearly the same information is in both of them. We only want the best one.
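A quick way to see how closely the two terms track each other is to compute their correlation directly. This is only a small check, assuming the same x grid as in the example above:

```r
# Same grid of x values as in the parabola example
x <- seq(0, 10, length = 100)
x2 <- x^2

# Correlation between the linear and the squared term
r <- cor(x, x2)
r   # very close to 1, but not exactly 1
```

A correlation this close to 1 is why the linear term looks useful on its own but adds little once x2 is in the model.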

Stepwise Regression:

The step( ) function will search for a good combination of variables for you. It compares candidate models using AIC.

help(step)

We pass in a fit using ALL the variables.

Then step goes through and builds a model from the full model in one of three ways:

1) Forward Selection

Picks the best variables one at a time until no more useful variables remain. The variable with the most contribution to the model is chosen first. Then the variable with the most contribution given the contribution of the first one is chosen next. Then the variable with the most contribution given the contributions of the first variable and the second variable is chosen next. And so on.

2) Backward Selection

Starts with the full model. Removes variables with small contribution one at a time until only variables with significant model contributions remain. The variable with the least contribution is removed first. The new model is refit and the variable with the least contribution is removed. And so on.

3) Both

Alternates forward and backward selection steps. Allows for the possibility that once a variable has entered the model, other variables might no longer be significant contributors (once the model has been refit).

You do not necessarily get the same model with every selection method.
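As a rough sketch, the three methods can be requested through the direction argument of step( ). The data frame dat, the object names, and the seed below are made up for illustration; the data are simulated the same way as the parabola example. Here forward selection (and "both") starts from an intercept-only fit and is told the largest model allowed through the scope argument.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- seq(0, 10, length = 100)
dat <- data.frame(x = x, x2 = x^2,
                  y = 2 * x^2 + rnorm(100, 0, 2))

full <- lm(y ~ x + x2, data = dat)   # model with all the variables
null <- lm(y ~ 1, data = dat)        # intercept-only model

# Forward: start small, add terms up to the upper scope
fwd <- step(null, scope = list(lower = ~ 1, upper = ~ x + x2),
            direction = "forward", trace = 0)

# Backward: start from the full model, drop terms
bwd <- step(full, direction = "backward", trace = 0)

# Both: terms may be added or dropped at each step
both <- step(null, scope = list(lower = ~ 1, upper = ~ x + x2),
             direction = "both", trace = 0)

formula(fwd)
formula(bwd)
formula(both)
```

Setting trace = 0 only suppresses the step-by-step printout; leave it out to watch the AIC comparisons at each step.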

Looking at our previous example:

summary(fit.both)

step(fit.both)

Chooses only the x2 term.

You could use step to help you choose what polynomial order to use when modeling your data (x, x2, x3, x4, etc.)

You just need to create variables for each order, fit a full model, and then pass the full model into the step function.
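A minimal sketch of that recipe, using simulated parabola data like the earlier example. The names dat, fit.full, and fit.step and the seed are assumptions made for this illustration. Because the polynomial terms are strongly correlated with each other, the terms that survive can vary from one simulated data set to the next.

```r
set.seed(508)                    # arbitrary seed, only for reproducibility
x <- seq(0, 10, length = 100)
y <- 2 * x^2 + rnorm(100, 0, 2)

# One variable per polynomial order
dat <- data.frame(y = y, x = x, x2 = x^2, x3 = x^3, x4 = x^4)

# Full model with every order, then let step choose
fit.full <- lm(y ~ x + x2 + x3 + x4, data = dat)
fit.step <- step(fit.full, trace = 0)

formula(fit.step)                # the terms that were kept
extractAIC(fit.full)             # (equivalent df, AIC) for the full model
extractAIC(fit.step)             # AIC of the chosen model is no larger
```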

Recall the crabs data from HW 7. The variables were all highly correlated. If we model one of the measurements, the model likely will not need all the remaining variables.

library(MASS)

pairs(crabs)

species<-crabs[,1]

gender<-crabs[,2]

FL<-crabs[,4]

RW<-crabs[,5]

CL<-crabs[,6]

CW<-crabs[,7]

BD<-crabs[,8]

Let’s try to model frontal lobe size (FL).

FL.fit<-lm(FL~species+gender+RW+CL+CW+BD)

FL.fit

Note that the species and gender variables are factor variables and that the coefficients are associated with the O species and the M gender.
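To see where those coefficient names come from, we can look at the factor levels. With R's default coding, the first level of a factor is the baseline and each coefficient is named after a non-baseline level. The sketch below uses a smaller model for brevity; species2 and the fit names are made up for illustration.

```r
library(MASS)

species <- crabs[, 1]
gender <- crabs[, 2]
FL <- crabs[, 4]

levels(species)   # first level listed is the baseline
levels(gender)

fit.fac <- lm(FL ~ species + gender)
names(coef(fit.fac))

# Changing the baseline changes which level gets a coefficient
species2 <- relevel(species, ref = "O")
fit.fac2 <- lm(FL ~ species2 + gender)
names(coef(fit.fac2))
```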

step(FL.fit)

The procedure chose the combination of:

species, gender, Carapace Length, and Carapace Width.

Let’s look back at pairs(crabs) and the relationship with the chosen variables.

Also,

boxplot(FL[species=="O"],FL[species=="B"],names=c("Orange","Blue"))

boxplot(FL[gender=="F"],FL[gender=="M"],names=c("Female","Male"))

cor(FL,CL)

cor(FL,CW)

While you might be able to find better individual variables for the model, step chose the best combination of variables for the model.

ANOVA: Analysis of Variance

We can find analysis of variance tables in a couple of different ways.

help(anova)

help(aov)

anova gives a sequential analysis of variance table.

It takes one or more fitted lm( ) objects as arguments.

If you just pass in one model object, it returns a table with tests for the sequential significance of each of the terms. It tests the variables in the order you typed them in the model, so the table will change if you change the order of the terms.
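The effect of term order can be seen with a small two-variable illustration on the crabs data. The names tab.a and tab.b are made up for this sketch. The sum of squares credited to RW depends on whether it enters before or after CL, while the residual line is the same either way.

```r
library(MASS)

# Same two predictors, typed in different orders
tab.a <- anova(lm(FL ~ RW + CL, data = crabs))
tab.b <- anova(lm(FL ~ CL + RW, data = crabs))

tab.a
tab.b

# Sum of squares for RW when it enters first vs. second
tab.a["RW", "Sum Sq"]
tab.b["RW", "Sum Sq"]

# The residual sum of squares does not depend on the order
tab.a["Residuals", "Sum Sq"]
tab.b["Residuals", "Sum Sq"]
```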

anova(FL.fit)

Note that RW appears as significant in the table. Our step procedure did not select RW. Would it be better if RW were in the model?

If you pass in two model objects, anova tests the models against each other.

(Often used with models differing only by the inclusion of one extra variable.)

FL.fit1<-lm(FL~species+gender+CL+CW)

FL.fit2<-lm(FL~species+gender+CL+CW+RW)

anova(FL.fit1,FL.fit2)

The test is not significant.

Adding RW to the model does not improve the model significantly.
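When the two models differ by a single one-degree-of-freedom term, the F test from anova matches the t test for that term in the larger model (F equals t squared, with the same p-value). A small sketch, refitting the two models with data = crabs; sp and sex are the column names in crabs, and the other object names are made up for illustration.

```r
library(MASS)

fit1 <- lm(FL ~ sp + sex + CL + CW, data = crabs)
fit2 <- lm(FL ~ sp + sex + CL + CW + RW, data = crabs)

cmp <- anova(fit1, fit2)
F.stat <- cmp$F[2]             # F statistic for adding RW
p.val <- cmp$"Pr(>F)"[2]       # its p-value

# t statistic and p-value for RW in the larger model
coefs <- summary(fit2)$coefficients
t.RW <- coefs["RW", "t value"]
p.RW <- coefs["RW", "Pr(>|t|)"]

c(F = F.stat, t.squared = t.RW^2)
c(p.anova = p.val, p.summary = p.RW)
```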

The function aov is used with a model formula rather than an lm object.

fit.aov<-aov(FL~species+gender+CL+CW)

summary(fit.aov)

This aov object has similar information to the lm object.

names(fit.aov)
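One way to check how the two objects are related is to fit the same formula both ways. An aov object also carries the class "lm", and the coefficients agree. A minimal sketch using data = crabs (sp and sex are the column names in crabs; fit.lm.chk and fit.aov.chk are made-up names):

```r
library(MASS)

fit.lm.chk <- lm(FL ~ sp + sex + CL + CW, data = crabs)
fit.aov.chk <- aov(FL ~ sp + sex + CL + CW, data = crabs)

class(fit.aov.chk)                          # includes "lm"
all.equal(coef(fit.aov.chk), coef(fit.lm.chk))

# The two tables report the same sequential sums of squares
summary(fit.aov.chk)
anova(fit.lm.chk)
```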

Adding RW:

fit.aov2<-aov(FL~species+gender+CL+CW+RW)

summary(fit.aov2)

Again, once the other variables are in the model, RW is not a significant contributor.

Rebecca Nugent, Department of Statistics, U. of Washington