Multiple Regression (Chapter 6)
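This label refers to the situation where you have more than one explanatory variable. The general form of a multiple regression model with p predictors is:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$
The predictors need not be separately measured variables. For example, we could have a quadratic model where $X_1 = X$ and $X_2 = X^2$:
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$$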
Now you know how obsessed I am with making pictures to discern a relationship. That is more difficult to do for multiple regression models. With a quadratic model it can still be done, and it can also be done in some other cases. But as the number of independent (explanatory) variables increases, so does the difficulty of seeing visually what is going on.
Challenges of Multiple Regression:
- It is more difficult to choose the best model.
- It is difficult to visualize the fitted model.
- It is sometimes difficult to interpret what the best-fitting model means in real life terms.
- Need computers
Another challenge is that the explanatory variables themselves will often be correlated. Imagine using age and height to predict weight. It is no surprise that age and height are themselves correlated.
Assumptions:
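- Independence – The Y values are independent of one another. This is violated when several Y values are measured on the same subject. If we have time this semester, I will go over a procedure that takes that non-independence into account.
- Linearity – The model predicts the average Y for the given X's, i.e. $E(Y \mid X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
- Homoscedasticity – Constant variance: the variance of Y is the same for every combination of the X's.
- Normality – For any fixed combination of the X's, Y is normally distributed.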
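We use the least squares approach to estimate the parameters of our model: we choose the estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$ that minimize the sum of squared residuals.
We have:
$$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p$$
Again, the residual is defined as the difference between the actual Y value and the predicted value, $e_i = Y_i - \hat{Y}_i$.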
We assess how well the independent variables predict the Y variable with what is called the Multiple Correlation Coefficient, R. It is the correlation between the Y's and the predicted values (the $\hat{Y}$'s).
Again, we have that
$$R^2 = \frac{SSY - SSE}{SSY}$$
so we interpret it in the same way as before: it gives the proportion of the variation in Y that is explained by variation in the X variables.
Numerical Example
I. Background
The following table contains data on mortality due to malignant melanoma of the skin of white males during the period 1950-1969, for each state in the United States as well as the District of Columbia. No mortality data are available for Alaska and Hawaii for this period. It is well known that the incidence of melanoma can be related to the amount of sunshine and, somewhat equivalently, the latitude of the area. The table contains the latitude as well as the longitude for each of these states. These numbers were simply obtained by estimating the center of the state and reading off the latitude as given in a standard atlas. Finally, the 1965 population and contiguity to an ocean are noted, where “1” indicates contiguity: the state borders one of the oceans.
State / Mortality per 10,000,000 / Latitude (degrees) / Longitude (degrees) / Population (million, 1965) / Ocean State
Alabama / 219 / 33.00 / 87.00 / 3.46 / 1
Arizona / 160 / 34.50 / 112.00 / 1.61 / 0
Arkansas / 170 / 35.00 / 92.50 / 1.96 / 0
California / 182 / 37.50 / 119.50 / 18.60 / 1
Colorado / 149 / 39.00 / 105.50 / 1.97 / 0
Connecticut / 159 / 41.80 / 72.80 / 2.83 / 1
Delaware / 200 / 39.00 / 75.50 / 0.50 / 1
Washington, DC / 177 / 39.00 / 77.00 / 0.76 / 0
Florida / 197 / 28.00 / 82.00 / 5.80 / 1
Georgia / 214 / 33.00 / 83.50 / 4.36 / 1
Idaho / 116 / 44.50 / 114.00 / 0.69 / 0
Illinois / 124 / 40.00 / 89.50 / 10.64 / 0
Indiana / 128 / 40.20 / 86.20 / 4.88 / 0
Iowa / 128 / 42.20 / 93.80 / 2.76 / 0
Kansas / 166 / 38.50 / 98.50 / 2.23 / 0
Kentucky / 147 / 37.80 / 85.00 / 3.18 / 0
Louisiana / 190 / 31.20 / 91.80 / 3.53 / 1
Maine / 117 / 45.20 / 69.00 / 0.99 / 1
Maryland / 162 / 39.00 / 76.50 / 3.52 / 1
Massachusetts / 143 / 42.20 / 71.80 / 5.35 / 1
Michigan / 117 / 43.50 / 84.50 / 8.22 / 0
Minnesota / 116 / 46.00 / 94.50 / 3.55 / 0
Mississippi / 207 / 32.80 / 90.00 / 2.32 / 1
Missouri / 131 / 38.50 / 92.00 / 4.50 / 0
Montana / 109 / 47.00 / 110.50 / 0.71 / 0
Nebraska / 122 / 41.50 / 99.50 / 1.48 / 0
Nevada / 191 / 39.00 / 117.00 / 0.44 / 0
New Hampshire / 129 / 43.80 / 71.50 / 0.67 / 1
New Jersey / 159 / 40.20 / 74.50 / 6.77 / 1
New Mexico / 141 / 35.00 / 106.00 / 1.03 / 0
New York / 152 / 43.00 / 75.50 / 18.07 / 1
North Carolina / 199 / 35.50 / 79.50 / 4.91 / 1
North Dakota / 115 / 47.50 / 100.50 / 0.65 / 0
Ohio / 131 / 40.20 / 82.80 / 10.24 / 0
Oklahoma / 182 / 35.50 / 97.20 / 2.48 / 0
Oregon / 136 / 44.00 / 120.50 / 1.90 / 1
Pennsylvania / 132 / 40.80 / 77.80 / 11.52 / 0
Rhode Island / 137 / 41.80 / 71.50 / 0.92 / 1
South Carolina / 178 / 33.80 / 81.00 / 2.54 / 1
South Dakota / 86 / 44.80 / 100.00 / 0.70 / 0
Tennessee / 186 / 36.00 / 86.20 / 3.84 / 0
Texas / 229 / 31.50 / 98.00 / 10.55 / 1
Utah / 142 / 39.50 / 111.50 / 0.99 / 0
Vermont / 153 / 44.00 / 72.50 / 0.40 / 0
Virginia / 166 / 37.50 / 78.50 / 4.46 / 1
Washington / 117 / 47.50 / 121.00 / 2.99 / 1
West Virginia / 136 / 38.80 / 80.80 / 1.81 / 0
Wisconsin / 110 / 44.50 / 90.20 / 4.14 / 0
Wyoming / 134 / 43.00 / 107.50 / 0.34 / 0
Goal: Fit a multiple regression model with two independent variables (latitude and proximity to ocean) to predict the dependent variable (mortality).
Now this is kind of interesting: X1 (latitude) is a continuous, quantitative variable, but X2 (proximity to the ocean) is a qualitative variable coded 0/1. When this is the case, we can still get a picture of what is going on by plotting mortality against latitude with a different symbol for each value of ocean.
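/* One plotting symbol per value of ocean (0 and 1) */
symbol1 color=black value=square;
symbol2 color=blue value=circle;
proc gplot data=work.melan;
   plot mortal*latitude=ocean;   /* mortality vs. latitude, grouped by ocean */
run;
quit;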
Now, to fit what is called the main effects model, we have:
proc reg data=work.melan;
   model mortal=latitude ocean;
run;
quit;
Go over SSY, SSE, SSR, R², and RMSE in the output.
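For reference when reading the PROC REG output, these quantities can be written out as follows. This is a sketch in standard notation, which is an assumption on my part since the symbols are not defined at this point in the notes: n is the number of observations, p is the number of predictors, $\bar{Y}$ is the mean of the Y's, and $\hat{Y}_i$ are the fitted values.

```latex
\begin{aligned}
SSY  &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2     && \text{(total variation in } Y\text{)} \\
SSE  &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2   && \text{(variation left unexplained)} \\
SSR  &= SSY - SSE                            && \text{(variation explained by the model)} \\
R^2  &= \frac{SSR}{SSY} \\
RMSE &= \sqrt{\frac{SSE}{n - p - 1}}
\end{aligned}
```

For the main effects model here, n = 49 (48 states plus the District of Columbia) and p = 2, so the error degrees of freedom are 49 - 2 - 1 = 46.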
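Notice then that our fitted model is a pair of parallel lines, one for each value of ocean, with the intercept and slopes taken from the parameter estimates in the output.
So on your scatter plot, draw those two lines.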
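Because ocean only takes the values 0 and 1, the main effects model gives one line per group. Writing the estimates from the output symbolically as $b_0$, $b_1$, $b_2$ (my notation; the numerical values are in the parameter estimates table of the PROC REG output and are not reproduced here), a sketch of the two lines is:

```latex
\begin{aligned}
\widehat{\text{mortal}} &= b_0 + b_1\,\text{latitude} + b_2\,\text{ocean} \\
\text{ocean} = 0: \quad \widehat{\text{mortal}} &= b_0 + b_1\,\text{latitude} \\
\text{ocean} = 1: \quad \widehat{\text{mortal}} &= (b_0 + b_2) + b_1\,\text{latitude}
\end{aligned}
```

The two lines share the slope $b_1$, so they are parallel, and $b_2$ is the vertical gap between them.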
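We can also include higher-order terms in our model, such as an interaction term.
Model:
$$\text{mortal} = \beta_0 + \beta_1\,\text{latitude} + \beta_2\,\text{ocean} + \beta_3\,(\text{latitude} \times \text{ocean}) + \varepsilon$$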
data work.melan2;
   set work.melan;
   latocean=latitude*ocean;   /* interaction term */
run;
proc reg data=work.melan2;
   model mortal=latitude ocean latocean;
run;
quit;
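To see what the interaction term does, it can help to write the model separately for each value of ocean. The sketch below assumes the coefficients are labeled $\beta_0$ (intercept), $\beta_1$ (latitude), $\beta_2$ (ocean) and $\beta_3$ (latocean); that labeling is my assumption.

```latex
\begin{aligned}
\text{ocean} = 0: \quad E(\text{mortal}) &= \beta_0 + \beta_1\,\text{latitude} \\
\text{ocean} = 1: \quad E(\text{mortal}) &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{latitude}
\end{aligned}
```

So $\beta_3$ is the difference in slopes between ocean and non-ocean states, and the test of $\beta_3 = 0$ asks whether the two lines are parallel.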
Notice that the interaction term is not significant, and that ocean is no longer significant. So we have to be careful. Each p-value tests the significance of that term given that the other variables are already in the model. So the p-value of .6927 means that the interaction is not significant in the presence of latitude and ocean. The p-value of .4242 tells us that ocean is not significant when latitude and the interaction are in the model.
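One plausible contributor to ocean losing its significance, offered as an assumption and not something shown in the output, is overlap between predictors: latocean is 0 for every non-ocean state and equal to latitude for every ocean state, so it carries much of the same information as ocean. A hedged way to check is to request variance inflation factors with the VIF option on the MODEL statement; large values suggest the predictors overlap heavily.

```sas
/* Sketch: variance inflation factors for the interaction model.    */
/* Uses the work.melan2 data set and the variable names from above. */
proc reg data=work.melan2;
   model mortal=latitude ocean latocean / vif;
run;
quit;
```

The VIF values appear as an extra column in the parameter estimates table.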
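We could also try a second-order term for latitude, latitude². (Notice that ocean² would be redundant: ocean only takes the values 0 and 1, so ocean² = ocean.) Model:
$$\text{mortal} = \beta_0 + \beta_1\,\text{latitude} + \beta_2\,\text{ocean} + \beta_3\,\text{latitude}^2 + \varepsilon$$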
data work.melan3;
   set work.melan2;
   lat2=latitude*latitude;   /* squared latitude term */
run;
proc reg data=work.melan3;
   model mortal=latitude ocean lat2;
run;
quit;
Here again we see that the higher-order term latitude² is not useful in predicting mortality.
We will spend more time on model selection next time.
For our main effects model (mortality on latitude and ocean), we can check out a residual plot.
ods graphics on;
proc reg data=work.melan3;
   model mortal=latitude ocean;
run;
quit;
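With ODS Graphics on, PROC REG produces a panel of diagnostic plots by default. If only the residual plots are wanted, one option is the PLOTS= option on the PROC REG statement. The sketch below assumes the same work.melan3 data set and asks only for residuals versus predicted values and a normal quantile plot of the residuals, which speak to the constant variance and normality assumptions.

```sas
/* Sketch: request only two residual diagnostics for the main effects model. */
ods graphics on;
proc reg data=work.melan3 plots(only)=(residualbypredicted qqplot);
   model mortal=latitude ocean;
run;
quit;
```

A residual-by-predicted plot with no funnel shape or curvature, and a quantile plot close to a straight line, would be consistent with the assumptions listed earlier.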
A strategy for data analysis when fitting models:
Example 2:
Evolutionary biologists are keenly interested in the characteristics that enable a species to withstand the selective mechanisms of evolution. An interesting variable in this respect is brain size. One might expect that bigger brains are better, but certain penalties seem to be associated with large brains, such as the need for longer pregnancies and fewer offspring. Although the individual members of the large-brained species may have more chance of surviving, the benefits for the species must be good enough to compensate for these penalties. To shed some light on this issue, it is helpful to determine exactly which characteristics are associated with large brains, after getting the effect of body size out of the way.
The dataset brainsize.sas7bdat has the variables:
Species: The name of the species
Brainweight: weight in grams of the brain
Bodyweight: weight of body in kilograms
Gestation: gestation period in days
Litter: size of the litter
Steps:
- Plot brain vs. body size, gestation, and litter (separately). We need to use the natural logarithm; let us first try it on the Y variable:
data work.brain2;
   set work.brain;
   logbrain=log(brainweight);   /* log() is the natural logarithm in SAS */
run;
proc gplot data=work.brain2;
   plot logbrain*bodyweight logbrain*gestation logbrain*litter;
run;
quit;
Let’s also transform both bodyweight and gestation.
data work.brain3;
   set work.brain;
   logbrain=log(brainweight);
   logbody=log(bodyweight);
   loggest=log(gestation);
run;
proc gplot data=work.brain3;
   plot logbrain*logbody logbrain*loggest logbrain*litter;
run;
quit;
Model:
proc reg data=work.brain3;
   model logbrain=logbody loggest litter;
run;
quit;
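Since the response and two of the predictors are on the log scale, it can help to write the model out and then transform back to the original units. The sketch below uses symbolic coefficients $\beta_0, \ldots, \beta_3$ (my labeling; the estimates come from the PROC REG output and are not reproduced here) and assumes the usual normal-error model on the log scale.

```latex
\begin{aligned}
\log(\text{brainweight}) &= \beta_0 + \beta_1 \log(\text{bodyweight}) + \beta_2 \log(\text{gestation}) + \beta_3\,\text{litter} + \varepsilon \\
\text{median brainweight} &= e^{\beta_0}\;\text{bodyweight}^{\beta_1}\;\text{gestation}^{\beta_2}\;e^{\beta_3\,\text{litter}}
\end{aligned}
```

Under this sketch, doubling body weight multiplies the median brain weight by $2^{\beta_1}$ with gestation and litter size held fixed, and one additional offspring per litter multiplies it by $e^{\beta_3}$.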