Simple Linear Regression Assumptions/Transformations:
- Linearity – the scatterplot indicates a straight-line relationship.
  - If violated: the line may be curved.
  - There may be outliers or influential observations, that is, points that may not really belong to your population.
- Constant Variance – the residual plot indicates constant scatter around the regression line.
  - If violated: the confidence interval for the slope and the test for the slope may be inadequate.
- Normality – the QQ plot of the residuals indicates that the residuals are Normal.
  - If violated: the prediction intervals may be seriously off.
- Independence – the location of any response in relation to its mean cannot be predicted, either fully or partially, from knowledge of where other responses are in relation to their means. The biggest violation of this is what is called serial correlation. Draw a picture of that phenomenon.
  - If violated: use a more advanced technique such as time series analysis.
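One way to look for serial correlation in SAS is to plot the residuals in time order and to request the Durbin-Watson statistic. The sketch below is only an illustration: the data set work.series and its variables t (time order), x, and y are hypothetical names that do not come from the handout.

```sas
/* Hypothetical data set work.series with variables t (time order), x, y. */
/* Sort by time so that adjacent rows are adjacent in time. */
proc sort data=work.series;
  by t;
run;

/* DW requests the Durbin-Watson statistic for first-order serial correlation. */
proc reg data=work.series;
  model y = x / dw;
  output out=work.fitseries r=resid;
run;
quit;

/* Residuals versus time order, joined by a line. */
symbol1 value=dot interpol=join;
proc gplot data=work.fitseries;
  plot resid*t / vref=0;
run;
quit;

/* Reset SYMBOL settings so later plots are not drawn with joined points. */
goptions reset=symbol;
```

A Durbin-Watson value near 2 is consistent with no first-order serial correlation. Long runs of residuals on the same side of zero in the plot are the picture of serial correlation described above.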
Go over the pictures on the handout.
Let’s look at an example:
In an industrial laboratory, under uniform conditions, batches of electrical insulating fluid were subjected to constant voltages until the insulating property of the fluids broke down. Seven different voltage levels, spaced 2 kilovolts (kV) apart from 26 to 38 kV, were studied. The measured responses were the times, in minutes, until breakdown, as listed in the table below.
Log Transform of Y
Picture of untransformed data
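/* Copy the voltage data into the WORK library */
data work.voltage;
set mydata.voltage;
run;
/* Scatterplot of breakdown time (Y) against voltage (X) */
symbol1 value=dot;
proc gplot data=work.voltage;
plot time*voltage;
run;
quit;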
If we take the log transform of Y (note that log is really ln, the natural log, in SAS), we obtain:
data work.voltage2;
set work.voltage;
logtime=log(time);
run;
proc gplot data=work.voltage2;
plot logtime*voltage;
run;
ods graphics on;
/* CLB requests 95% confidence limits for the parameter estimates in PROC REG */
proc reg data=work.voltage2;
model logtime=voltage / clb;
output out=work.fitvolt p=yhat r=resid;
run;
quit;
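The assumptions can be checked on the log scale from the output data set work.fitvolt created above. This is a minimal sketch, not the plots from the handout. It uses only the variables yhat and resid written by the OUTPUT statement.

```sas
/* Residuals versus fitted values: look for constant scatter around zero. */
symbol1 value=dot;
proc gplot data=work.fitvolt;
  plot resid*yhat / vref=0;
run;
quit;

/* Normal QQ plot of the residuals, with a reference line based on */
/* the estimated mean and standard deviation.                      */
proc univariate data=work.fitvolt noprint;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

A funnel shape in the first plot would point to non-constant variance. Points bending away from the reference line in the second would point to non-Normal residuals.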
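Remember, our usual model is μ{Y|X} = β0 + β1*X, but now we are fitting a log transform, so our model is μ{log(Y)|X} = β0 + β1*X. Back-transforming gives Median{Y|X} = exp(β0)*exp(β1*X), so Median{Y|X+1} / Median{Y|X} = exp(β1).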
All this math tells us that if we increase the voltage by one unit (1 kV), the resulting median time to failure is exp(-.507)=.60 times the previous value, so the median time to failure decreases by 40%. For example, if X=30 kV, we predict log Y=18.955-.507*30=3.745, and exp(3.745)=42.31. So at 30 kV our predicted median failure time is 42.31 minutes. At 31 kV, our predicted median failure time is exp(18.955-.507*31)=exp(3.238)=25.48 minutes.
Notice that 25.48/42.31=.60. So at 32 kV, we would have approximately .6*(25.48)=15.288.
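The arithmetic above can be reproduced in a short DATA step. This sketch hard-codes the rounded coefficients quoted in the text (18.955 and -.507) rather than reading them from the PROC REG output. The data set name work.voltpred and the variable names are made up for illustration.

```sas
/* Coefficients as quoted in the text above (rounded). */
data work.voltpred;
  b0 = 18.955;
  b1 = -0.507;
  do voltage = 30 to 32;
    logtime_hat = b0 + b1*voltage;    /* fitted value on the log scale */
    median_time = exp(logtime_hat);   /* back-transformed median, in minutes */
    output;
  end;
  format logtime_hat 8.3 median_time 8.2;
run;

proc print data=work.voltpred noobs;
  var voltage logtime_hat median_time;
run;
```

The direct calculation at 32 kV gives about 15.35 rather than 15.288. The small difference comes from rounding the multiplier to .6.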
We can use the 95% confidence bounds as well.
So we are 95% confident that each one-unit (1 kV) increase in voltage predicts a median between 53.7% and 67.5% of the current prediction. Equivalently, we are 95% confident that each one-unit increase in voltage decreases the predicted median by between 32.5% and 46.3%.
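The percentage bounds come from exponentiating the confidence limits for the slope. One possible way to do this in SAS is to capture the parameter estimates table with ODS OUTPUT. In this sketch the data set names work.voltparms and work.voltratio are made up, and the column names LowerCL and UpperCL are an assumption about how PROC REG labels the CLB limits, so check them with PROC CONTENTS before relying on this.

```sas
/* Capture the parameter estimates; CLB adds 95% limits for each coefficient. */
ods output ParameterEstimates=work.voltparms;
proc reg data=work.voltage2;
  model logtime = voltage / clb;
run;
quit;

/* Assumption: the CLB limits are stored in columns LowerCL and UpperCL. */
data work.voltratio;
  set work.voltparms;
  where upcase(Variable) = 'VOLTAGE';
  ratio       = exp(Estimate);   /* multiplicative change in median per 1 kV */
  ratio_lower = exp(LowerCL);
  ratio_upper = exp(UpperCL);
  pct_decrease_low  = 100*(1 - ratio_upper);
  pct_decrease_high = 100*(1 - ratio_lower);
run;

proc print data=work.voltratio noobs;
  var ratio ratio_lower ratio_upper pct_decrease_low pct_decrease_high;
run;
```

The smaller ratio corresponds to the larger percentage decrease, which is why the lower and upper limits swap roles in the last two variables.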
Log Transform of X
A certain kind of meat processing may begin once the pH in postmortem muscle of a steer carcass decreases to 6.0 from a pH at time of slaughter of around 7.0 to 7.2. It is not practical to monitor the pH decline for each animal, so an estimate is needed of the time after slaughter at which the pH reaches 6.0. To estimate this time, 10 steer carcasses were assigned to be measured for pH at one of five times after slaughter. Here are the data:
TIME / PH
1.00 / 7.02
1.00 / 6.93
2.00 / 6.42
2.00 / 6.51
4.00 / 6.07
4.00 / 5.99
6.00 / 5.59
6.00 / 5.80
8.00 / 5.51
8.00 / 5.36
Actual data plot (notice that for this experiment time is the explanatory variable X and pH is the response Y):
proc gplot data=work.slaughter;
plot ph*time;
run;
Now with Time logged
data work.slaughter2;
set work.slaughter;
logtime=log(time);
run;
proc gplot data=work.slaughter2;
plot ph*logtime;
run;
proc reg data=work.slaughter2;
model ph=logtime /clb ;
output out=work.fitslaughter p=yhat r=resid;
run;
quit;
Among other graphs we have …
In this example, we have the fitted model μ{pH|time} = β0 + β1*log(time), with estimated slope β1 = -.7257.
This means that at 2 minutes, we predict the pH to be β0 - .7257*log(2).
If we double the time to 4 minutes, our predicted pH is β0 - .7257*log(4) = β0 - .7257*log(2) - .7257*log(2).
We can predict this change by looking at -.7257*log(2) = -.503: doubling the time decreases the pH by about .5 units. We are 95% confident that doubling the time decreases the pH by .45 to .56 units.
What does tripling the time do? It decreases the pH by .7257*log(3) = .8 units.
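When only X is logged, multiplying the time by any factor changes the mean pH by the slope times the log of that factor, whatever the starting time. A small DATA step can tabulate this. The slope -.7257 is the value quoted above. The data set name work.pheffect and the variable names are made up for illustration.

```sas
data work.pheffect;
  b1 = -0.7257;                   /* slope on log(time), as quoted above */
  do factor = 2, 3;
    change_ph = b1*log(factor);   /* change in mean pH when time is multiplied by factor */
    output;
  end;
  format change_ph 8.3;
run;

proc print data=work.pheffect noobs;
  var factor change_ph;
run;
```

The two rows correspond to the doubling (about -.503) and tripling (about -.797) effects discussed above.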
Log Transform of both Y and X
Biologists have noticed a consistent relation between the area of islands and the number of animal and plant species living on them. If S is the number of species and A is the area, then S = C*A^γ (roughly), where C is a constant and γ is a biologically meaningful parameter that depends on the group of organisms (birds, reptiles, or grasses, for example). Taking logs of both sides gives log(S) = log(C) + γ*log(A), a straight line in log(A). Estimates of this relationship are useful in conservation biology for predicting species extinction rates due to diminishing habitat.
The data below are the numbers of reptile and amphibian species and the island areas for seven islands in the West Indies.
Island / AREA / SPECIES
Cuba / 44218.00 / 100.00
Hispaniola / 29371.00 / 108.00
Jamaica / 4244.00 / 45.00
Puerto Rico / 3435.00 / 53.00
Montserrat / 32.00 / 16.00
Saba / 5.00 / 11.00
Redonda / 1.00 / 7.00
Look at scatterplots:
data work.island;
set mydata.island;
run;
quit;
proc gplot data=work.island;
plot species*area;
run;
data work.island2;
set work.island;
logarea=log(area);
logspecies=log(species);
run;
With X logged
proc gplot data=work.island2;
plot species*logarea;
run;
With X and Y logged
proc gplot data=work.island2;
plot logspecies*logarea;
run;
proc reg data=work.island2;
model logspecies=logarea /clb ;
output out=work.fitisland p=yhat r=resid;
run;
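To make a prediction for an island with an area of 10,000 square miles, we back-transform the fitted value: Median{Species|Area=10000} = exp(β0 + β1*log(10000)), with the estimates of β0 and β1 taken from the PROC REG output.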
Since X is logged, a twofold increase in X changes log Y by log(2)*(.25)=.173 units. Since Y is logged as well, that multiplies Y by exp(.173)=1.19 (equivalently, 2^(.25)=1.19). So a doubling of area results in a 19% increase in the median number of species: starting from the prediction of 69.41, a doubled area gives (1.19)*69.41=82.60. The 95% confidence interpretation is that we are 95% confident that a doubling of area multiplies the median number of species by between 2^(.219)=1.16 and 2^(.281)=1.22 (an increase of 16% to 22%).
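A prediction like this can also be obtained without copying coefficients by hand. A common approach is to append a row with the area of interest and a missing response, refit, and back-transform the predicted value. The sketch below assumes work.island2 from above. The data set names work.newisland, work.islandscore, work.islandpred, and work.islandpred2 are made up, and the doubling factor uses the rounded slope .25 from the text.

```sas
/* A new row with the area of interest and a missing response. */
data work.newisland;
  area = 10000;
  logarea = log(area);
  logspecies = .;
run;

data work.islandscore;
  set work.island2 work.newisland;
run;

/* The row with a missing response is not used in fitting, */
/* but it still receives a predicted value in the output.  */
proc reg data=work.islandscore;
  model logspecies = logarea / clb;
  output out=work.islandpred p=yhat;
run;
quit;

data work.islandpred2;
  set work.islandpred;
  where logspecies = .;
  median_species = exp(yhat);                /* back-transformed median */
  double_factor  = 2**0.25;                  /* effect of doubling area, slope .25 */
  doubled_median = median_species*double_factor;
run;

proc print data=work.islandpred2 noobs;
  var area median_species double_factor doubled_median;
run;
```

Because the fit uses unrounded coefficients, the printed median may differ slightly from a hand calculation with rounded values.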