17 - Transformations in Simple Linear Regression
Example - Polychlorinated Biphenyl (PCB) Concentration and Age of Rainbow Trout in Lake Cayuga (New York).
Data File: PCBtrout.JMP
In this experiment we are studying the relationship between age of trout and the PCB concentration found in their tissues. The rainbow trout were all sampled from Lake Cayuga in New York.
Sample Correlation (r)
PCB vs. Age: r = 0.7364
We begin by examining the correlation between age and PCB concentration as well as a scatterplot matrix. The correlation is r = 0.7364, which is a moderate positive correlation; however, examination of the scatterplot matrix suggests that the relationship is not linear, so the correlation is not an appropriate measure of the association between PCB concentration and age. One approach to dealing with nonlinearity is to transform Y and/or X using power transformations (e.g. square root, logarithm, etc.) in an attempt to strengthen the linear association between Y and X. The Bulging Rule shown below provides guidance on the types of transformations to use for the two variables.
Bulging Rule:
Here we see that the Bulging Rule suggests lowering the power on Y and/or raising the power on X. Another consideration is the distribution of each variable, shown in the histograms added to the borders of the scatter plot below (select Bivariate Fit … > Histogram Borders). We can see that the PCB distribution is markedly skewed to the right, so a log transformation would improve normality. When X and Y are both normally distributed, it is often the case that the relationship between them is linear.
Thus we begin by taking the log base 10 of the PCB concentration. To do this, create a new column and double click at the top of the column. Select Formula from the New Property pull-down menu and click Edit Formula. Then select Transcendental from the right hand menu (because the logarithm is a transcendental function) and click on Log 10 from the right menu. Finally select the variable you wish to take the log of, which in this case is PCB concentration. When you are finished the expression in the calculator window should read
Log10(PCB)
The relationship between age and log10(PCB) is summarized below.
Sample Correlation (r)
log10(PCB) vs. Age: r = 0.8552
We can see that the correlation has increased after transformation.
However, there still appears to be some curvature present. The bend is such that the Bulging Rule suggests either raising the power on Y or lowering the power on X. We could consider raising the power on Y, which would imply that the log transformation of Y was too strong. In this case, however, we will consider lowering the power on X by using the square root of age. Again we need to add a column that will contain the results of a formula. To take the square root of a variable, simply select the square root button and put the variable Age under the radical. The result should look like this:
Again examination of the correlation and scatter plot matrix shows improvement in terms of linearity.
Sample Correlation (r)
log10(PCB) vs. sqrt(Age): r = 0.8866
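The effect of moving Y and X along the ladder of powers on the correlation can also be checked outside JMP. The following is a minimal Python sketch, not part of the JMP workflow. The data are hypothetical and noise-free (they are not the PCBtrout.JMP values) and are generated so that log10(y) is exactly linear in sqrt(age); the helper name pearson_r is also an assumption made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, noise-free data (NOT the PCBtrout.JMP values):
# log10(y) is constructed to be exactly linear in sqrt(age).
age = list(range(1, 13))
pcb = [10 ** (-0.5 + 0.5 * math.sqrt(a)) for a in age]

# Transformed versions of each variable.
log_pcb = [math.log10(y) for y in pcb]
sqrt_age = [math.sqrt(a) for a in age]

r_raw = pearson_r(age, pcb)            # both variables in the original scale
r_log = pearson_r(age, log_pcb)        # power on Y lowered to log10
r_both = pearson_r(sqrt_age, log_pcb)  # log10 on Y and square root on X

print(f"r(Age, PCB)              = {r_raw:.4f}")
print(f"r(Age, log10(PCB))       = {r_log:.4f}")
print(f"r(sqrt(Age), log10(PCB)) = {r_both:.4f}")
```

With real data the correlations will not reach 1, but the same comparison shows whether each transformation moves the relationship closer to linear.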
At this point we should feel comfortable building a simple linear regression model for these data. Select Analyze > Fit Y by X menu and put log10(PCB) in the Y box and sqrt(Age) in the X box. To fit the regression line, select Fit Line from the Bivariate Fit pull-down menu located above the scatter plot. The results are shown below.
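The regression equation, with the coefficients rounded as in the calculation below, is:
$\widehat{\log_{10}(\text{PCB})} = -0.519 + 0.521\sqrt{\text{Age}}$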
To use this equation to predict the PCB concentration for a fish that is, for example, 5 years old, we would take the square root of 5 and plug it into the regression equation. The predicted log10 PCB concentration would be:
-.519 + .521*2.236 = .645 log10(ppm)
This corresponds to a PCB concentration of $10^{0.645} \approx 4.42$ ppm.
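The prediction and its back-transformation can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and the coefficients are the rounded values used in the hand calculation, so the result may differ from JMP's output in the third decimal place.

```python
import math

# Rounded coefficients taken from the hand calculation in the text.
B0 = -0.519  # intercept, in log10(ppm)
B1 = 0.521   # slope, in log10(ppm) per unit of sqrt(Age)

def predict_log10_pcb(age_years):
    """Predicted log10 PCB concentration for a trout of the given age."""
    return B0 + B1 * math.sqrt(age_years)

def predict_pcb_ppm(age_years):
    """Back-transform the log10 prediction to the original ppm scale."""
    return 10 ** predict_log10_pcb(age_years)

log_pred = predict_log10_pcb(5)
ppm_pred = predict_pcb_ppm(5)
print(f"predicted log10(PCB) at age 5: {log_pred:.3f}")
print(f"predicted PCB at age 5 (ppm):  {ppm_pred:.2f}")
```

Carrying the unrounded square root of 5 through the calculation gives roughly 0.646 log10(ppm) and 4.43 ppm, which agrees with the hand calculation up to rounding.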
R² = 0.786, which implies that 78.6% of the variation in the log10 PCB concentration is explained by the regression on the square root of the age of the trout.
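In simple linear regression R² equals the square of the sample correlation, which is why r = 0.8866 and R² = 0.786 agree (0.8866² ≈ 0.786). The sketch below checks this identity with a hand-coded least-squares fit. The data values and the function name fit_line are hypothetical and are not taken from the trout data.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, R^2, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_resid = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_resid / syy
    r = sxy / math.sqrt(sxx * syy)
    return b0, b1, r_squared, r

# Hypothetical transformed data: x plays the role of sqrt(Age),
# y the role of log10(PCB). These are NOT the trout measurements.
x = [1.0, 1.4, 1.7, 2.0, 2.2, 2.4, 2.6, 2.8]
y = [0.10, 0.15, 0.45, 0.40, 0.75, 0.70, 0.95, 0.90]

b0, b1, r2, r = fit_line(x, y)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"R^2 = {r2:.4f}, r^2 = {r ** 2:.4f}")
```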
To examine a plot of the residuals, select Plot Residuals from the Linear Fit pull-down menu located beneath the scatter plot. The resulting plot is shown below.
No assumption violations are evident from this plot. To assess normality of the residuals, first save the residuals from the fit by selecting Save Residuals from the Linear Fit pull-down menu. This will save the residual values to the original spreadsheet. Then select Analyze > Distribution to examine the distribution of the residuals and obtain a normal quantile plot. The results are shown below.
The residuals appear to be slightly kurtotic, but the departure from normality is not severe. To obtain prediction and confidence intervals we need to fit the regression model using the Fit Model option from the Analyze menu. Put log10(PCB) in the Y box and sqrt(Age) in the Model Effects box. From the Fit Model results select Save Columns > Prediction Formula, Mean Confidence Interval (CI for E(Y|X)) and Indiv Confidence Interval (CI for the Y value of an individual with X = x). To obtain these CIs in the original scale (ppm), add two columns to the spreadsheet that will take the results of a formula. Then use the JMP calculator and the function 10^x to create the following formulas:
This will transform the endpoints of the confidence interval for the mean back to the original ppm scale. Similar formulas could be used to convert the endpoints of the confidence interval for prediction of a single individual to the original scale.
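The same back-transformation is easy to sketch in Python. The log10-scale values below are hypothetical placeholders, not JMP output. The variable names mirror the column names used in these notes (Pred Orig, Mean Lower, Mean Upper), and the helper to_ppm is an assumption made for illustration.

```python
# Hypothetical log10-scale columns for three rows (NOT actual JMP output).
pred_log = [0.20, 0.65, 1.00]        # saved prediction formula values
mean_lower_log = [0.10, 0.58, 0.90]  # lower 95% limits for the mean
mean_upper_log = [0.30, 0.72, 1.10]  # upper 95% limits for the mean

def to_ppm(log10_values):
    """Apply 10^x to each value, converting log10(ppm) back to ppm."""
    return [10 ** v for v in log10_values]

pred_orig = to_ppm(pred_log)
mean_lower = to_ppm(mean_lower_log)
mean_upper = to_ppm(mean_upper_log)

for lo, mid, hi in zip(mean_lower, pred_orig, mean_upper):
    print(f"{lo:.3f} <= {mid:.3f} <= {hi:.3f}")
```

Because 10^x is an increasing function, the ordering lower < prediction < upper is preserved, but the back-transformed interval is no longer symmetric about the prediction.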
A portion of the data spread sheet containing these additional columns is shown below.
Plot of PCB vs. Age with Estimated Mean and CI’s Added
You can use Graph > Overlay Plot to graph the original data points (PCB), the predicted values (Pred Orig), the lower confidence limit (Mean Lower), and the upper confidence limit (Mean Upper) in the same plot by placing these quantities in the Y box and either sqrt(Age) or Age in the X box. The plot above shows the results using Age for the X-axis. The Connect Points option has been selected and the Show Points option has been unselected for the predicted values and the confidence bands (right-click on the legend name for each of these quantities to obtain a menu from which these options can be specified).
Example 2 ~ Mercury Levels Found in Sand Point Walleyes
Data File: Walleyes Sand Point
The variables in this data file are:
LTGHIN = length of walleye (in.)
HGPPM = mercury concentration found in fillet (ppm)
Log10(Hg) = log base 10 of the mercury concentration (log 10 ppm)
We begin by constructing a plot of mercury level (ppm) vs. length (in.) and adding a smoothing spline as a preliminary estimate of E(Hg|Length). Clearly E(Hg|Length) cannot be adequately modeled using a linear function of X.
Applying Bulging Rule:
After log transforming mercury level we have the following…
Fitting the regression model with log transformed response we obtain these results.
INTERPRETATION OF THE MODEL WITH THE LOG RESPONSE
Exponential model:
$E(Y|X) = 10^{\beta_0 + \beta_1 X}$; here we have used a base of 10, although the base used is arbitrary.
Taking the log base 10 of the response gives
$E(\log_{10}(Y)|X) = \beta_0 + \beta_1 X$
If the distribution of $\log_{10}(Y)$ is symmetric (e.g. normal), then we can also view this as an approximate relationship for the median in the original scale, i.e.
$\text{Med}(Y|X) = 10^{\beta_0 + \beta_1 X}$
How are the coefficients, $\beta_0$ and $\beta_1$, in this model interpreted? As always, the y-intercept ($\beta_0$) is the value of the response, here in the log10 scale, when X = 0. The slope, however, has a more interesting interpretation.
With exponential trends, if we change X by 1 unit, the resulting change in Y is interpreted as a percentage change in the median of Y. The percentage change in the response is the same for all values of the explanatory variable X.
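To see this consider the following:
$\text{Med}(Y|X = x) = 10^{\beta_0 + \beta_1 x}$
Now if we increase X by 1 unit we have
$\text{Med}(Y|X = x + 1) = 10^{\beta_0 + \beta_1 (x + 1)} = 10^{\beta_1} \cdot 10^{\beta_0 + \beta_1 x} = 10^{\beta_1} \cdot \text{Med}(Y|X = x)$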
Thus for exponential models where Y has been transformed to the log base 10 scale we have that…
• if X increases by 1 unit, Y gets multiplied by $10^{\beta_1}$, a $(10^{\beta_1} - 1) \times 100$% increase
• if X increases by w units, Y gets multiplied by $10^{w\beta_1}$, a $(10^{w\beta_1} - 1) \times 100$% increase
• Again, if the distribution of $\log_{10}(Y)$ is approximately symmetric, then these can be thought of as percent increases in the median of Y given X in the original scale.
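The constant-percentage property summarized in the bullets above can be verified numerically. The sketch below uses hypothetical coefficients (b0 = 0.3, b1 = 0.05), and the function names are assumptions made for illustration. It checks that the ratio Med(Y|X = x + 1) / Med(Y|X = x) equals 10^b1 regardless of x.

```python
import math

def multiplier(b1, w=1.0):
    """Factor by which the median of Y is multiplied when X increases by w."""
    return 10 ** (w * b1)

def percent_change(b1, w=1.0):
    """Percent change in the median of Y for a w-unit increase in X."""
    return (multiplier(b1, w) - 1.0) * 100.0

def median_y(b0, b1, x):
    """Median of Y at X = x under the model log10(Y) = b0 + b1*x + error."""
    return 10 ** (b0 + b1 * x)

# Hypothetical coefficients chosen only for illustration.
b0, b1 = 0.3, 0.05

# The ratio for a 1-unit increase is the same at every starting value x.
ratios = [median_y(b0, b1, x + 1) / median_y(b0, b1, x) for x in (0, 5, 20)]
print("ratios:", [round(r, 6) for r in ratios])
print(f"10^b1 = {multiplier(b1):.6f}")
print(f"percent change per unit:    {percent_change(b1):.2f}%")
print(f"percent change per 3 units: {percent_change(b1, w=3):.2f}%")
```

Note that a w-unit increase compounds the one-unit factor, so the multiplier is (10^b1)^w rather than w times the one-unit percentage.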
EXAMPLE 2: Mercury Levels in Sand Point Walleyes and Length (cont’d)
Here the estimated regression is given by:
$\hat{E}(\log_{10}(Y)|X) = -1.066535 + 0.0556892 X$, where Y is the mercury concentration (ppm) and X is the length (in.).
For a one inch increase in walleye length we estimate the mean log10(Hg) level will increase by 0.0556892 log10 ppm. Converting to the original scale, we have a multiplicative increase of $10^{0.0556892} \approx 1.1368$, i.e. the median mercury level is multiplied by an estimated 1.1368 per inch, a 13.68% increase for each one inch increase in length.
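For a 5 inch increase we estimate a multiplicative increase of $10^{5(0.0556892)} = 10^{0.278446} \approx 1.899$, i.e. an estimated 89.9% increase in the median mercury level.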
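Confidence Interval for $\beta_1$
We can find a 95% CI for the percent increase by first finding a 95% CI for $\beta_1$ and then converting the endpoints to obtain a confidence interval for $10^{\beta_1}$.
A 95% CI for $\beta_1$ is given by
$\hat{\beta}_1 \pm t \cdot SE(\hat{\beta}_1)$
where the df for the t-table value = n – 2.
Using the estimate and SE from JMP:
$0.0556892 \pm t \cdot SE(\hat{\beta}_1) = (0.0479, 0.0635)$, which converts to $(10^{0.0479}, 10^{0.0635}) = (1.1166, 1.1574)$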
Thus we estimate with 95% confidence the percent increase in the median mercury level associated with a one inch increase in length is between 11.66% and 15.74%.
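The conversion of the slope and its confidence limits to percent increases can be reproduced as follows. This sketch takes the slope estimate and the log-scale interval endpoints reported above as given. It does not recompute the interval, since the standard error and t value are not reproduced here. The function name is hypothetical.

```python
# Values reported in the text for the walleye regression.
slope = 0.0556892          # estimated slope, log10(ppm) per inch
ci_log = (0.0479, 0.0635)  # 95% CI for the slope, log10 scale

def to_percent_increase(b):
    """Percent increase in median Y per unit of X implied by log10 slope b."""
    return (10 ** b - 1) * 100

point = to_percent_increase(slope)
lower, upper = (to_percent_increase(b) for b in ci_log)

print(f"estimated increase per inch: {point:.2f}%")
print(f"95% CI: ({lower:.2f}%, {upper:.2f}%)")
```

Because 10^x is monotone, transforming the endpoints of the interval for the slope yields a valid interval for the multiplicative factor and hence for the percent increase.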
Using the Fit Model approach we can again save predicted values, confidence intervals, and prediction intervals to the spreadsheet and convert them back to the original scale (again using the 10^x function).
Visualization of raw data, Median(Y|X) and CI for Median(Y|X) using Graph > Overlay Plot, as shown in the PCB/trout example discussed above (see pg. 161).