MATH 075 SPRING 2015

Minitab Module 4: Nonlinear Regression (Part 1)

Introduction

Many data sets can be modeled well by a straight line (weight v. height, cigarette smoking v. cancer risk), but a large number are not adequately described by the linear form. For this latter group, we will now study some introductory techniques of nonlinear regression.

Finding a curve of best fit for curvilinear data is an adventure that requires care and, at times, sophistication. The first two nonlinear functions we will study in this section are the exponential and logarithmic curves. These two curves are intrinsically linear, which simply means we can transform them to a linear form and then use the methods previously explored in Module 3 for linear regression. Similar to our methods for linear regression, we will be using values like r² and Se and, just as importantly, the residual plots to assess the “goodness of fit.” However, with curvilinear data, we first need to determine which of the many possible functions would provide the best fit. With this one exception, the process follows the same guidelines as linear regression in Module 3. For our introductory course, we will limit the options of curvilinear functions to three types: exponential, logarithmic, and quadratic.

Review of Conditions for Linear Regression (from Module 3, Topic 3.3)

When assessing how well the linear regression model fits the data, we examine the following criteria:

1)  The linear regression model must have two quantitative variables.

2)  The scatterplot does not contain any overly influential outliers.

3)  The form of the scatterplot is linear.

The Conditions for Nonlinear Regression are nearly identical, except for Criterion 3, where we replace “linear” with “nonlinear.”

As in Module 3, one can use Minitab to provide diagnostics to verify that our statistical model is not violating Criteria 2 and 3. A scatterplot (Graph > Scatterplot) provides a visual check of Criterion 3: whether the data follow a linear or nonlinear trend. The Residual Plot (from Topic 3.3) provides evidence about Criteria 2 and 3 (Stat > Regression > Fitted Line Plot > Graphs > Residuals versus the variables). If there is a discernible trend or pattern in the residuals, then it might suggest that we are missing some interaction and may have the wrong model. A model whose Residual Plot shows no trend or pattern is a good candidate for the optimal model.

As a caution, there are fatal consequences if we choose the wrong shape (Criterion 3), if one or two points are overly influential (outliers that are far from the average x-value), if the data are not independent, or if you extrapolate (base a prediction on an x-value outside the range of the data). Failure to meet the other conditions can be remedied through advanced techniques that we will not consider in this course.
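For readers who want to see what a residual plot measures outside of Minitab, here is a minimal Python sketch. It is not part of the Minitab workflow: the data are made up (points on a parabola), and the helper names fit_line and residuals are our own. It fits a least-squares line to deliberately curved data and lists the residuals (observed y minus predicted y).

```python
# Illustrative sketch only: made-up data, not a course data set.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def residuals(xs, ys):
    """Residual = observed y minus the y predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]      # deliberately curved (hypothetical) data
res = residuals(xs, ys)

for x, r in zip(xs, res):
    print(x, round(r, 2))
```

The residuals are positive at both ends and negative in the middle. A curved pattern like this in a residual plot is the kind of discernible trend that suggests the wrong shape was chosen.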

The first step we perform in the analysis of two quantitative variables is to make the scatterplot of the dataset. We will expand our current modeling beyond linear with some of the following curves:

1.  exponential growth

2.  exponential decay

3.  logarithmic growth

or

4.  one of the quadratic functions—concave up or concave down

In many cases, you will need to investigate two or more functions to find the model that fits the data best.

For the first three curvilinear functions above, statisticians linearize the function by using the inverse function (recall that exponential and logarithmic functions are inverses of each other).

A quadratic form can be easily handled directly without transforming.

For exponential growth, we will take the log10 of each of the y-values.

For exponential decay and logarithmic growth, we will take the log10 of the x-values. (Minitab makes this process straightforward, as you will see later in the worksheet.)
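To see why taking a log10 straightens a curve, here is a small Python sketch (outside of Minitab) using made-up values that follow each curve exactly. The helper fit_line and the constants 5, 0.2, 3, and 2 are assumptions chosen only for illustration.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

xs = [1, 2, 3, 4, 5]

# Hypothetical exponential growth: y = 5 * 10^(0.2 x).
# Taking log10 of the y-values gives log10(y) = log10(5) + 0.2 x, a straight line in x.
ys_exp = [5 * 10 ** (0.2 * x) for x in xs]
slope_y, intercept_y = fit_line(xs, [math.log10(y) for y in ys_exp])
print(slope_y, intercept_y)    # close to 0.2 and log10(5)

# Hypothetical logarithmic curve: y = 3 + 2 log10(x).
# Taking log10 of the x-values makes y a straight line in log10(x).
ys_log = [3 + 2 * math.log10(x) for x in xs]
slope_x, intercept_x = fit_line([math.log10(x) for x in xs], ys_log)
print(slope_x, intercept_x)    # close to 2 and 3
```

In both cases the ordinary linear regression machinery from Module 3 is applied to the transformed values, which is the sense in which these curves are intrinsically linear.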

TUKEY CHART (will be presented in class)

Sometimes it will be difficult to distinguish the Curve-of-Best-Fit using only the scatterplot (for example, between a logarithmic curve and a quadratic opening downward, or between an exponential decay or growth curve and a quadratic opening upward). In these cases, we will need to find the best fit for each class of functions, compare the classes using the values of r² and Se, and assess the residual plots. Remember that just because a specific model is the best fit, it does NOT automatically follow that it is a valid fit.
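The comparison between classes of models can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are made up (generated from a logarithmic curve, so that model is guaranteed to win), the helper fit_summary is our own, and we use the usual definitions r² = 1 - SSE/SST and Se = sqrt(SSE/(n - 2)).

```python
import math

def fit_summary(xs, ys):
    """Least-squares fit of y on x, returning slope, intercept, r2 and Se."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1 - sse / sst,              # proportion of variation explained
        "se": math.sqrt(sse / (n - 2)),   # typical size of a prediction error
    }

# Hypothetical data: a decline that levels off, y = 20000 - 8000 log10(t).
years = [2, 4, 6, 8, 10, 12, 14, 16]
counts = [20000 - 8000 * math.log10(t) for t in years]

linear = fit_summary(years, counts)                             # y against x
log_x = fit_summary([math.log10(t) for t in years], counts)     # y against log10(x)

print("linear:  r2 =", round(linear["r2"], 4), " Se =", round(linear["se"], 1))
print("log10 x: r2 =", round(log_x["r2"], 4), " Se =", round(log_x["se"], 1))
```

A higher r² together with a lower Se favors a model, but the residual plot still has to be checked before calling the fit valid.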

A famous quote from 20th century statistician George E. P. Box:

“Essentially all models are wrong, but some are useful.”

Now we will look more closely at three examples of how to construct models for the three different curvilinear functions and how to assess their fit.

Part 1:

Case 1: Investigating fatalities due to drunk drivers

From the Executive Summary of Statistical Analysis of Alcohol-Related Driving Trends, 1982-2005 (http://www-nrd.nhtsa.dot.gov/Pubs/810942.pdf):

“The number of fatal crashes that involved drivers who had been drinking at the time of the crash has decreased during the past two decades. The proportion of crash fatalities that are alcohol-related – that occurred in crashes where at least one of the drivers and/or nonoccupants involved had a blood alcohol concentration (BAC) of .08 or above – decreased at a steady rate from 53 percent in 1982 to 34 percent in 1997. It leveled off for two years and then increased by 1 percent in 2000 and remained at that level for two more years before it decreased to 33 percent in 2005. The proportion of drivers involved in fatal crashes who had BAC of .08 or above decreased from 35 percent in 1982 to 20 percent in 1997 and leveled off thereafter.”

We want to study the data to determine the model that fits the data. As outlined above, the first step is to look at the scatterplot created from the data on the left.

Go to my website and open the “Module 4: Drunk Driver Fatal Accidents Data” Excel file.

Copy and paste all columns from Excel into Minitab and create the Minitab scatterplot that matches the above scatterplot.

1. Based on the scatterplot, which curvilinear models do you think we should examine? (Answer on worksheet provided)

______

You will now learn how to create the Minitab linear regression model plots we used in Topic 3.3.

We will examine how well the linear model fits first.

Linear Model

Create the Fitted Line Plot and Residual Plot

·  Stat > Regression > Fitted Line Plot

and choose “Number of Drunk Drivers in Fatal Crashes” as the response Y and “Number of Years Since 1980” as the predictor X

For the Type of Regression Model, choose Linear

Then to create the Residual Plot,

·  Click Graphs

Residuals for Plots should be automatically selected as Regular

Then in the Residuals versus the variables box, choose the same variable you chose as your predictor (X)

Click OK twice

Copy and paste these graphs into a new Word document that you will save for reference.

Questions: (Tips for your Module project) Discuss with your partner.

a.  For the explanatory variable, why didn’t we use “Year?”

b.  For the explanatory variable, why did we rewrite it as “Number of Years since 1980” instead of “Number of Years since 1982?”


(The above graphs are provided so that you can check them against the ones you have created with Minitab.)

Please record the following information from your Fitted Line Plot on your worksheet:

2. The equation of the linear regression model is: ______

3. Complete the following table.

Model / Residual Plot
What do you see?
(oval, band, fan shape, curvilinear pattern, influential outliers) / /
(Addresses Criteria 2, 3)
Linear

4. What did you notice about the Versus Fit Plot (the residual plot) for the linear regression model?

What is a second regression model we should consider?

We will now examine how well the exponential decay model fits.

Exponential Decay Model

In order to fit the data to this model using Minitab, we need to take the log10 of each x-value. This is easily accomplished using the following directions.

Create the Fitted Line Plot and Residual Plot

·  Stat > Regression > Fitted Line Plot

and select the appropriate variable for the response Y and predictor X

For the Type of Regression Model, choose Linear

Then to create the Residual Plot

·  Choose Graphs

Residuals for Plots should be automatically selected as Regular

Then in the Residuals versus the variables box, choose the same variable you chose as your predictor (X)

Click OK

To create the Exponential Decay regression model,

·  Choose Options

Under Transformations, check Log 10 of X (the 2nd box in the left column)

Click OK twice

Please copy and paste your graphs into the same Word document you created earlier below the first two.

Notice the difference if we repeat the above directions, except now:

·  Choose Options

Under Transformations, check Log 10 of X and also check Display logscale for X-variable

Click OK twice

Please copy and paste these two graphs into the same Word document you created earlier.

Question: What similarities and differences do you notice about the output from both of the exponential decay models?

> WARNING <

When using Minitab to create different models (linear, exponential, quadratic, logarithmic), be sure to double-check the boxes you’ve selected, so that you don’t inadvertently combine several models together.

Please record the following information from your Fitted Line Plot on your worksheet:

5. The equation of the Exponential Decay regression model is:______

6. Complete the following table.

Model / Residual Plot
What do you see?
(oval, band, fan shape, curvilinear pattern, influential outliers) / /
(Addresses Criteria 2, 3, 4)

7. Complete the following.

The exponential decay model is an improvement over the linear model because the Se decreased from ______ to ______. In context, this means that, on average, the prediction of the exponential decay model for the number of drivers involved in fatal accidents will be off by approximately ±______ drivers, an error reduction of ______% compared to the linear model, which can be computed as [(amount of decrease) / (Se of linear model)] x 100.

The r² increased from ______% to ______%, which in the context of the problem means that the number of years since 1980 explains ______% of the total variation in the number of drunk driver fatal accidents, a ______% increase from the linear model, which can be computed as [(amount of increase) / (r² of linear model)] x 100. Also, the residuals are (circle one) more/less normally distributed for the exponential decay model.
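The two percent-change computations above can be sketched in Python as follows. The numbers are hypothetical placeholders, not the values you should obtain from Minitab, and the helper name percent_change is our own.

```python
def percent_change(linear_value, new_value):
    """Percent change relative to the linear model's value."""
    return (new_value - linear_value) / linear_value * 100

# Hypothetical values for illustration only; use your own Minitab output.
se_linear, se_new = 200.0, 150.0    # Se of the linear model and of the new model
r2_linear, r2_new = 80.0, 90.0      # r-squared values, in percent

se_reduction = -percent_change(se_linear, se_new)   # (amount of decrease / Se linear) x 100
r2_increase = percent_change(r2_linear, r2_new)     # (amount of increase / r2 linear) x 100

print(se_reduction)   # 25.0
print(r2_increase)    # 12.5
```

Note that both percentages are taken relative to the linear model, so a change in r² from 80% to 90% counts as a 12.5% increase, not a 10% increase.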

Part 2

Case 2: Eagles

During the mid-20th century, the population of bald eagles in the lower 48 states declined substantially. A highly toxic pesticide, DDT, was the main cause of the decline. DDT causes damage to bird egg shells. By 1963, bald eagles were in danger of complete extinction. Only 417 pairs of bald eagles remained. In 1967, the bald eagle became an official endangered species. Then in 1972, the EPA banned the use of DDT in the United States. The impact of the ban was a dramatic turnaround in the fate of the bald eagle.

Note that in the table of data below, we defined our explanatory variable t to be Years after 1950. The response variable is the number of bald eagle pairs that are mating.

It appears to be a strong candidate for the exponential model because the values are increasing more rapidly as each year passes. However, we can first check the linear model and obtain the Fitted Line Plot using the directions from Case 1 and the data set (from my website) called “Module 4: Eagles and Bears Data.” Copy and paste all columns from Excel into Minitab.

Linear Model

Create the Fitted Line Plot and Residual Plot

·  Stat > Regression > Fitted Line Plot

and select the appropriate variable for the response Y and predictor X

For the Type of Regression Model, choose Linear

Then to create the Residual Plot

·  Choose Graphs

Residuals for Plots should be automatically selected as Regular

Then in the Residuals versus the variables box, choose the same variable you chose as your predictor (X)

Click OK

Please copy and paste your Minitab output into the same Word document you created earlier.

Please record the following information from your Fitted Line Plot on your worksheet:

8. The equation of the Linear regression model is: ______

9. Complete the following table for the linear model only:

Model / Residual Plot
What do you see?
(oval, band, fan shape, curvilinear pattern, influential outliers) / /
++
(Addresses Criteria 2, 3)
Linear**
Exponential

Important: **According to Occam’s Razor, the linear model is considered the preferred model unless one of the nonlinear models is significantly better (over a 4% increase in the value of r²).

++ If one of your best fit options is the exponential growth model (when using transformations), Se is no longer meaningful and cannot be used to compare models. You can compare values of r².

Questions (discuss with your partner):

Is the linear regression model appropriate for the eagle pairs data? Why or why not?

If not, which regression model do you think would be a better fit?

Let’s try the exponential growth model:

Exponential Growth Model

The process we use to produce the exponential growth regression model is similar to how Minitab generates the exponential decay regression model, except now we choose log10 Y in the (Stat > Regression > Fitted Line Plot > Options) window.
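When the y-values are transformed, the fitted equation describes log10(y) rather than y. The following Python sketch (outside of Minitab) shows one way to convert an equation of the form log10(y) = b0 + b1 x back to the exponential form y = a * b^x, with a = 10^b0 and b = 10^b1. The data are made up and are not the eagle data; the starting value 400 and growth factor 1.08 are assumptions for illustration, as are the helper names.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical exponential growth: 400 at t = 0, growing 8% per year.
t = [0, 10, 20, 30, 40]
y = [400 * 1.08 ** ti for ti in t]

# Fit a line to log10(y) against t (the "log10 Y" transformation).
b1, b0 = fit_line(t, [math.log10(yi) for yi in y])

# Back-transform: log10(y) = b0 + b1 t  is the same as  y = (10^b0) * (10^b1)^t.
a = 10 ** b0    # starting value
b = 10 ** b1    # growth factor per year

def predict(ti):
    """Predicted y on the original scale."""
    return a * b ** ti

print(round(a, 2), round(b, 4))
print(round(predict(25), 1))
```

Predictions should be made on the original scale, after back-transforming, so that they are in the same units as the response variable.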

Create the Fitted Line Plot with Residual Plot

·  Stat > Regression > Fitted Line Plot

and select the appropriate variable for the response Y and predictor X

For the Type of Regression Model, choose Linear

Then to create the Residual Plot

·  Choose Graphs

Residuals for Plots should be automatically selected as Regular

Then in the Residuals versus the variables box, choose the same variable you chose as your predictor (X)