Re-Expressing Data: Get It Straight!

Chapter 10

Introduction

All quantitative data come to us measured in some way, with units specified. But maybe those units aren’t the best choice.

Why bother changing units? Because some expressions of the data may be easier to think about, and some may be much easier to analyze with statistical methods.

Straight to the Point

We cannot use a linear model unless the relationship between the two variables is linear. Often re-expression can save the day, straightening bent relationships so that we can fit and use a simple linear model.

Some ways to re-express data are with logarithms, reciprocals, square roots, squaring, etc. Re-expressions can be seen in everyday life—everybody does it.

The relationship between fuel efficiency (in miles per gallon) and weight (in pounds) for late model cars looks fairly linear at first:

Discuss what you see here in this scatterplot:

A look at the residuals plot shows a problem:

The original scatterplot shows a negative direction, roughly linear shape, and strong relationship. There do not seem to be any outliers or unusual features.

However, there is a definite bend there in the residuals. Let’s just give up…

No, wait! All is not lost!! We can re-express fuel efficiency as gallons per hundred miles (a reciprocal) and eliminate the bend in the original scatterplot:

The bend in the relationship between Fuel Efficiency and Weight is the kind of failure to satisfy the conditions for an analysis that we can repair by re-expressing the data. This scatterplot is more nearly linear, but the re-expression changes the direction of the relationship.

The direction of the association is positive now because we are measuring gas consumption and heavier cars consume more gas per mile. This new model makes better predictions than the previous one!
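
As a small illustration of the reciprocal re-expression described above, here is a Python sketch that converts miles per gallon to gallons per hundred miles. The function name is our own, chosen only for illustration.

```python
def gallons_per_100_miles(mpg):
    """Re-express fuel efficiency (miles per gallon) as fuel consumption
    (gallons per 100 miles). This is a reciprocal, scaled by 100."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    return 100.0 / mpg


# A heavier car with lower mpg consumes more gallons per 100 miles,
# which is why the direction of the association flips to positive.
light_car = gallons_per_100_miles(32)   # an 1875-lb car from the data table
heavy_car = gallons_per_100_miles(14)   # the 4000-lb car from the data table
print(round(light_car, 2), round(heavy_car, 2))
```

The heavier car gets the larger value, so Weight and gallons per hundred miles rise together.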

A look at the residuals plot for the new model seems more reasonable:

Gallons per hundred miles – what an absurd way to measure fuel efficiency!! Who would ever do it that way?!

Answer: everyone except US drivers. Most of the world asks, “I’ve got to go 100 km; how much gas do I need?” Americans ask, “I’ve got 10 gallons in the tank; how far can I drive?” (We’ll revisit this example in a little bit.)

Re-expressing data lets us think about the data differently without changing what they mean.

Goals of Re-expression

Goal 1: Make the distribution of a variable (as seen in its histogram, for example) more symmetric.

Goal 2: Make the spread of several groups (as seen in side-by-side boxplots) more alike, even if their centers differ.

Goal 3: Make the form of a scatterplot more nearly linear (because linear scatterplots are easier to model!)

Goal 4: Make the scatter in a scatterplot spread out evenly rather than thickening at one end.

This can be seen in the two scatterplots we just saw with Goal 3:

Practice Exercises

Page 239 #1 – 4

Practice Answers

Homework

Revisiting some things you learned in Algebra 2…

Page 239 #5, 6, 7

The Ladder of Powers

There is a family of simple re-expressions that move data toward our goals in a consistent way. This collection of re-expressions is called the Ladder of Powers.

The Ladder of Powers orders the effects that the re-expressions have on data.

Where to start? You may be wondering what to do to the data to re-express them. It turns out that certain kinds of data are more likely to be helped by particular re-expressions.
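
For reference, the standard Ladder of Powers runs through the powers 2, 1, 1/2, "0" (where the logarithm stands in for the zero power), -1/2, and -1. A minimal Python sketch of applying one rung follows. The function name and the choice of base-10 logs are illustrative assumptions, and negative powers are negated (as in the fishing line example later in this chapter) so that the direction of a relationship is preserved.

```python
import math

# Rungs of the Ladder of Powers, from top to bottom.
LADDER = [2, 1, 0.5, 0, -0.5, -1]


def reexpress(value, power):
    """Re-express a single positive data value using one rung of the ladder.

    power == 0 stands for the logarithm (base 10 here, an arbitrary choice).
    Negative powers are negated so larger values stay larger after
    re-expression, preserving the direction of any relationship.
    """
    if value <= 0:
        raise ValueError("this sketch assumes positive data values")
    if power == 0:
        return math.log10(value)
    result = value ** power
    return -result if power < 0 else result


# Show every rung applied to one value.
for p in LADDER:
    print(p, reexpress(100.0, p))
```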

Just Checking

You want to model the relationship between the number of birds counted at a nesting site and the temperature (in degrees Celsius). The scatterplot of counts vs. temperature shows an upwardly curving pattern, with more birds spotted at higher temperatures. What transformation (if any) of the bird counts might you start with?

You want to model the relationship between prices for various items in Paris and in Hong Kong. The scatterplot of Hong Kong prices vs. Parisian prices shows a generally straight pattern with a small amount of scatter. What transformation (if any) of the Hong Kong prices might you start with?

You want to model the population growth of the US over the past 200 years. The scatterplot shows a strongly upwardly curved pattern. What transformation (if any) of the population might you start with?

Example: Cars 1991

Fuel efficiency (mpg) vs. Weight for 38 cars as reported by Consumer Reports

Weight (lb) / Fuel Eff. (mpg)
1875 / 32
1875 / 35
1925 / 31
1925 / 34
1940 / 33
2200 / 37
2225 / 29
2230 / 30
2240 / 30.5
2245 / 34
2250 / 31
2255 / 27
2600 / 27
2620 / 22
2625 / 26
2625 / 28
2620 / 34
2700 / 28
2700 / 26
2725 / 27
2800 / 22
2825 / 20
2825 / 23
2850 / 22
3025 / 21
3100 / 15
3375 / 14
3375 / 18
3355 / 21
3500 / 18
3575 / 17
3700 / 16
3800 / 15
3790 / 16
3850 / 15
3850 / 16
3900 / 13
4000 / 14

Write the equation of the linear model for these data.

Make a scatterplot of the data using your calculator.
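
For anyone working without the calculator, here is a rough Python sketch of the same steps applied to the 38 cars listed above: fit a least-squares line to mpg vs. weight, then to gallons per hundred miles vs. weight. The helper function is our own illustration, not part of the original lesson. R² is printed only as one summary; the residuals plots remain the real check.

```python
# (weight in pounds, fuel efficiency in mpg) for the 38 cars in the table
cars = [
    (1875, 32), (1875, 35), (1925, 31), (1925, 34), (1940, 33), (2200, 37),
    (2225, 29), (2230, 30), (2240, 30.5), (2245, 34), (2250, 31), (2255, 27),
    (2600, 27), (2620, 22), (2625, 26), (2625, 28), (2620, 34), (2700, 28),
    (2700, 26), (2725, 27), (2800, 22), (2825, 20), (2825, 23), (2850, 22),
    (3025, 21), (3100, 15), (3375, 14), (3375, 18), (3355, 21), (3500, 18),
    (3575, 17), (3700, 16), (3800, 15), (3790, 16), (3850, 15), (3850, 16),
    (3900, 13), (4000, 14),
]


def least_squares(xs, ys):
    """Return (intercept, slope, r_squared) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


weights = [w for w, _ in cars]
mpg = [m for _, m in cars]
gal_per_100mi = [100.0 / m for m in mpg]  # reciprocal re-expression

b0, b1, r2 = least_squares(weights, mpg)
print(f"mpg-hat = {b0:.3f} + {b1:.5f} * weight, R^2 = {r2:.3f}")

c0, c1, r2_recip = least_squares(weights, gal_per_100mi)
print(f"gal/100mi-hat = {c0:.3f} + {c1:.5f} * weight, R^2 = {r2_recip:.3f}")
```

The slope is negative for mpg and positive for gallons per hundred miles, matching the change of direction discussed earlier.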

Homework

Page 240 #8, 9, 11, 12

Step-by-Step Example

Standard fishing line comes in a range of strengths, usually expressed as “test pounds.” Five-pound test lines, for example, can be expected to withstand a pull of up to five pounds without breaking. The convention in selling fishing line is that the price of a spool doesn’t vary with strength. Higher test pound line is thicker, though, so spools of fishing line hold about the same amount of material. Let’s look at Length and Strength of spools of fishing line manufactured by the same company and sold for the same price at one store.

Question

How are the Length on the spool and the Strength related? And what re-expression will straighten the relationship?

THINK

I want to fit a linear model for the length and strength of fishing line.

I have the length and “pound test” strength of fishing line sold by a single vendor at a particular store.

Let Length = length in yards of fishing line on the spool

Strength = the test strength in pounds

The plot shows a negative direction and an association that has little scatter but is not straight.

SHOW

Let’s try a re-expression of the data to make it more nearly linear.

Below is a scatterplot of the square root of Length against the strength:

The plot is less bent, but still not straight.

Let’s try the logarithm of Length against Strength:

This is much better, but still not straight, so let’s take another step along the ladder to the reciprocal:

Maybe now I have moved too far along the ladder. A half-step back is the -½ power, the negative reciprocal square root. (We use the negative to preserve the direction of the relationship.)

TELL

It’s hard to choose between the last two alternatives. Either of the last two choices is good enough. What should we choose? I’m going to go with the negative reciprocal of the length.

Now that the re-expressed data satisfy the Straight Enough Condition, we can fit a linear model by least squares. Using a calculator, I found:

We can use this model to predict the length of a spool of, say, 35-pound test line:

We could leave the result in these units, but here we want to transform the predicted value back into yards. Length = -1/(-0.001449) ≈ 690 yards.
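
The back-transformation above simply undoes the negative reciprocal: if the model predicts -1/Length, then Length = -1/(predicted value). A small Python sketch using the predicted value quoted above (the function name is an illustrative assumption):

```python
def length_from_negative_reciprocal(predicted):
    """Undo the re-expression y = -1/Length to recover Length in yards."""
    if predicted >= 0:
        raise ValueError("the negative reciprocal of a positive length must be negative")
    return -1.0 / predicted


predicted = -0.001449  # model's prediction of -1/Length for 35-pound test line
length_yards = length_from_negative_reciprocal(predicted)
print(round(length_yards))  # about 690 yards
```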

Example

Plan B: Attack of the Logarithms

When none of the data values is zero or negative, logarithms can be a helpful ally in the search for a useful model. Try taking the logs of both the x- and y-variables. Then re-express the data using some combination of x or log(x) vs. y or log(y).
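
A minimal Python sketch of this trial-and-error approach: compute the correlation for each pairing of x or log(x) with y or log(y) and see which looks straightest. The data here are made up (an exact power law, y = 3x²) purely to show the mechanics. Correlation is only a rough screen, so a scatterplot and residuals plot should still be checked.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_combinations(xs, ys):
    """Return r for each pairing of x or log(x) with y or log(y).

    All values must be positive, since the log is undefined otherwise.
    """
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    return {
        "x vs y": correlation(xs, ys),
        "log(x) vs y": correlation(log_x, ys),
        "x vs log(y)": correlation(xs, log_y),
        "log(x) vs log(y)": correlation(log_x, log_y),
    }


# Hypothetical data following y = 3 * x**2 (not from the lesson).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3 * x ** 2 for x in xs]
results = log_combinations(xs, ys)
for name, r in results.items():
    print(f"{name}: r = {r:.4f}")
```

For a power-law relationship like this one, the log(x) vs. log(y) pairing is the one that comes out straight.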

TI Tips

Let’s revisit the Arizona State tuition data. Recall that when we tried to fit a linear model to the yearly tuition costs, the residuals plot showed a distinct curve.

This curved pattern indicates that data re-expression may be in order. If you have no clue which re-expression to try, the Ladder of Powers may help. We just used that approach in the fishing line example. Here, though, we can play a hunch. It is reasonable to suspect that tuition increases at a relatively consistent percentage year by year. This suggests that using the logarithm of tuition may help.

Tell the calculator to find the logs of the tuitions and store them as a new list. Perform the regression for the logarithm of tuition vs. year.

Do you know what the model’s equation is? Remember, it involves a logarithm!

Can you estimate the tuition for 2001? Make sure you think!!!
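
The tuition figures themselves are not listed here, so the following Python sketch uses made-up numbers (a tuition that starts at 1000 and grows exactly 5% per year) only to show the mechanics: take logs, fit a line, and then undo the log when predicting. All names and values are hypothetical.

```python
import math

# Hypothetical data: years since the start and a tuition growing 5% per year.
years = list(range(0, 10))
tuition = [1000 * 1.05 ** t for t in years]

# Re-express tuition with base-10 logs, then fit a least-squares line.
log_tuition = [math.log10(v) for v in tuition]
n = len(years)
mean_x = sum(years) / n
mean_y = sum(log_tuition) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, log_tuition))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x


def predict_tuition(year):
    """The model predicts log10(tuition); raise 10 to that power to undo the log."""
    return 10 ** (intercept + slope * year)


print(f"log10(tuition)-hat = {intercept:.4f} + {slope:.4f} * year")
print(f"Predicted tuition in year 10: {predict_tuition(10):.2f}")
```

Raising 10 to the slope gives the yearly growth factor (1.05 for these made-up data), which is why a steady percentage increase shows up as a straight line after taking logs.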

Example

Multiple Benefits

We often choose a re-expression for one reason and then discover that it has helped other aspects of an analysis. For example, a re-expression that makes a histogram more symmetric might also straighten a scatterplot or stabilize variance.

Why Not Just Use a Curve?

If there’s a curve in the scatterplot, why not just fit a curve to the data?

The mathematics and calculations for “curves of best fit” are considerably more difficult than “lines of best fit.” Besides, straight lines are easy to understand. We know how to think about the slope and the y-intercept.

What Can Go Wrong?

Don’t expect your model to be perfect.

Don’t stray too far from the ladder.

Don’t choose a model based on R² alone.

Beware of multiple modes.

Re-expression cannot pull separate modes together.

Watch out for scatterplots that turn around.

Re-expression can straighten many bent relationships, but not those that go up then down, or down then up.

Watch out for negative data values.

It’s impossible to re-express negative values by any power on the Ladder of Powers that is not a whole number, or to re-express values that are zero by negative powers.

Watch for data far from 1.

Data values that are all very far from 1 may not be much affected by re-expression unless the range is very large. If all the data values are large (e.g., years), consider subtracting a constant to bring them back near 1.
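
To see why values far from 1 resist re-expression, the Python sketch below compares the logarithm of raw years with the logarithm of the same years after subtracting a constant. The years and the constant 1989 are arbitrary illustrative choices.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


years = list(range(1990, 2001))          # raw years, all far from 1
shifted = [y - 1989 for y in years]      # 1, 2, ..., 11, back near 1

r_raw = correlation(years, [math.log10(y) for y in years])
r_shifted = correlation(shifted, [math.log10(s) for s in shifted])

# r_raw is almost exactly 1: over this narrow range the log of a raw year is
# nearly a straight-line function of the year, so the re-expression changes
# almost nothing.
# r_shifted is noticeably below 1: after shifting, the log really does bend
# the values, so it can do some work.
print(f"r(year, log year)       = {r_raw:.6f}")
print(f"r(shifted, log shifted) = {r_shifted:.6f}")
```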

What have we learned?

When the conditions for regression are not met, a simple re-expression of the data may help.

A re-expression may make the:

Distribution of a variable more symmetric.

Spread across different groups more similar.

Form of a scatterplot straighter.

Scatter around the line in a scatterplot more consistent.

Taking logs is often a good, simple starting point.

To search further, the Ladder of Powers or the log-log approach can help us find a good re-expression.

Our models won’t be perfect, but re-expression can lead us to a useful model.

Homework

Page 241 #15, 17, 27 and Alligators Problem (above)