AP Statistics: R² Explained

BVD: Chapter 8

Questions:

1. Imagine if two variables were PERFECTLY correlated. In other words, they were perfectly linearly associated. How much of the variation in the response variable would be explained by the linear model?

2. What if there were NO linear association between the explanatory variable and the response variable? How much of the variation in the response variable could be explained by the linear model?
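Once you have written down your answers to Questions 1 and 2, a quick check like the one below may help. It is only a sketch: the data are hypothetical (invented for illustration, not taken from this worksheet), and `r_squared` is a helper name made up here. It squares the correlation coefficient for one set of points that lies exactly on a line and for another set that has a clear pattern but no linear association.

```python
# Hypothetical data, invented for illustration only (not from the worksheet).

def r_squared(xs, ys):
    """Square of the correlation coefficient r between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

xs = [-2, -1, 0, 1, 2]
perfect_line = [3 * x + 1 for x in xs]  # every point lies exactly on a line
no_linear = [x ** 2 for x in xs]        # a clear pattern, but not a linear one

r2_perfect = r_squared(xs, perfect_line)
r2_none = r_squared(xs, no_linear)
print(r2_perfect, r2_none)
```

Compare the two printed values with what you wrote for Questions 1 and 2.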

3. What about the scatterplot below—how much of the variation in the response variable could be explained by a linear model?

4. Fortunately, there is a calculation that tells us the percent of variation in the response variable that can be explained by the linear model. It’s called r-squared, and yes, it literally is the square of the correlation coefficient. The following pages show why r-squared measures this.

5. The scatterplot below shows the gestation in days of several animals plotted against their average longevity (lifespan) in years. Is a linear model appropriate? Looking at the residual plot, we will assume a linear model is appropriate.

6. How much TOTAL variation would we say exists in the response variable (longevity)?

Variation from WHAT, you might ask? Well…how do we normally calculate variation among data? Each data point varies a certain amount from the ______ of the data, right? And the first calculation we learned that measures variation from the ______ in a data set was the ______.

So that’s what we will calculate: the ______ of longevity. How can we do that if we don’t have the data? Turn to the next page and see if you can calculate it from what you see…

7. But wait…a picture might be in order first. See if you can tell how the scatterplot below can help us visualize this calculation we are about to do. (We want to calculate the variance of longevity, which, thinking of the formula, is essentially the ______ of the ______ of the ______ between each ______ and the ______ of longevity, divided by ______.) Make a couple of additions to the graph below that indicate you know what you are doing…

8. Now use the chart below to calculate the variance of longevity. VAR = ______
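If you want to check your arithmetic after filling in the chart, the calculation can be sketched in a few lines. The longevity values below are hypothetical (invented for illustration), so swap in the values from the chart. The name `sample_variance` is also just a label chosen here.

```python
# Hypothetical longevity values in years, invented for illustration.
# Replace them with the values from the worksheet's chart.
longevity = [5, 8, 12, 15, 20]

def sample_variance(values):
    """Variance measured against the mean, using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / (n - 1)

var_longevity = sample_variance(longevity)
print(var_longevity)
```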

This number could be considered to be the “total” variation in longevity. With a little thought, though, the “variation” in longevity can be calculated to be a smaller number. Looking at the scatterplot above, you might realize that one reason we got such a “large” variance is that we calculated the variation against the MEAN of longevity. We could calculate a smaller variance if we measured it against something different. Something that would “fit” the data better. Any guesses here?

9. RIGHT! What if we measured the variation of longevity against the least squares line? Surely there would be less variation since this line has a slant and seems to “fit” the data better. The mean line we measured against earlier was horizontal and thus produced more variation from each data point.

Looking at the two graphs side by side, you may even be able to tell that the variation (sum of squares, etc.) will be less if you measure it against the least squares line instead of against the mean. Compare the graphs below:

So now the question is, how much of the total variation in longevity has been “accounted for” or “explained” by the linear model? One way to answer that question is to first figure out how much variation is still present, even WITH the linear model. The variation still present can be found by looking at the ______.

10. All we need to do to calculate the amount of variation (variance) still around is to calculate the variance of the residuals of longevity:
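The idea in Question 10 can be sketched as follows. This is an illustration under stated assumptions, not the worksheet's own calculation: the gestation and longevity pairs are hypothetical (invented here), and `least_squares_line` and `sample_variance` are helper names made up for this sketch. It fits the least squares line, finds each residual (actual minus predicted longevity), and takes the variance of those residuals.

```python
# Hypothetical (gestation in days, longevity in years) pairs, invented for illustration.
gestation = [30, 60, 110, 150, 250]
longevity = [5, 8, 12, 15, 20]

def sample_variance(values):
    """Variance measured against the mean, using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(gestation, longevity)

# Residual = actual longevity minus the longevity predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(gestation, longevity)]

var_residuals = sample_variance(residuals)  # variation "still around"
var_longevity = sample_variance(longevity)  # total variation
print(var_residuals, var_longevity)
```

With a least squares line the residuals average to zero, so their variance measures spread around the line in the same way the variance of longevity measures spread around the mean.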

11. So how much variation is “still around?” (variance of residuals of longevity) ______

How much variation is there total? (variance of longevity) ______

What PERCENT of total variation is “still around?” ______

Therefore, what percent of variation must have been accounted for/soaked up/explained by the linear model (the least squares regression line)?

______

This number is r-squared. Go back a page and find r-squared printed next to the least squares equation (below the scatterplot). Notice that it is ______!!
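In symbols, the steps of Question 11 can be summarized as below. The notation is an assumption introduced here, not the worksheet's: s_y^2 is the variance of longevity, s_e^2 is the variance of the residuals, \hat{y}_i is the longevity predicted by the least squares line for animal i, \bar{y} is the mean longevity, and n is the number of animals. Because the residuals from a least squares line average to zero, their variance is just their sum of squares divided by n - 1.

```latex
\[
r^2 \;=\; 1 - \frac{s_e^2}{s_y^2}
\;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \,/\, (n-1)}
               {\sum_{i=1}^{n} (y_i - \bar{y})^2 \,/\, (n-1)}
\;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
               {\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
```

The n - 1 divisors cancel, which is why comparing the two variances gives the same answer as comparing the two sums of squares.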

12. Based on this lesson, write an interpretation of r-squared in the context of these data: