Chapter 27 – Inference for Regression
- The assumptions and conditions we’ll need for regression inference
- Tests about the regression coefficients
- Interpreting inference for regression
We’ve fitted regression models to the relationship between two quantitative variables before, and now we get to ask what the linear model we estimated tells us about cases other than those in our data. For that we’ll need to draw inferences.
Here’s an example:
Look at the moon with binoculars or a telescope and you’ll see craters formed by thousands of impacts. The earth, being larger, has been hit even more often. Meteor Crater in Arizona was the first recognized impact crater, and it was only identified as a crater in the 1920s. With the help of satellite images, more and more craters have been identified, so now more than 180 are known. And these, of course, are only a small sample of all the impacts the earth has experienced. Only 29% of earth’s surface is land, and many craters have been covered or eroded away. Astronomers have recognized a roughly 35-million-year cycle in the frequency of cratering. The cause of this cycle, however, is not fully understood.
On page 653, we can see a scatterplot (for one sample) of the known impact craters formed in the most recent 35 million years. You’ll notice that, to make the relationship simpler, both the ages (in millions of years before the present) and the diameters of the craters have been converted to logs. The plot looks straight, so we can consider fitting a linear model. So, basically, we would have a regression model for our one sample. But scientists want to know whether the coefficients in their models only describe the data we have or whether they tell us something more fundamental. For that we’ll need to test hypotheses.
What does it mean to test a hypothesis about regression coefficients?
We have based all of our inference methods on the insight that the sample we have is but one random selection from a larger population. We can simulate to get an idea of what that means for regression. Let’s imagine drawing several samples from the population and seeing how much they vary from sample to sample. You can imagine that, even though they are from the same population, the many different scatterplots would look different, and each would have its own least squares regression line with its own slope.
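The thought experiment above can be sketched as a small simulation. This is only an illustration under stated assumptions: the population values BETA0, BETA1, and SIGMA, the range of the x-values, and the helper names least_squares and sample_slope are all made up here and are not the crater data. It draws many samples of 37 cases from one population and records each sample’s least squares slope.

```python
import random
import statistics

# Hypothetical "true" population model (these numbers are assumptions made
# up for illustration, not the crater data): y = BETA0 + BETA1 * x + error.
BETA0, BETA1, SIGMA = 0.3, 0.5, 0.6


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sample_slope(rng, n=37):
    """Draw one random sample of n cases and return its fitted slope."""
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    y = [BETA0 + BETA1 * xi + rng.gauss(0.0, SIGMA) for xi in x]
    return least_squares(x, y)[1]


rng = random.Random(0)  # fixed seed so the run is repeatable
slopes = [sample_slope(rng) for _ in range(2000)]

print("true slope:           ", BETA1)
print("mean of sample slopes:", round(statistics.fmean(slopes), 3))
print("SD of sample slopes:  ", round(statistics.stdev(slopes), 3))
```

Every sample gives a somewhat different slope, but the slopes cluster around the population slope. The spread of those slopes is what the standard error of the slope tries to estimate from a single sample.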
So we’re thinking about an underlying linear relationship which, because it’s the true unknown relationship, we’ll write with Greek letters:
y = β0 + β1x1 + ε
where
β0 is the intercept,
β1 is the slope,
and, because we know the line can’t actually match all the values in the population,
ε is the errors: the part of each y-value that the line misses. When we work with sample data, their counterparts are the residuals.
We’ll estimate this model with the least squares regression, which we’ll write with Latin letters:
ŷ = b0 + b1x1
The slope of this regression model is the statistic and it has a sampling distribution which we can model. That means that we now have two models: the linear model that describes the relationship between the crater ages and diameters, and a model for the sampling distribution. And of course, whenever we use a model, that means that we will have assumptions and conditions that need to be checked.
Here are the assumptions and conditions. This time, they will be numbered because we have to check them in order:
1. Straight Enough Condition – The relationship must be straight or a linear model makes no sense. We look at the scatterplot to see whether it’s straight enough. For our crater example, the scatterplot looks straight.
2. Independence – The individual values must be independent of each other. For our crater example, that’s pretty likely.
3. Homoscedasticity – The Equal Variance Assumption, checked with the “Does the Plot Thicken?” Condition. The scatter of the points around the line must be consistent everywhere.
4. Nearly Normal Condition – The residuals should look nearly normal. We check that with a histogram or normal probability plot of the residuals.
We can go ahead and fit the regression model as long as the straight enough condition is satisfied. We need the other three conditions to use the sampling distribution model for inference.
We always use technology to compute regressions. It’s more than you’d want to take on by hand, and every statistics program makes a table sort of like this one:
Dependent variable is: LogDiam
R squared = 59.3%
s = 0.620 with 37 - 2 = 35 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   0.28          0.115       2.45      0.0195
LogAge      0.48          0.068       7.14      < 0.0001
Now that we have the regression model, shown in general here:
ŷ = b0 + b1x1
or, for our example, like this:
predicted LogDiam = 0.28 + 0.48 LogAge
we can compute the residuals:
e = y − ŷ
the observed values minus the corresponding predicted values, and plot them to check the last two conditions:
- Check the plot of the residuals against the predicted values – Look for additional patterns and quirks in the data. The residuals should be scattered randomly around zero, with no bends and about the same spread everywhere.
- Check the histogram of the residuals – Check to see that the histogram of the residuals is unimodal and symmetric, good enough for the Nearly Normal Condition. Another option would be to check the normal probability plot of the residuals.
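As a rough numerical companion to those two plots, here is a minimal sketch of computing residuals and comparing their spread on the two halves of the predicted values. The data are simulated (the crater measurements are not reproduced in these notes), and the helper least_squares and the name spread_ratio are assumptions introduced for illustration. A ratio like this is no substitute for actually looking at the plots.

```python
import random
import statistics


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Simulated data with equal spread everywhere (an assumption for illustration).
rng = random.Random(1)
x = [rng.uniform(-2.0, 2.0) for _ in range(200)]
y = [0.3 + 0.5 * xi + rng.gauss(0.0, 0.6) for xi in x]

b0, b1 = least_squares(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # e = y - y-hat

# Crude "does the plot thicken?" check: compare the residual spread for
# the lower and upper halves of the predicted values.
cut = statistics.median(predicted)
low = [e for e, p in zip(residuals, predicted) if p <= cut]
high = [e for e, p in zip(residuals, predicted) if p > cut]
spread_ratio = statistics.stdev(high) / statistics.stdev(low)

print("sum of residuals:", round(sum(residuals), 6))
print("spread ratio (upper half / lower half):", round(spread_ratio, 2))
```

A spread ratio far from 1 would hint that the plot thickens. The residuals from a least squares fit with an intercept always sum to zero apart from rounding error, so that line is only a sanity check on the arithmetic.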
So for our example:
- Scatterplot (check straight enough): I examined the scatterplot to check the straight enough condition. It was straight enough, so I can fit a least squares regression model.
- Fit a least squares regression model
- Independence: I thought about independence, and I concluded the meteors whose craters I know of were most likely independent.
- Plot the residuals vs. predicted values: I plotted the residuals against the predicted values to check the “Does the Plot Thicken?” Condition, and concluded that the residuals varied about the same amount everywhere. As long as I was looking at the plot, I double-checked the Straight Enough Condition and checked for outliers.
- Check the histogram: I made a histogram of the residuals and concluded that they were nearly normal.
Now that the assumptions and conditions check out, I can go ahead and use statistical inference methods to test the coefficients. Here’s that regression output table again:
Dependent variable is: LogDiam
R squared = 59.3%
s = 0.620 with 37 - 2 = 35 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   0.28          0.115       2.45      0.0195
LogAge      0.48          0.068       7.14      < 0.0001
This is a generic example. No statistics program makes one just like this, but whatever table you’re given, it will look enough like this that you won’t get lost. The values in the Coefficient column are necessary to write the regression model, but they are not the ones we want for inference. The ones we need for inference are in the other three columns: the SE(Coeff), t-ratio, and P-value. These are the ones we’re going to need.
So let’s set up the test formally.
What’s the null hypothesis? Well, to have a hypothesis, we need to have a parameter to hypothesize about.
What’s the parameter? Well, that should be easy to answer. The model is a linear model:
y = β0 + β1x1 + ε
written with Greek betas; thus, the betas are the parameters.
Now, let’s consider the slope, β1. What’s the right hypothesis? Well, that might not be so obvious. It’s
H0: β1 = 0
Why is this the null? Well, this is equivalent to saying there is no linear relationship between y and x. In our example, that’s like saying that the age of impact craters is not linearly related to their size.
Continuing: We always test it against the two-sided alternative, HA: β1 ≠ 0. To test this null hypothesis, we’ll need two things: a standard error and a sampling distribution.
Dependent variable is: LogDiam
R squared = 59.3%
s = 0.620 with 37 - 2 = 35 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   0.28          0.115       2.45      0.0195
LogAge      0.48          0.068       7.14      < 0.0001
The regression table has a column of standard errors, SE(Coeff). Those are what we need.
Our test statistic is pretty much what you should expect by now. Just like we did for means, we’ll take the difference between the observed statistic and the hypothesized value:
b1 − β1
We’ll compare it to our usual ruler, the standard error:
(b1 − β1) / SE(b1)
And, just like for means, under the assumptions we have already checked, this follows a t-distribution, only now with n − 2 degrees of freedom:
(b1 − β1) / SE(b1) ~ t with n − 2 degrees of freedom
Well, now we know how to find the probability of a difference this large: by either consulting a t-table or using technology. And that’s where we find the appropriate P-values.
Dependent variable is: LogDiam
R squared = 59.3%
s = 0.620 with 37 - 2 = 35 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   0.28          0.115       2.45      0.0195
LogAge      0.48          0.068       7.14      < 0.0001
Looking at the table, we feel confident in rejecting the null hypothesis. A t-ratio of about 7 on 35 degrees of freedom would almost never happen by chance if the true slope were zero. So we can be confident that there is a linear relationship between crater size and age when both are re-expressed by logs, and we can estimate that relationship with this regression equation:
predicted LogDiam = 0.28 + 0.48 LogAge
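To see where the t-ratio and P-value columns come from, here is a minimal sketch that recomputes them from the printed coefficients and standard errors. It uses only the Python standard library, so the t-distribution tail area is found by numerically integrating the t density with Simpson’s rule; the function names t_pdf and two_sided_p are assumptions introduced for this illustration, and a statistics package would normally do this step for you. Because the printed coefficients and standard errors are rounded, the recomputed t-ratios should be expected to differ slightly from the printed 2.45 and 7.14.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_ratio, df, steps=20000):
    """Two-sided P-value: 1 minus the area between -|t| and +|t|.

    The area from 0 to |t| is found with Simpson's rule (steps must be even)
    and doubled, because the t density is symmetric about zero.
    """
    b = abs(t_ratio)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3
    return 1 - central_area


df = 37 - 2  # n - 2 degrees of freedom, as in the printed output

# Under H0 the hypothesized value is 0, so t = (b - 0) / SE(b).
t_intercept = (0.28 - 0) / 0.115
t_slope = (0.48 - 0) / 0.068

p_intercept = two_sided_p(t_intercept, df)
p_slope = two_sided_p(t_slope, df)

print("intercept: t =", round(t_intercept, 2), " P =", round(p_intercept, 4))
print("slope:     t =", round(t_slope, 2), " P =", format(p_slope, ".1e"))
```

The slope’s P-value comes out far below 0.0001, in line with the table, and the intercept’s P-value lands near 0.02.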
As always, the key to judging how far a statistic is from the hypothesized value is to find the right standard error.
For regression the standard error formula depends on 3 things:
- The standard deviation of the residuals: In other words, how much the residuals vary (s in the computer printout above; sometimes you’ll see it written as se). More variation in the residuals means the model isn’t pinned down that firmly, so the standard error of the coefficients is bigger.
- The sample size: With more data, our estimates are less variable.
- The variability of the x’s: We’re balancing our line on the x’s. If the x’s are all bunched up together, the line isn’t going to be very stable. But if the x’s are spread out, they give us a solid basis on which to estimate our line. So, we want the x’s to have enough variation.
Putting those pieces together, the standard error of the slope looks like this:
SE(b1) = se / (sx √(n − 1))
where se is the standard deviation of the residuals and sx is the standard deviation of the x-values.
And that looks about right: more variation in the residuals means we don’t know the slope as well, so the standard error is larger. More variation in the x’s helps us estimate the slope more reliably, and so does a larger sample size, which appears under the square root just as it did for the mean.
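The three ingredients can be tried out numerically. The sketch below is an illustration on simulated data, not the crater data: the helper names slope_standard_error and make_sample and all the population values are assumptions. It computes SE(b1) = se / (sx √(n − 1)) for a baseline sample and then changes one ingredient at a time. The last lines back out the spread of the x-values implied by the printed s = 0.620, n = 37, and SE = 0.068, taking those rounded values at face value.

```python
import math
import random
import statistics


def slope_standard_error(x, y):
    """SE(b1) = s_e / (s_x * sqrt(n - 1)) for the least squares line."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(e * e for e in residuals) / (n - 2))  # n - 2 df
    s_x = statistics.stdev(x)
    return s_e / (s_x * math.sqrt(n - 1))


rng = random.Random(2)


def make_sample(n, x_spread, noise):
    """Simulated data: x uniform on [-x_spread, x_spread], normal errors."""
    x = [rng.uniform(-x_spread, x_spread) for _ in range(n)]
    y = [0.3 + 0.5 * xi + rng.gauss(0.0, noise) for xi in x]
    return x, y


base = slope_standard_error(*make_sample(37, 2.0, 0.6))
noisier = slope_standard_error(*make_sample(37, 2.0, 1.8))   # bigger s_e
bunched = slope_standard_error(*make_sample(37, 0.5, 0.6))   # smaller s_x
bigger_n = slope_standard_error(*make_sample(370, 2.0, 0.6))  # larger n

print("baseline SE:       ", round(base, 3))
print("noisier residuals: ", round(noisier, 3))
print("bunched-up x's:    ", round(bunched, 3))
print("ten times the data:", round(bigger_n, 3))

# Rearranging the formula with the printed (rounded) values:
# s_x = s_e / (SE(b1) * sqrt(n - 1))
implied_s_x = 0.620 / (0.068 * math.sqrt(37 - 1))
print("implied SD of LogAge:", round(implied_s_x, 2))
```

Noisier residuals and bunched-up x-values both inflate the standard error, while more data shrinks it, which matches the reasoning behind the formula.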
How should we interpret the regression itself? Well, the regression model tells us that older craters are larger on average. But be careful interpreting the regression. Regression models don’t support a causal interpretation. We cannot say that changes in x will result in changes in y of the size and direction predicted by the model. If you think about it, it makes no sense to say that increasing the age of a crater increases its diameter. It’s just that those craters that are older are, on average, larger. But can we conclude that craters, and, we might hope, the meteors that hit the earth, are getting smaller? Well, it is true that older craters are larger on average, but we cannot interpret the regression model as an explanation of that pattern. In fact, it’s likely that at least part of what we see is that smaller craters don’t survive the ravages of nature over the 35 million years of this study as well as the larger ones do. Maybe the relationship we’ve estimated is due to other variables influencing both age and diameter, or to a biased sample presented by the geological record.
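The biased-sample explanation can be illustrated with a toy simulation. Everything in it is hypothetical: the population of craters, the erosion rule in survives, and all the numbers are assumptions chosen only to show the mechanism, not estimates from the geological record. Log age and log diameter are generated independently, so the true slope is zero, and then small, old craters are made less likely to survive into the observed sample.

```python
import math
import random
import statistics

rng = random.Random(3)


def slope(points):
    """Least squares slope for a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical population: log age and log diameter are independent,
# so there is no real relationship between them.
craters = [(rng.uniform(-1.0, 1.5), rng.gauss(0.5, 0.7)) for _ in range(5000)]


def survives(log_age, log_diam):
    """Assumed erosion rule: old, small craters are the least likely to last."""
    chance = 1 / (1 + math.exp(-3 * (log_diam - 0.6 * log_age)))
    return rng.random() < chance


observed = [(a, d) for a, d in craters if survives(a, d)]

slope_all = slope(craters)
slope_observed = slope(observed)

print("craters formed:  ", len(craters), " slope:", round(slope_all, 3))
print("craters observed:", len(observed), " slope:", round(slope_observed, 3))
```

In this made-up world, the full population shows essentially no slope, yet the surviving craters show a clearly positive one. A statistically significant slope can therefore reflect how the sample was filtered rather than anything about how craters form.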
Because regression output tables always include the t-statistics and P-values, you’ll find that it’s easy to test the standard null hypotheses about your regression coefficients. Maybe too easy.
Be sure you remember what we’ve talked about here:
- Only fit regression models to relationships that are straight. Check the scatterplot.
- If you want to use inference, be sure to check the other 3 assumptions and conditions. That means making a scatterplot of the residuals against the predicted values and checking the distribution of the residuals.
- Use the appropriate Student’s t model for your test. Usually the computer makes that easy.
- Even when you can confidently reject the null hypothesis, don’t be drawn into thinking that there’s a causal connection between x and y. Different x values may very well go with different y values, but we cannot conclude that changing an individual x value will change the corresponding y value.