Simple Linear Regression Analysis (Chapter 10)

Objective: To explain, by performing regression analysis, the observed values and dispersion (variation) of a variable Y, known as the dependent/explained variable, using a linear model based on another variable X, known as the explanatory/independent variable. How good the model is statistically is checked by verifying that the observed deviations (differences) from the linear model are statistically insignificant; this is known as residuals analysis.

How do we explain a linear relationship? By statistically estimating the intercept, b0, and slope, b1, that define the regression line:

Y = b0 + b1*X + e, where e = residual.

The above regression line is estimated statistically using a random sample of paired data (x,y) that yields statistical estimates of the true slope, β1, and intercept, β0, of the true linear relation that underlies Y and X:

Y = 0 + 1*X + 

where ’s are a sequence of independent (no autocorrelation) and identically distributed, Normal random variables with mean,  = 0, and standard deviation =  (homoskedasticity—no changes in error variation associated with X). If the mean were not 0, then, the linear model would fail to be “consistent” in explaining the value of Y via the value of X. And if the errors were not all of the same standard deviation, then the estimated slope and intercept coefficients would be correlated to the value of X, and thus be biased and inconsistent, leading to the linear model being bogus.

How do we compute b0 and b1? We statistically estimate the true intercept, β0, and the true slope, β1, using a procedure known as Ordinary Least Squares (OLS) estimation. This procedure identifies the two numbers, b0 and b1, that minimize the sum of the squared sample residuals, so that these estimated coefficients maximize the explanatory power of X about Y while meeting, as far as possible, the properties that the residuals must satisfy: (i) normal distribution, (ii) independence, (iii) homoskedasticity, and (iv) no autocorrelation. MSExcel implements this procedure automatically when we use the Regression command in the Data Analysis menu; a worked sketch of the OLS formulas follows.
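As an illustration of what MSExcel computes under the hood, here is a minimal Python sketch of the textbook OLS formulas for simple regression, b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄, applied to the simulated x and y from the sketch above:

```python
import numpy as np

def ols_simple(x, y):
    """Ordinary Least Squares estimates for a simple (one-X) regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared X-deviations.
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    residuals = y - (b0 + b1 * x)   # e = observed Y minus fitted Y
    return b0, b1, residuals

b0, b1, e = ols_simple(x, y)        # x, y from the simulation sketch above
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")  # should land near the true 5.0 and 2.0
```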

How do we assess the regression?

(i) By evaluating whether the regression line “fits” the data well (using the coefficient of determination, R² ≈ 1), and

(ii) By checking that the dispersion of the actual data points around the regression line is minor (using a scatter plot of (X,Y) together with the regression line, and using the Standard Error of the Regression → 0). Both of these Regression Statistics are computed by “Tools | Data Analysis | Regression”.

(iii) Also, we can test a hypothesis to confirm that the regression line itself is statistically significant (H0: β1 = 0 vs. H1: β1 ≠ 0) using a type of hypothesis test known as an “F-test”, performed via a procedure known as Analysis of Variance (ANOVA). Ideally, MSExcel should report a small p-value (≈ 0) for this test.

(iv) Lastly, MSExcel computes a coefficients table that reports the estimated intercept, b0, and slope, b1, of the regression line, their respective standard errors, the significance test statistics, and the corresponding p-values for the null hypotheses that the true slope β1 = 0 and the true intercept β0 = 0. Again, for these estimates to be statistically significant we want MSExcel to report small p-values (≈ 0) in the coefficients table. All four of these checks are illustrated in the sketch after this list.
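Outside of MSExcel, the same four regression statistics can be reproduced in Python with the statsmodels library; a minimal sketch, again assuming the simulated x and y from the earlier sketches:

```python
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(x)           # prepend a column of 1s for the intercept
model = sm.OLS(y, X).fit()       # Ordinary Least Squares fit

print(model.rsquared)            # (i)   coefficient of determination, R-squared
print(np.sqrt(model.mse_resid))  # (ii)  standard error of the regression
print(model.f_pvalue)            # (iii) p-value of the ANOVA F-test (H0: slope = 0)
print(model.bse)                 # (iv)  standard errors of intercept and slope
print(model.pvalues)             # (iv)  p-values of the t-tests (H0: coefficient = 0)
```

Calling model.summary() prints all of these figures in a single table, much like MSExcel’s regression output.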

How do we assess the residuals?

(i) Normality of the residuals can be checked by constructing a histogram of the residuals, a Box-and-Whisker plot of the residuals, or a Normal Probability Plot of the residuals, with the assistance of MSExcel.

(ii) Autocorrelation of the residuals can be tested by computing the Durbin-Watson statistic from the residuals reported in MSExcel’s regression output, and checking that this statistic falls within the appropriate range of values according to Table E-8 in your textbook.

(iii) Homoskedasticity can be inferred from the residual plot: if the plot of the residuals, e, against X shows a systematic pattern (for example, a fan or funnel shape), then the variance of the residuals is probably not constant. The sketch after this list illustrates all three residual checks.
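For reference, a minimal Python sketch of the three residual diagnostics, assuming the residuals e produced by the OLS sketch earlier (the Durbin-Watson statistic comes from statsmodels; the plots use matplotlib):

```python
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4))

# (i) Normality: histogram and Normal probability plot of the residuals.
ax1.hist(e, bins=10)
ax1.set_title("Histogram of residuals")
stats.probplot(e, dist="norm", plot=ax2)   # points close to the line suggest normality
ax2.set_title("Normal probability plot")

# (ii) Autocorrelation: Durbin-Watson statistic (values near 2 suggest none).
print(f"Durbin-Watson: {durbin_watson(e):.3f}")

# (iii) Homoskedasticity: residuals vs. X should show no systematic pattern.
ax3.scatter(x, e)
ax3.axhline(0, linestyle="--")
ax3.set_title("Residuals vs. X")

plt.tight_layout()
plt.show()
```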

What if regression or residuals fail the above tests?

Depending on which of the seven tests/diagnostics above failed, we must employ a variety of corrective tools before we can use the regression results to infer/explain Y by using X.

Alternatively, failure of these tests/diagnostics may indicate that “X does not really explain Y”, i.e., that the relationship between the variables is coincidental rather than structural.

The next set of notes on regression analysis includes a complete regression analysis of a data set using the output from an MSExcel regression. In particular, it describes the consequences of failures of the aforementioned tests, and it touches on possible solutions to the problems of bad residuals or poor regression fits.