Regression and Correlation In-Class Demonstration
Overview
Multiple regression is a very flexible methodology that is used widely in the social and behavioral sciences. Using SPSS and data from the 2002 GSS, I will demonstrate some of the uses of simple and multiple regression.
Operationalization of Status Attainment Theory
Let’s suppose we were interested in researching the determinants of education. Imagine that we developed a status attainment theory of advantage in educational achievement across generations of family members. We could look at whether a respondent’s educational attainment is related to his or her father’s educational attainment:
The following figure shows a scatter diagram of a respondent’s highest year of school completed against the respondent’s father’s highest year of school completed.
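For those who want to reproduce the plot, a minimal SPSS syntax sketch follows. It assumes the standard GSS variable names educ (respondent’s highest year of school completed) and paeduc (father’s highest year of school completed); check the names in your own data file.

* Scatter plot of respondent’s education against father’s education.
GRAPH
  /SCATTERPLOT(BIVAR)=paeduc WITH educ.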
Notice that there is a positive relationship between these two interval-ratio variables. High values of respondent’s education are associated with high values of father’s education and low values of respondent’s education are associated with low values of father’s education. The relationship looks roughly linear.
We can estimate a regression equation that models the linear relationship between respondent’s education and father’s education. Because the father’s education comes first in time and plausibly shapes the respondent’s attainment, father’s education is treated as the independent variable and respondent’s education as the dependent variable. The simple regression for the effect of father’s education on respondent’s education yields the following statistics:
Model Summary
Model / R / R Square / Adjusted R Square / Std. Error of the Estimate
1 / .421(a) / .177 / .177 / 2.636
a Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
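A syntax sketch that would produce this model, again assuming the variable names educ and paeduc:

* Simple regression of respondent’s education on father’s education.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT educ
  /METHOD=ENTER paeduc.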
Notice that the results include a number of statistics that we have already talked about, including Pearson’s correlation coefficient (r), the coefficient of determination (r²), the residual sum of squares, the total sum of squares, an estimated intercept (labeled “Constant”), and an estimated slope for father’s education.
How do we interpret each of these statistics?
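One way to organize the interpretation is to write out the estimated prediction equation in general form:

predicted respondent’s education = a + b × (father’s education)

where a is the intercept (the predicted education of a respondent whose father completed zero years of school) and b is the slope (the predicted change in respondent’s education for each additional year of father’s education). The R Square of .177 in the table above tells us that father’s education accounts for about 17.7 percent of the variation in respondent’s education.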
There are also a number of other statistics that we have not yet discussed. Among these are the standard error and the t statistic, which are used to determine whether each independent variable in the regression model has a statistically significant effect on the dependent variable. We will discuss them in more detail in later lectures.
Multiple Regression
The real world is complex and a single variable is often not sufficient to explain variation in another variable. Regression analysis is useful because it is flexible enough to be expanded to simultaneously estimate the independent effects of a number of independent variables on a single dependent variable. This type of extension of the simple regression model is referred to as multiple regression.
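In general form, a multiple regression with k independent variables can be written as:

predicted Y = a + b1X1 + b2X2 + … + bkXk

where a is the intercept and each slope bj gives the effect of Xj on Y, holding the other independent variables constant.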
Suppose that we thought that other factors, in addition to father’s educational attainment, affected the respondent’s educational attainment. Let’s say that we thought that education was related to the age of the respondent. We can simultaneously look at the effects of respondent’s age and father’s educational attainment on respondent’s educational attainment. The scatter plot of all three variables would look like the following:
This plot shows the educational attainment of the respondent plotted on the Y axis, with the father’s educational attainment and the age of the respondent plotted on the X and Z axes, respectively. Notice that a regression line is no longer adequate for modeling these data. Since the data occupy three dimensions, we need a regression plane.
The dimensionality of the data changes with each additional independent variable, and beyond three dimensions it is difficult for us to visualize the data. In general, a regression with p independent variables fits a p-dimensional surface in a (p + 1)-dimensional space, choosing the surface that minimizes the squared prediction errors. We can estimate the multiple regression equation for the effects of respondent’s age and father’s educational attainment on respondent’s educational attainment.
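A syntax sketch for this model, assuming the GSS variable name age for respondent’s age. Each /METHOD=ENTER line adds a block of variables, so the output reports one model per block.

* Two blocks: model 1 enters paeduc; model 2 adds age.
REGRESSION
  /DEPENDENT educ
  /METHOD=ENTER paeduc
  /METHOD=ENTER age.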
The output shows two regression equations. The first is the original regression showing the effect of father’s education attainment on respondent’s educational attainment. The second shows the effects of both respondent’s age and father’s education attainment on respondent’s educational attainment.
Notice the changes in the statistics. What happened to the value of R-Square when we added the effect of age? Did the effect of the intercept or the slope of father’s education change? Why is that?
Nonlinear Effects
Regression models measure the linear association between independent variables and a single dependent variable. However, sometimes the relationship between two variables is not linear. Let’s look at the relationship between respondent’s education and respondent’s age:
Notice that for younger respondents (recall that the GSS collects data on anyone 18 years or older) there is an apparent positive relationship between age and education. For older respondents, there is an apparent negative relationship.
Why do you think that this is the case?
Regression models are flexible enough to handle nonlinear effects. For instance, one can include a second order polynomial (a squared term) to model a curvilinear relationship.
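In equation form, the curvilinear model is:

predicted education = a + b1(father’s education) + b2(age) + b3(age²)

so the effect of age is spread across two coefficients, b2 and b3.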
In this case we can construct an age-squared term by creating a new variable that equals the square of the age variable.
We can then estimate the effect of age, including the age-squared term, as sketched below.
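A syntax sketch, continuing with the assumed variable names: COMPUTE builds the squared term, and a third /METHOD=ENTER block adds it to the model.

* Create the squared term, then add it as a third block.
COMPUTE agesq = age**2.
EXECUTE.
REGRESSION
  /DEPENDENT educ
  /METHOD=ENTER paeduc
  /METHOD=ENTER age
  /METHOD=ENTER agesq.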
The output shows results for three equations. The first two have already been discussed. The third shows a model that adds the age-squared term. How does the addition of an age-squared term change the model estimates?
To understand the effect of age we need to consider both the age and age-squared terms. If we made a table of age and age squared, it would look like the following (recall that age ranges from 18 to 89):
Age / Age Square
18 / 324
19 / 361
20 / 400
21 / 441
22 / 484
23 / 529
24 / 576
25 / 625
26 / 676
27 / 729
28 / 784
29 / 841
30 / 900
31 / 961
32 / 1024
33 / 1089
34 / 1156
35 / 1225
36 / 1296
… / …
89 / 7921
If we weighted the age and age-squared values by their estimated slopes and graphed the result, we would get the following graph:
Notice that in this way we can model nonlinear effects in a regression model.
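As a sketch of how that graph could be built: compute the age portion of the prediction from the two estimated slopes and plot it against age. The coefficient values below are hypothetical placeholders for illustration only; substitute the slopes from the actual output.

* The slopes 0.30 and -0.003 are hypothetical, not the estimated values.
COMPUTE agefx = 0.30*age - 0.003*agesq.
EXECUTE.
GRAPH
  /SCATTERPLOT(BIVAR)=age WITH agefx.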
Ordinal and Nominal Independent Variables
Regression analysis is flexible enough to include the effects of interval-ratio as well as nominal and ordinal variables. However, nominal and ordinal variables can only be included in the regression model as a series of mutually exclusive and exhaustive dichotomous variables. Why do you think this is the case?
Let’s suppose that we wanted to use a measure of social class to examine the determinants of educational attainment. The GSS includes data on a respondent’s social class, which is measured as an ordinal variable:
A scatter diagram of social class and educational attainment would be difficult to interpret, so it is better to split the file by the class variable and look at separate histograms of educational attainment for each class.
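A syntax sketch for the split-file histograms, assuming the GSS variable name class for the social class measure:

* One histogram of education per class category.
SORT CASES BY class.
SPLIT FILE LAYERED BY class.
FREQUENCIES VARIABLES=educ
  /FORMAT=NOTABLE
  /HISTOGRAM.
SPLIT FILE OFF.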
Notice that there are differences in the distribution and in the average educational attainment for different classes. Roughly, the higher the respondent’s social class the more education the respondent is likely to have achieved.
In order to include a measure of social class in the model, we have to turn the class variable into a series of dichotomous variables. Since there are four classes, we can make four separate variables for class (LOWER CLASS, WORKING CLASS, MIDDLE CLASS, and UPPER CLASS). Using the recode procedure in SPSS, I created these four variables.
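A syntax sketch of the recodes, assuming the class codes run 1 = lower through 4 = upper (check the value labels in your own file); codes outside 1 through 4 are set to missing.

* One dummy variable per class category.
RECODE class (1=1)(2,3,4=0)(ELSE=SYSMIS) INTO lowclass.
RECODE class (2=1)(1,3,4=0)(ELSE=SYSMIS) INTO wrkclass.
RECODE class (3=1)(1,2,4=0)(ELSE=SYSMIS) INTO midclass.
RECODE class (4=1)(1,2,3=0)(ELSE=SYSMIS) INTO uppclass.
EXECUTE.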
These variables can collectively be used in the regression to measure the effect of social class. However, information on any three of the four variables is sufficient to predict the value of the fourth, so entering all four into the regression model would introduce redundant information. Because each variable is a perfect linear combination of the other three, including all of them would make it impossible to estimate the model.
Therefore, we purposely exclude one category, which becomes the reference category. The effects of the non-excluded variables are interpreted relative to the omitted category. In the following example, I exclude “middle class,” and I interpret the effect of each non-excluded variable as the average difference in educational attainment between that category and the reference category (in this case, “middle class”).
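A syntax sketch of this model, entering three of the four class dummies (midclass is the omitted reference) alongside the earlier predictors:

* Middle class is the reference category, so midclass is left out.
REGRESSION
  /DEPENDENT educ
  /METHOD=ENTER paeduc age agesq lowclass wrkclass uppclass.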
The regression results are the following:
What happened to the fit statistics and to the slope and intercept estimates when the effect of social class was included in the model?
Model Building
The effect of any variable that is not included in the regression equation becomes part of the error term, an unobserved component of the model that is estimated by the residuals. Just as important as what is estimated in the model is what is not estimated in the model. Given this, how do we decide what to include in the model?
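In equation form, the full model is:

Y = a + b1X1 + b2X2 + … + bkXk + e

where the error term e absorbs the combined influence of every omitted variable, along with measurement error and chance variation.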