Introduction to Regression and Data Analysis

Statlab Workshop

February 28, 2003

Tom Pepinsky and Jennifer Tobin

I.Regression: An Introduction

  A.What is regression?

The idea behind regression in the social sciences is that the researcher would like to find the relationship between two or more variables. Regression is a statistical technique that allows the scientist to examine the existence and extent of such a relationship. Given either the entire population or a random sample of sufficient size, regression makes it possible to estimate the parameters that describe the relationship between the variables. Once the researcher has estimated such a relationship, she can use the parameters to predict values of the dependent variable for new values of the independent variable. Regression does not require the independent variables to be measured in any particular way (discrete, continuous, binary, etc.), but in order for regression to be the appropriate technique, the Gauss-Markov assumptions must be fulfilled.

In its simplest (bivariate) form, regression shows the relationship between one independent variable (X) and a dependent variable (Y). The magnitude and direction of that relationship are given by a slope parameter (β1), and an intercept term (β0) captures the value of the dependent variable when the independent variable is zero. A final error term (u) captures the variation that is not explained by the slope and intercept. The coefficient of determination (R2) shows how well the fitted values match the data. More sophisticated forms of regression allow for more independent variables, interactions between the independent variables, and other complexities in the way that one variable affects another.

Regression thus shows us how variation in one variable co-occurs with variation in another. What regression cannot show is causation; causation can only be established analytically, through substantive theory. For example, a regression with shoe size as the independent variable and foot size as the dependent variable would show a very high R2 and highly significant parameter estimates, but we should not conclude that a larger shoe size causes a larger foot size. All that the mathematics can tell us is whether or not the two are correlated, and if so, by how much.

B.Difference between correlation and regression.

It is important to recognize that regression analysis is fundamentally different from ascertaining the correlations among different variables. Correlation tells you how the values of your variables co-vary; regression analysis aims at a stronger claim: describing how your dependent variable responds to changes in your independent variable. Correlation measures the strength of the relationship between variables, while regression attempts to describe that relationship. Either analysis may lead to what is called “spurious correlation,” where the co-variation of two variables suggests a causal relationship that does not exist or that runs in the other direction. For example, we might find a significant relationship between being a basketball player and being tall. Being a basketball player does not cause one to become taller; the causal direction is almost certainly the opposite. It is important to recognize that regression analysis cannot itself establish causation, only describe correlation. Causation is established through theory.

SPSS syntax: Analyze: correlate: bivariate:

Correlations

                                X            Y
X    Pearson Correlation        1        -.954(**)
     Sig. (2-tailed)             .          .001
     N                           7            7
Y    Pearson Correlation    -.954(**)        1
     Sig. (2-tailed)           .001            .
     N                           7            7

** Correlation is significant at the 0.01 level (2-tailed).
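For anyone working outside SPSS, the same Pearson correlation can be reproduced with a few lines of Python; scipy is an assumption about tooling, not part of the workshop's SPSS workflow, and the x and y values are the housing-price data introduced later in the handout.

# Minimal sketch: reproducing the Pearson correlation above in Python.
# Assumes scipy is installed; x and y are the housing data from Section II below.
from scipy.stats import pearsonr

x = [160, 180, 200, 220, 240, 260, 280]   # price (thousands of $)
y = [126, 103, 82, 75, 82, 40, 20]        # sales of new homes

r, p = pearsonr(x, y)
print(round(r, 3), round(p, 3))           # roughly -0.954 and 0.001, matching the table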

II.Before We Get Started: The Basics

A.Your variables may take several forms, and it will be important later that you are aware of, and understand, the nature of your variables. The following variables are those which you are most likely to encounter in your research.

  • Categorical variables

Such variables include any sort of measure that is “qualitative” or otherwise not amenable to direct quantification. There are a few subclasses of such variables.

Dummy variables take only two possible values, 0 and 1. They signify conceptual opposites: war vs. peace, fixed exchange rate vs. floating exchange rate, etc.

Nominal variables can take any number of values, typically coded as non-negative integers. They signify conceptual categories that have no inherent ordering relative to one another: red vs. green vs. black, Christian vs. Jewish vs. Muslim, etc.

Ordinal variables are like nominal variables, only there is an ordered relationship among them: no vs. maybe vs. yes, etc.

  • Numerical variables

Such variables describe data that can be readily quantified. Like categorical variables, there are a few relevant subclasses of numerical variables.

Continuous variables can take fractional values; in principle, they can take infinitely many values within a range. Examples include temperature, GDP, etc.

Discrete variables can only take the form of whole numbers. Most often, these appear as count variables, signifying the number of times that something occurred: the number of firms invested in a country, the number of hate crimes committed in a county, etc.

When you begin a statistical analysis of your data, a useful starting point is to get a handle on your variables. Are they qualitative or quantitative? If they are the latter, are they discrete or continuous? Another useful practice is to ascertain how your data are distributed. Do your variables all cluster around the same value, or do you have a large amount of variation in your variables? Are they normally distributed, or not?
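One way to get that first handle on your variables, if you happen to be working in Python rather than SPSS, is sketched below; pandas is an assumption about tooling, and the file name and the "region" column are purely hypothetical.

# Sketch: a first look at variable types and distributions before running any regression.
# Assumes pandas is available; "mydata.csv" and the "region" column are hypothetical.
import pandas as pd

df = pd.read_csv("mydata.csv")
print(df.dtypes)                     # which variables are numeric and which are categorical
print(df.describe())                 # means, standard deviations, and ranges of numeric variables
print(df["region"].value_counts())   # frequency counts for a categorical variable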

B.We are only going to deal with the linear regression model

The simple (or bivariate) LRM is designed to study the relationship between a pair of variables that appear in a data set. The multiple LRM is designed to study the relationship between one variable and several other variables.

In both cases, the sample is considered a random sample from some population. The two variables, X and Y, are two measured outcomes for each observation in the data set. For example, let's say that we had data on the prices of homes on sale and the number of new homes actually sold:

Price (thousands of $)    Sales of new homes
         X                         Y
        160                       126
        180                       103
        200                        82
        220                        75
        240                        82
        260                        40
        280                        20

And we want to know the relationship between X and Y. Well, what do our data look like?

SPSS syntax: graph: scatter: simple: enter x and y
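Equivalently, outside SPSS, a minimal sketch of the same scatterplot in Python; matplotlib is an assumption, and any plotting tool would do.

# Sketch: scatterplot of price against sales of new homes.
# Assumes matplotlib is installed.
import matplotlib.pyplot as plt

x = [160, 180, 200, 220, 240, 260, 280]   # price (thousands of $)
y = [126, 103, 82, 75, 82, 40, 20]        # sales of new homes

plt.scatter(x, y)
plt.xlabel("Price (thousands of $)")
plt.ylabel("Sales of new homes")
plt.show()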

We need to specify the population regression function (PRF), the model we will use to study the relationship between X and Y.

This is written in any number of ways, but we will specify it as:

Yi = β0 + β1Xi + ui

where

  • Y is an observed random variable (also called the endogenous variable or the left-hand-side variable).
  • X is an observed non-random or conditioning variable (also called the exogenous or right-hand-side variable).
  • β0 is an unknown population parameter, known as the constant or intercept term.
  • β1 is an unknown population parameter, known as the coefficient or slope parameter.
  • u is an unobserved random variable, known as the disturbance or error term.

Once we have specified our model, we can accomplish 2 things:

  • Estimation: How do we get "good" estimates of β0 and β1? What assumptions about the PRF make a given estimator a good one?
  • Inference: What can we infer about β0 and β1 from sample information? That is, how do we form confidence intervals for β0 and β1 and/or test hypotheses about them?

The answer to these questions depends upon the assumptions that the linear regression model makes about the variables.

The Ordinary Least Squares (OLS) regression procedure will compute the values of the parameters β0 and β1 (the intercept and slope) that best fit the observations.

We want to fit a straight line through the data from our example above.

Obviously, no straight line can exactly run through all of the points. The vertical distance between each observation and the line that fits “best”—the regression line—is called the error. The OLS procedure calculates our parameter values by minimizing the sum of the squared errors for all observations.
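To make "minimizing the sum of the squared errors" concrete, here is a quick numerical illustration, using Python simply as a calculator; the intercept and slope 249.857 and -0.793 are the OLS estimates derived later in the handout, and the alternative lines are made-up candidates for comparison.

# Sketch: the OLS line has a smaller sum of squared errors than other candidate lines.
import numpy as np

x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

def sum_of_squared_errors(b0, b1):
    errors = y - (b0 + b1 * x)        # vertical distances from the candidate line
    return np.sum(errors ** 2)

print(sum_of_squared_errors(249.857, -0.793))   # the OLS line: roughly 691
print(sum_of_squared_errors(260.0, -0.85))      # a nearby candidate line: larger SSE
print(sum_of_squared_errors(200.0, -0.60))      # another candidate: larger still

No other choice of intercept and slope produces a smaller sum of squared errors than the OLS values.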

Why OLS? When its assumptions are met, it is the best linear unbiased estimator of linear relationships between variables (this is the Gauss-Markov Theorem, discussed below). It can be used in a variety of settings, but only when the following assumptions hold:

C.Assumptions of the linear regression model (The Gauss-Markov Theorem)

The Gauss-Markov Theorem is essentially a claim about the ability of regression to assess the relationship between a dependent variable and one or more independent variables. The Gauss-Markov Theorem, however, requires that for all Yi, Xi, the following conditions are met:

1)The conditional expectation of Y is an unchanging linear function of known independent variables. That is, Y is generated through the following process:

Yi = β0 + β1Xi1 + β2Xi2 + ... + βkXik + εi

In the regression model, the dependent variable is assumed to be a linear function of one or more independent variables plus an error term introduced to account for all other factors. In the regression equation specified above, Yi is the dependent variable, Xi1, ..., Xik are the independent or explanatory variables, and εi is the disturbance or error term. The goal of regression analysis is to obtain estimates of the unknown parameters β1, ..., βk, which indicate how a change in one of the independent variables affects the values taken by the dependent variable. Note that the model also assumes that the relationships between the dependent variable and the independent variables are linear.

Examples of violations: non-linear relationship between variables, including the wrong variables

2)All X’s are fixed in repeated samples

The Gauss-Markov Theorem also assumes that the independent variables are non-random. In an experiment, the values of the independent variable would be fixed by the experimenter and repeated samples could be drawn with the independent variables fixed at the same values in each sample. As a consequence of this assumption, the independent variables will in fact be independent of the disturbance. For non-experimental work, this will need to be assumed directly along with the assumption that the independent variables have finite variances.

Examples of violations: endogeneity, measurement error, autoregression

3)The expected value of the disturbance term is zero.

The disturbance terms in the linear model above must also satisfy some special criteria. First, the Gauss-Markov Theorem assumes that the expected value of the disturbance term is zero. This means that on average, the errors balance out.

Examples of violations: an omitted variable that systematically determines the dependent variable, so that the disturbances do not average to zero (see Section IV)

4)Disturbances have uniform variance and are uncorrelated

Var(εi) = σ2 for all i, and Cov(εi, εj) = 0 for all i ≠ j

The Gauss-Markov Theorem further assumes that the variance of the error term is a constant for all observations and in all time periods. Formally, this assumption implies that the error is homoskedastic. If the variance of the error term is not constant, then the error terms are heteroskedastic. Finally, the Gauss-Markov Theorem assumes that the error terms are uncorrelated with one another. More specifically, it assumes that the values of the error term for different observations, or at different time periods, are independent of each other.

Examples of violations: heteroskedasticity, serial correlation of error terms

5)No exact linear relationship between independent variables and more observations than independent variables

|corr(xi, xj)| < 1

n ≥ k + 1

The independent variables must be linearly independent of one another. That is, no independent variable can be expressed as a non-zero linear combination of the remaining independent variables. There also must be more observations than there are independent variables in order to ensure that there are enough degrees of freedom for the model to be identified.

Examples of violations: multicollinearity, micronumerosity

If the five Gauss-Markov assumptions listed above are met, then the Gauss-Markov Theorem states that the Ordinary Least Squares regression estimator bi is the Best Linear Unbiased Estimator of βi. (OLS is BLUE.) The OLS estimates are unbiased and have the smallest variance in the class of linear unbiased estimators. The formula for the OLS estimator of βi is as follows.

Scalar form (bivariate case):

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)2,   b0 = ȳ - b1x̄

Matrix form:

b = (X'X)^-1 X'y

The point of the regression equation is to find the best fitting line relating the variables to one another. In this enterprise, we wish to minimize the sum of the squared deviations (residuals) from this line. OLS will do this better than any other process as long as these conditions are met.
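As a check on the matrix form, the sketch below builds the X matrix with a leading column of ones (for the intercept) and computes b = (X'X)^-1 X'y on the housing data; the use of Python/numpy is an assumption, not part of the handout's SPSS workflow.

# Sketch: the matrix-form OLS estimator b = (X'X)^(-1) X'y, applied to the housing data.
# Assumes numpy; the first column of ones corresponds to the intercept.
import numpy as np

x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

X = np.column_stack([np.ones_like(x), x])
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(b)    # roughly [249.857, -0.793], matching the SPSS coefficients below

In practice one would use np.linalg.lstsq or a regression library rather than inverting X'X directly, but the line above mirrors the formula.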

III.Now, Regression Itself

A.So, now that we know the assumptions of the OLS model, how do we estimate β0 and β1?

The Ordinary Least Squares estimates of β0 and β1 are defined as the particular values b0 and b1 that minimize the sum of squared errors for the sample data.

The best fit line associated with the n points (x1, y1), (x2, y2), . . . , (xn, yn) has the form

y = mx + b

where

m = (nΣxy - ΣxΣy) / (nΣx2 - (Σx)2)   and   b = (Σy - mΣx) / n

So we can take our data from above and substitute in to find our parameters:

Price (thousands of $)   Sales of new homes
         x                       y               xy            x2
        160                     126            20,160        25,600
        180                     103            18,540        32,400
        200                      82            16,400        40,000
        220                      75            16,500        48,400
        240                      82            19,680        57,600
        260                      40            10,400        67,600
        280                      20             5,600        78,400
 Sum  1,540                     528           107,280       350,000
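Substituting the column sums into the formulas for m and b above (a worked check, using Python simply as a calculator):

# Worked check: plug the column sums into the least-squares formulas for m and b.
n = 7
sum_x, sum_y = 1540, 528
sum_xy, sum_x2 = 107280, 350000

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # = -62160 / 78400
b = (sum_y - m * sum_x) / n
print(m, b)    # roughly -0.7929 and 249.857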

And now in SPSS: Analyze: Regression: Linear: Statistics: Confidence intervals; Dependent Variable: Y; Independent Variable(s): X:

Model Summary

Model      R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .954(a)       .911            .893                  11.75706

a Predictors: (Constant), X

Coefficients(a)

Model                    B       Std. Error    Beta        t       Sig.    95% CI Lower   95% CI Upper
1  (Constant)        249.857       24.841                10.058    .000       186.000        313.714
   Price of house      -.793         .111      -.954     -7.137    .001        -1.078          -.507

(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

a Dependent Variable: # houses sold

Thus our least squares line is

y = 249.857 - 0.79286x

Interpreting data:

Let’s take a look at the regression diagnostics from our example above:

Explaining the coefficient: for every one-unit ($1,000) increase in the price of a house, 0.793 fewer houses are sold. This does not make perfect intuitive sense here, because you cannot sell 0.793 of a house, but we could imagine this interpretation holding for a continuous dependent variable.

But, how do we know that our coefficient estimate is meaningful? We can do a test of statistical significance—usually we want to know if our coefficient estimate is statistically different from zero: this is called a t-test.

Formally, we say:

H0: β(price of a house) = 0. In other words, the price of a house has no effect on house sales.

What we hope to do is reject this null hypothesis.

Explaining the t-statistic: our t-statistic is simply our coefficient estimate divided by its standard error. The next step is to compare this statistic to its critical value, which can be found in a table of t-statistics in any statistics textbook; for a large enough sample, most researchers use the 95% confidence level, with an associated critical value of 1.96. Thus, for any t-statistic whose absolute value is below 1.96, we cannot reject the null hypothesis that the coefficient is equal to 0. Note: we never say that we accept a hypothesis; we can only reject or fail to reject hypotheses about coefficient estimates.
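A sketch reproducing this calculation for the price coefficient (scipy is an assumption; for our small bivariate sample the degrees of freedom are n - 2 = 5, so the exact critical value is larger than 1.96, but the conclusion is the same):

# Sketch: t-statistic and two-tailed p-value for the price coefficient.
# Assumes scipy; degrees of freedom = n - 2 = 5 for the bivariate model.
from scipy.stats import t

coefficient = -0.793
std_error = 0.111
t_stat = coefficient / std_error              # roughly -7.14 (SPSS reports -7.137)
p_value = 2 * t.sf(abs(t_stat), df=5)         # two-tailed p-value, just under .001
print(round(t_stat, 3), round(p_value, 4))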

p-value (in SPSS, the Sig. column): the probability of obtaining a t-statistic at least as large in absolute value as the one observed if the null hypothesis were true. A p-value below .05 corresponds to rejecting the null hypothesis at the 95% confidence level; here the p-value of .001 allows us to reject it.

Confidence interval: the 95% confidence interval gives the range of coefficient values consistent with the data. If it does not contain zero, the coefficient is statistically significant at the .05 level; here the interval for the price coefficient runs from -1.078 to -.507 and excludes zero.

R-squared:

Finally, we want to look at the R-squared statistic from our model summary above. R-squared measures the proportion of the variation in the dependent variable that is explained by the regression.

Formally,

R2 = ESS / TSS = 1 - RSS / TSS

where TSS is the total sum of squares (the variation of Y around its mean), ESS is the explained sum of squares, and RSS is the residual sum of squares. In our example, R2 = .911: variation in price explains about 91 percent of the variation in house sales.
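A sketch of the same calculation on our data (numpy is an assumption; the fitted values come from the least squares line above):

# Sketch: R-squared = 1 - (residual sum of squares / total sum of squares).
# Assumes numpy; fitted values come from the least squares line y = 249.857 - 0.79286x.
import numpy as np

x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

y_hat = 249.857 - 0.79286 * x
rss = np.sum((y - y_hat) ** 2)        # residual (unexplained) sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
print(1 - rss / tss)                  # roughly 0.911, matching the model summary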

Now let's do it again for a multiple regression, where we add the number of red cars in the neighborhood as a second predictor of housing sales:

Model Summary

Model      R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .958(a)       .917            .876                  12.65044

a Predictors: (Constant), PRICE, REDCARS

Coefficients(a)

Model                    B       Std. Error    Beta        t       Sig.    95% CI Lower   95% CI Upper
1  (Constant)        223.157       54.323                 4.108    .015        72.332        373.983
   Price of house      -.708         .191      -.853     -3.703    .021        -1.240          -.177
   Red cars             .376         .666       .130       .565    .603        -1.474          2.226

(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

a Dependent Variable: # houses sold

Explaining coefficients:

This time, the estimate cannot be explained in exactly the same manner.

Here we would say: controlling for the effect of red cars, a one-unit increase in the price of houses is associated with a decrease of 0.708 units in housing sales.

Let's look at our t-statistic on red cars: at .565 (Sig. = .603), it is well below the critical value of 1.96, so we cannot reject the null hypothesis that its coefficient is zero. What does that tell us?

So, should we take it out of our model?

Again, let's think about p-values (in SPSS, Sig.), confidence intervals, and R-squared.

What happened to our R-squared in this case? It increased to .917. This is good, right? We explained more of the variation in Y by adding this variable. NO. R-squared can never decrease when a variable is added, even a useless one. The adjusted R-squared, which penalizes additional regressors, actually fell from .893 to .876, so adding red cars did not improve the model.
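As a check, the adjusted R-squared values reported in the two model summaries can be reproduced from the standard formula (a sketch; n is the number of observations, k the number of independent variables):

# Sketch: adjusted R-squared = 1 - (1 - R^2)(n - 1)/(n - k - 1); it penalizes extra regressors.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.911, 7, 1))   # roughly .893, the bivariate model
print(adjusted_r2(0.917, 7, 2))   # roughly .876, the model with red cars added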

B.Interpreting log models

The log-log model:

lnYi = B1 + B2 lnXi + ui

The slope coefficient B2 measures the elasticity of Y with respect to X, that is, the percentage change in Y for a given percentage change in X.

The model assumes that the elasticity coefficient between Y and X, B2, remains constant throughout: the change in lnY per unit change in lnX (the elasticity B2) is the same no matter where along lnX we measure it.


The log-linear model:

lnYi = B1 + B2Xi + ui

B2 measures the constant proportional (relative) change in Y for a given absolute change in the value of the regressor:

B2 = relative change in regressand / absolute change in regressor

If we multiply the relative change in Y by 100, we then get the percentage change in Y for a unit (absolute) change in X, the regressor.


The lin-log model:

Yi = B1 + B2 lnXi + ui

Now we are interested in finding the absolute change in Y for a percentage change in X:

B2 = absolute change in Y / relative change in X

The absolute change in Y equals B2 times the relative change in X; since a one percent change in X is a relative change of 0.01, a one percent change in X produces an absolute change of B2/100 in Y.
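A sketch of the log-log case in Python (numpy is an assumption, and the data are made up purely for illustration): regressing lnY on lnX makes the slope directly interpretable as the elasticity B2.

# Sketch: in a log-log model, the slope of ln(y) on ln(x) is the elasticity B2.
# Assumes numpy; the data are hypothetical, constructed so the true elasticity is -0.5.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3.0 * x ** -0.5

B2, B1 = np.polyfit(np.log(x), np.log(y), 1)   # slope first, then intercept
print(B2)    # roughly -0.5: a 1% increase in X is associated with about a 0.5% decrease in Y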

IV.When Linear Regression Does Not Work

A.Violations of the Gauss-Markov assumptions

Some violations of the Gauss-Markov assumptions are more serious problems for linear regression than others. Consider an instance where you have more independent variables than observations; in that case, in order to run linear regression, you simply must gather more observations. Similarly, in a case where two of your variables are very highly correlated (say, GDP per capita and GDP), you may omit one of them from your regression equation. If the expected value of your disturbance term is not zero, then there is another variable that is systematically determining your dependent variable, and it should be identified and included in the model if possible. Finally, your theory may indicate that you need a number of independent variables to which you do not have access. In this case, to run the linear regression, you must either find alternative measures of your independent variables or find another way to investigate your research question.