Yale University Social Science Statistical Laboratory
Introduction to Regression and Basic Data Analysis
A StatLab Workshop.
We are only going to deal with the linear regression model (LRM).
The simple (or bivariate) LRM is designed to study the relationship between a pair of variables that appear in a data set. The multiple LRM is designed to study the relationship between one variable and several other variables.
In both cases, the sample is considered a random sample from some population. The two variables, X and Y, are two measured outcomes for each observation in the data set. For example, let’s say that we had data on the prices of homes on sale and the actual number of sales of homes:
Price (thousands of $), x / Sales of new homes, y
160 / 126
180 / 103
200 / 82
220 / 75
240 / 82
260 / 40
280 / 20
And we want to know the relationship between X and Y. Well, what does our data look like?
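One quick way to look at the data is sketched below in Python (an assumption on our part; the workshop itself works in SPSS). It prints each observation with a crude text bar for sales and computes the sample correlation. The variable names `price` and `sales` are illustrative only.

```python
# Home-price data from the table above (price in thousands of dollars).
price = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]

n = len(price)
mean_x = sum(price) / n
mean_y = sum(sales) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in price)
syy = sum((y - mean_y) ** 2 for y in sales)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(price, sales))

# Pearson correlation: the sign gives the direction, the magnitude the strength.
r = sxy / (sxx * syy) ** 0.5

for x, y in zip(price, sales):
    print(f"{x:>5} {y:>5} " + "*" * (y // 5))  # one star per 5 homes sold
print(f"mean price = {mean_x:.1f}, mean sales = {mean_y:.2f}, r = {r:.3f}")
```

The bars shrink as price rises and the correlation is strongly negative, which suggests a downward-sloping, roughly linear relationship.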
We need to specify the population regression function (PRF), the model we use to study the relationship between X and Y.
This can be written in any number of ways, but we will specify it as:
$$Y = B_1 + B_2 X + u$$
where
- Y is an observed random variable (also called the endogenous variable, the left-hand side variable).
- X is an observed non-random or conditioning variable (also called the exogenous or right-hand side variable).
- $B_1$ is an unknown population parameter, known as the constant or intercept term.
- $B_2$ is an unknown population parameter, known as the coefficient or slope parameter.
- u is an unobserved random variable, known as the disturbance or error term.
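To make the roles of these pieces concrete, here is a minimal simulation sketch in Python. Everything numeric in it is hypothetical: the parameter values 250 and -0.8, the choice of a normal distribution with standard deviation 10 for u, and the helper name `draw_sample` are assumptions made for illustration, not part of the workshop.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

B1, B2 = 250.0, -0.8                      # hypothetical population parameters
X = [160, 180, 200, 220, 240, 260, 280]   # fixed, non-random regressor values

def draw_sample(sigma=10.0):
    """Draw one sample of Y from Y = B1 + B2*X + u, assuming u ~ Normal(0, sigma)."""
    u = [random.gauss(0.0, sigma) for _ in X]   # unobserved disturbances
    return [B1 + B2 * x + e for x, e in zip(X, u)]

y_first = draw_sample()
y_second = draw_sample()

# Same X and same parameters, yet different Y: only u changed between samples.
for x, a, b in zip(X, y_first, y_second):
    print(f"X={x}  Y(sample 1)={a:7.2f}  Y(sample 2)={b:7.2f}  B1+B2*X={B1 + B2 * x:7.2f}")
```

In practice we see only one such sample and never the parameters or u, which is why estimation and inference are needed.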
Once we have specified our model, we can accomplish two things:
- Estimation: How do we get "good" estimates of $B_1$ and $B_2$? What assumptions about the PRF make a given estimator a good one?
- Inference: What can we infer about $B_1$ and $B_2$ from sample information? That is, how do we form confidence intervals for $B_1$ and $B_2$ and/or test hypotheses about them?
The answer to these questions depends upon the assumptions that the linear regression model makes about the variables.
The Ordinary Least Squares (OLS) regression procedure will compute the values of the parameters $B_1$ and $B_2$ (the intercept and slope) that best fit the observations.
We want to fit a straight line through the data, from our example above, that would look like this:
Obviously, no straight line can run exactly through all of the points. The vertical distance between each observation and the line that fits “best” (the regression line) is called the error. The OLS procedure calculates our parameter values by minimizing the sum of the squared errors over all observations.
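The idea of minimizing the sum of squared errors can be checked numerically. The sketch below (Python, with illustrative names such as `sse`) computes the least squares line for the home data using the standard closed-form expressions for one regressor, then confirms that nudging the intercept or slope in either direction only increases the sum of squared errors.

```python
price = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]

def sse(intercept, slope):
    """Sum of squared vertical distances between the data and a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(price, sales))

# Closed-form least squares slope and intercept for a single regressor.
n = len(price)
mean_x = sum(price) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in price)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(price, sales))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

best = sse(intercept, slope)

# Four nearby candidate lines, each with one parameter shifted.
candidates = [
    (intercept + 5, slope),
    (intercept - 5, slope),
    (intercept, slope + 0.05),
    (intercept, slope - 0.05),
]
for a, b in candidates:
    print(f"intercept={a:8.3f} slope={b:8.5f} SSE={sse(a, b):10.2f}")
print(f"OLS line:  intercept={intercept:8.3f} slope={slope:8.5f} SSE={best:10.2f}")
```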
Why OLS? It is widely considered a reliable method of estimating linear relationships between economic variables. It can be used in a variety of settings, but its results can only be trusted when the following assumptions hold:
Insert Assumptions:
Regression (Best Fit) Line
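So, now that we know the assumptions of the OLS model, how do we estimate $B_1$ and $B_2$?
The Ordinary Least Squares estimates of $B_1$ and $B_2$ are defined as the particular values $\hat{B}_1$ and $\hat{B}_2$ of $B_1$ and $B_2$ that minimize the sum of squared errors for the sample data.
The best fit line associated with the n points (x1, y1), (x2, y2), . . . , (xn, yn) has the form
y = mx + b
where the slope m (the estimate of $B_2$) and the intercept b (the estimate of $B_1$) are
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \frac{\sum y - m\sum x}{n}$$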
So we can take our data from above and substitute in to find our parameters:
Price (thousands of $), x / Sales of new homes, y / xy / x^2
160 / 126 / 20,160 / 25,600
180 / 103 / 18,540 / 32,400
200 / 82 / 16,400 / 40,000
220 / 75 / 16,500 / 48,400
240 / 82 / 19,680 / 57,600
260 / 40 / 10,400 / 67,600
280 / 20 / 5,600 / 78,400
Sum / 1,540 / 528 / 107,280 / 350,000
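Substituting n = 7 and the column sums gives m = (7(107,280) - (1,540)(528)) / (7(350,000) - 1,540^2) = -62,160 / 78,400 ≈ -0.79286 and b = (528 - m(1,540)) / 7 ≈ 249.857.
Thus our least squares line is
y = -0.79286x + 249.857
The slope is negative: each additional $1,000 in price is associated with about 0.79 fewer homes sold.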
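The arithmetic can be reproduced in a few lines of Python (our choice of language; the workshop itself uses SPSS). The sketch plugs the column sums from the table into the usual closed-form expressions for the slope and intercept. The helper `predict` is a hypothetical name added for illustration. Note that the slope comes out negative.

```python
# Column sums taken from the table above.
n = 7
sum_x, sum_y, sum_xy, sum_x2 = 1540, 528, 107280, 350000

# Least squares slope and intercept from the sums.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

def predict(price):
    """Fitted number of sales at a given price (in thousands of dollars)."""
    return m * price + b

print(f"y = {m:.5f}x + {b:.3f}")   # prints: y = -0.79286x + 249.857
print(f"fitted sales at a price of 200: {predict(200):.1f}")
```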
Now let’s do this in a much easier manner in SPSS:
Interpreting data:
Let’s take a look at the regression diagnostics from our example above:
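A few of the quantities that regression output typically reports can be sketched by hand in Python. This is an illustration under the usual simple-regression formulas, not a reproduction of SPSS output. The critical value 2.571 is the two-sided 5% point of the t distribution with n - 2 = 5 degrees of freedom, hard-coded here because the standard library has no t distribution.

```python
import math

price = [160, 180, 200, 220, 240, 260, 280]
sales = [126, 103, 82, 75, 82, 40, 20]

n = len(price)
mean_x = sum(price) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in price)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(price, sales))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: observed minus fitted values.
residuals = [y - (intercept + slope * x) for x, y in zip(price, sales)]
sse = sum(e ** 2 for e in residuals)            # unexplained variation
sst = sum((y - mean_y) ** 2 for y in sales)     # total variation in y

r_squared = 1 - sse / sst                       # share of variation explained
s2 = sse / (n - 2)                              # estimated error variance
se_slope = math.sqrt(s2 / sxx)                  # standard error of the slope
t_slope = slope / se_slope                      # t statistic for H0: slope = 0

T_CRIT = 2.571  # t critical value, 5 degrees of freedom, 95% two-sided
ci = (slope - T_CRIT * se_slope, slope + T_CRIT * se_slope)

print(f"R^2 = {r_squared:.4f}")
print(f"slope = {slope:.5f}, s.e. = {se_slope:.5f}, t = {t_slope:.2f}")
print(f"95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the whole confidence interval lies below zero, the data are consistent with a negative relationship between price and sales at the 5% level, under the model's assumptions.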
- Interpreting log models
The log-log model:
$$\ln Y = B_1 + B_2 \ln X + u$$
The slope coefficient $B_2$ measures the elasticity of Y with respect to X, that is, the percentage change in Y for a given (small) percentage change in X.
The model assumes that the elasticity coefficient between Y and X, $B_2$, remains constant throughout: the change in ln Y per unit change in ln X is the same no matter at which ln X we measure it.
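A small Python sketch can illustrate the elasticity interpretation. The data are synthetic and noise-free, generated from Y = 3000 * X^(-1.5), so the true elasticity is -1.5 by construction. The numbers and the helper name `ols` are assumptions for illustration only.

```python
import math

def ols(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic data with constant elasticity -1.5.
xs = [10, 20, 40, 80, 160]
ys = [3000 * x ** -1.5 for x in xs]

# Regress ln Y on ln X; the slope estimates the elasticity.
b1, b2 = ols([math.log(x) for x in xs], [math.log(y) for y in ys])

# Direct check: raise X by 1% at X = 50 and measure the percentage change in Y.
y_before = 3000 * 50 ** -1.5
y_after = 3000 * (50 * 1.01) ** -1.5
pct_change = (y_after / y_before - 1) * 100

print(f"estimated elasticity b2 = {b2:.4f}")
print(f"actual % change in Y for a 1% rise in X = {pct_change:.4f}")
```

The two printed numbers are close but not identical, because the elasticity reading is exact only for very small percentage changes.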
The log-linear model:
$$\ln Y = B_1 + B_2 X + u$$
$B_2$ measures the constant proportional (relative) change in Y for a given absolute change in the value of the regressor:
$$B_2 = \frac{\text{relative change in regressand}}{\text{absolute change in regressor}}$$
If we multiply the relative change in Y by 100, we get the percentage change in Y for an absolute change in X, the regressor.
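The same kind of check works for the log-linear case. In this Python sketch the data are synthetic, generated from Y = 100 * exp(0.05 X), so the true slope is 0.05. All values and the helper name `ols` are assumptions for illustration.

```python
import math

def ols(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic data: Y grows by a constant proportion for each unit of X.
xs = [0, 1, 2, 3, 4, 5]
ys = [100 * math.exp(0.05 * x) for x in xs]

# Regress ln Y on X; the slope is the relative change in Y per unit of X.
b1, b2 = ols(xs, [math.log(y) for y in ys])

approx_pct = 100 * b2                    # reading from the slope
exact_pct = (math.exp(b2) - 1) * 100     # exact growth over one unit of X

print(f"b2 = {b2:.4f}")
print(f"approximate % change in Y per unit of X = {approx_pct:.3f}")
print(f"exact % change in Y per unit of X       = {exact_pct:.3f}")
```

Multiplying the slope by 100 gives a close approximation for small slopes; the exact percentage change over a full unit of X is slightly larger.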
The lin-log model:
$$Y = B_1 + B_2 \ln X + u$$
Now we are interested in finding the absolute change in Y for a percentage change in X:
$$B_2 = \frac{\text{change in } Y}{\text{relative change in } X}$$
The absolute change in Y is equal to $B_2$ times the relative change in X. Equivalently, a 1 percent change in X (a relative change of 0.01) changes Y by about $0.01 B_2$ units.
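A short Python sketch of the lin-log reading follows. The data are synthetic, generated from Y = 10 + 200 ln X, so the true slope is 200. These values and the helper name `ols` are assumptions for illustration.

```python
import math

def ols(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic data: Y is linear in ln X.
xs = [10, 20, 40, 80, 160]
ys = [10 + 200 * math.log(x) for x in xs]

# Regress Y on ln X.
b1, b2 = ols([math.log(x) for x in xs], ys)

# Absolute change in Y when X rises by 1% (from 50 to 50.5).
exact_change = (10 + 200 * math.log(50 * 1.01)) - (10 + 200 * math.log(50))
approx_change = 0.01 * b2

print(f"b2 = {b2:.3f}")
print(f"approximate change in Y for a 1% rise in X = {approx_change:.4f}")
print(f"exact change in Y for a 1% rise in X       = {exact_change:.4f}")
```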