Summary of Simple Regression

I.  An example: Energy Usage and Real GDP

Sometimes we are interested in the relation between two variables only, say X and Y. Naturally, this does not happen often. Most relationships in economics involve many variables, not just two. At times, however, we may be confident that the relation between two variables is close and nearly exclusive, and we may need a precise quantitative expression for this relation. If the functional relation between the two variables is stable and nearly linear, then we can use simple linear regression to approximate and estimate it.

Consider total energy consumption (business and household) and real GDP for the US economy. Most people would feel that there is certainly a strong relation between these two variables. For example, the arguments over man-made global warming, along with the controversy about how such warming may be lessened, are ultimately tied to the relation between output and energy usage. Naturally, it may be argued that the relation, if it exists, is not stable, due to changing prices, capital accumulation, technology, and conservation. However, over a short period, say five years, these factors may not be changing so quickly. A reasonable and useful estimate of the simple two-variable relation may still be possible.

Regressing quarterly data for the log of energy usage (E) on the log of real GDP (Y) gives the following estimated equation, after some adjustment due to seasonal factors[1]:

ln Et = β̂0 + 0.25 ln Yt

where the sample involves 22 observations from the period 2001:1 ~ 2006:2.

The estimated regression shows that the income elasticity of energy usage is 0.25. Thus, a 10% increase in real income will raise the quantity of energy used by 2.5%. One reason for the small size of this elasticity is that the growth in US income during the early 21st century came from new and greener industries, rather than from heavy industries that require large amounts of energy. The same cannot be said for China and India during this time. Also, the years chosen in the sample were years first of economic recession and then of economic recovery. Much of the increase in output could therefore have been due to formerly unemployed workers returning to work, so less energy would have been required for output to increase. It is important to look at the sample and see if there are plausible reasons for the results which the regression gives. The sample period is especially short, and therefore the estimates from the regression are less precise. However, the short sample also frees us from worrying about massive structural changes in the economy, which might make the regression unstable and not very representative. Sometimes a smaller amount of data can be a good thing.
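
To make the mechanics concrete, here is a minimal sketch of such a log-log regression in Python. The actual energy and GDP series are not reproduced in these notes, so the numbers below are simulated stand-ins; only the procedure (regress ln E on ln Y and read the slope as an elasticity) carries over.

```python
# A minimal sketch of the regression above, with simulated stand-ins for
# the actual (not reproduced) energy and real GDP series.
import numpy as np

rng = np.random.default_rng(0)

N = 22                                  # 2001:1 through 2006:2, quarterly
log_Y = np.linspace(9.30, 9.45, N)      # hypothetical log real GDP
log_E = 3.0 + 0.25 * log_Y + rng.normal(0, 0.005, N)  # hypothetical log energy

# OLS slope and intercept via the standard moment formulas
dY = log_Y - log_Y.mean()
b1 = np.sum(dY * (log_E - log_E.mean())) / np.sum(dY ** 2)
b0 = log_E.mean() - b1 * log_Y.mean()

print(f"estimated income elasticity of energy usage: {b1:.2f}")
# In a log-log regression the slope is the elasticity: a 10% rise in
# real income raises energy usage by roughly 10 * b1 percent.
```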

The point of this exercise is to show that we can use simple regression to get a quick and rough answer to an important question. We were wondering how real output is related to total energy usage in the economy. The Kyoto Protocol, along with other efforts to rein in (what some say is) man-made global warming, is part of a movement to reduce greenhouse gas emissions by limiting economic activity. Supporters of these international accords see the problem as follows:

Y → E → GHG → GW

where Y = output, E = energy usage, GHG = greenhouse gases (mainly CO2) and GW = global warming. The small regression above verifies the relation between output and energy and provides us with a quantitative measure of how much energy usage is produced by increases in output. The second link in the causal chain above is much more complex. The reason for this is simple: there are many ways in which GHGs are created, and some of these have nothing to do with mankind. It follows that we cannot even begin to think of running a simple regression of E on GHGs, even if we had very good data. The last link in the chain is even more controversial, because the relation of GHGs to GW can only be shown using complicated nonlinear simulation models. Most scientists believe that GHGs are responsible for GW because all other explanations are worse. Those who say that warming is due to changes in the sun must contend with the fact that US satellite data have shown no discernible changes in the amount of radiation put out by the sun. There are of course alternative theories being advanced every day, but none of these have been persuasive enough to alter the scientific consensus. It should be pointed out that GW also causes GHGs through a number of different channels, making it even more difficult to pin down what is actually happening.

Now that we have seen an example of simple regression, let’s look at the theory.

II.  Simple Regression: The Model and Assumptions

All research begins with some kind of theoretical model relating the variables together in some systematic and stable way. Suppose that we have some variable Y which we feel is closely related to another variable X. We further assume that this relation is stable. We may proceed to write this in a very general way as

Y = f(X)

When we assume this general function, we are not saying that X is the only thing affecting Y. We are merely saying that it is the predominant factor and that all other factors may be safely ignored or subsumed. This is seldom the case in real life. Usually, there are numerous factors affecting Y and our regression is trying to assess the relative contribution of each factor to the determination of Y. In regression there is always the danger that we have omitted an important variable from the model.

Next, we can linearize this function as

Y = β0 + β1X

and hope that we have not done too much damage by choosing a linear function instead of a nonlinear one. In some cases we might want to take the natural logarithms of the variables first. Thus our function might instead be

ln Y = β0 + β1 ln X

which again is linear in the parameters β0 and β1. Linear regression always refers to linearity in the coefficients or parameters (the β's) – not the variables or data (Y and X). Sometimes we take the log of only one of the variables in the regression.

However, if we choose the following regression, we are no longer considering linear regression

Y = β0 + β1X^β2

since the presence of β2 in the exponent makes the regression nonlinear in the parameters (i.e., the β's). Sometimes an X-Y scatterplot can help one decide whether the variables need to be transformed first or whether a nonlinear regression is perhaps appropriate.
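
As a sketch of that diagnostic, the snippet below (with simulated data, since none are given in the text) plots Y against X in levels and in logs; whichever panel looks closer to a straight line suggests the appropriate transformation.

```python
# A sketch of the scatterplot check, using simulated data: a curved
# cloud in levels that straightens out in logs suggests taking logs
# before running a linear regression.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 100)
Y = 2.0 * X ** 0.5 * np.exp(rng.normal(0, 0.05, 100))  # curved relation

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(X, Y, s=10)
ax1.set(xlabel="X", ylabel="Y", title="levels: curved")
ax2.scatter(np.log(X), np.log(Y), s=10)
ax2.set(xlabel="ln X", ylabel="ln Y", title="logs: nearly linear")
fig.tight_layout()
plt.show()
```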

Next, because we have linearized the function and because we have possibly left out other variables from the equation, we include an error term (ε), which is a random variable having a particular pdf[2]. We write the regression equation as

Y = β0 + β1X + ε

where the variables may have already been transformed by taking logs.

Now suppose that we observe both Y and X a total of N times. This means that the number of observations is N. We can write the regression equation as follows

Yt = β0 + β1Xt + εt

where t = 1,2,…,N. This is our empirical regression model, and it is our intention to estimate β0 and β1 using the data Yt and Xt for t = 1,2,…,N.

We can put our data in the form of a table:

t      Xt      Yt
1      X1      Y1
2      X2      Y2
3      X3      Y3
.      .       .
.      .       .
.      .       .
N      XN      YN

If the symbol “t” in the table above represents time, then the order of the data is very important. Time series data require special treatment in regression theory, which we will discuss later. If the data are cross-sectional, then the order of the data is not important.

Note that we only have the data and the hypothesized model. The parameters in the model (β’s) are all unknown and unknowable. We will never know the true values of the parameters, but we can use the data to estimate them. We will be particularly interested in the signs of the parameters – positive or negative. Our theory often gives us a qualitative guide to what these signs should be. For example, our theory may tell us that Y and X have a negative relation. We would expect that the slope coefficient β1 is negative. We can then check our estimated value of β1 to see if it is in fact negative. Such exercises influence our confidence in a particular model.

We next make certain key assumptions about the random error term, ε.

I. The first assumption is that E[εt] = 0 for all t. Combined with the model, this says that E[Yt] = β0 + β1Xt, so the ONLY changes in E[Y] come from changes in the model’s variable X. Another way to say this is that the parameters (β’s) are constant. So, our first assumption is really asserting that there is no structural change in the model. The first assumption is usually thought to be rather unimportant, but in fact it is an essential assumption in the model.

II. The second assumption is that var[εt] = σ2, constant regardless of t. This assumption does not relate directly to estimation per se, but it is critical when one tests hypotheses about the parameters in the model. The size of σ2 determines how precise our estimators are, not what their expected value will be. A large σ2 will result in estimators which have large variances in repeated samples. Another way of saying this is that a large σ2 means our model will not explain much of the variation in Y about its mean: the systematic part β0 + β1X accounts for little of Y’s movement, while ε accounts for a great deal. This makes it difficult to estimate the model with precision. On average we will do well, but if we estimate the model using several data sets, we will find that our estimates vary greatly across the different samples. Note that if assumption two is not true, then var[εt] changes with t. This opens a whole new problem of trying to determine what affects the variance of ε. This is the problem of heteroskedasticity, which we will discuss later.
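
The repeated-samples point can be illustrated with a small simulation (all numbers here are made up for illustration): the slope estimator centers on the true value either way, but its spread across samples grows with σ2.

```python
# Illustrative simulation: re-estimate the slope on many artificial
# samples and compare the spread of the estimates for a small versus a
# large error standard deviation.
import numpy as np

rng = np.random.default_rng(2)
N, reps = 50, 2000
X = rng.uniform(0, 10, N)
beta0, beta1 = 1.0, 0.5
dX = X - X.mean()

def slope_estimates(sigma):
    """OLS slope across `reps` simulated samples with error sd `sigma`."""
    b1 = np.empty(reps)
    for r in range(reps):
        Y = beta0 + beta1 * X + rng.normal(0, sigma, N)
        b1[r] = np.sum(dX * (Y - Y.mean())) / np.sum(dX ** 2)
    return b1

for sigma in (0.5, 5.0):
    b1 = slope_estimates(sigma)
    print(f"sigma = {sigma}: mean = {b1.mean():.3f}, sd = {b1.std():.3f}")
# Both means sit near the true slope 0.5, but the spread of the
# estimates is roughly ten times larger when sigma is ten times larger.
```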

III. The third assumption asserts that cov(εt, εs) = 0 for t ≠ s. This assumption tells us that whatever we have left out of the model cannot consist of anything predictable over time. This would include omitted variables, movements to equilibrium, and structural changes. If the model is changing slowly, and we do not adapt our model to this change, the third assumption will not be satisfied. If an important variable, which is itself predictable, is omitted, it will appear in the error term and the third assumption will not be satisfied. If Y and X are moving towards some kind of long run equilibrium, and the regression equation represents this long run equilibrium, then the third assumption will not be satisfied. The third assumption almost never holds in econometrics when we use time series data. This problem, called autocorrelation, has led econometricians to find ways to detect it and incorporate it into the model.
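
One simple way to see an A3 violation in practice is to fit the model and inspect the first-order autocorrelation of the residuals. The sketch below builds the errors as an AR(1) process (an assumption for illustration), so A3 fails by design and the residual autocorrelation comes out far from zero.

```python
# Rough diagnostic for A3: fit the model, then compute the first-order
# sample autocorrelation of the residuals.
import numpy as np

rng = np.random.default_rng(3)
N = 100
X = rng.uniform(0, 10, N)

eps = np.zeros(N)
for t in range(1, N):
    eps[t] = 0.8 * eps[t - 1] + rng.normal(0, 1)  # autocorrelated errors
Y = 1.0 + 0.5 * X + eps

dX = X - X.mean()
b1 = np.sum(dX * (Y - Y.mean())) / np.sum(dX ** 2)
b0 = Y.mean() - b1 * X.mean()
resid = Y - b0 - b1 * X

r1 = np.corrcoef(resid[1:], resid[:-1])[0, 1]  # corr(e_t, e_{t-1})
print(f"first-order residual autocorrelation: {r1:.2f}")  # far from zero
```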

IV. The fourth and last assumption is that cov(εt, Xs) = 0 for all t and s. This assumption is crucial if we are to employ least squares as the method of estimation. Note that if X and ε are not correlated, then changing X will change Y, on average, by β1 times the change in X, and therefore it will be relatively easy to guess the size of β1. By contrast, if ε changes every time X changes, then it will be much harder to guess the size of β1. This is the intuition involved in assumption 4. Assumption 4 tells us that in estimating β1 we should extract from movements in X all possible information about Y’s movements. Whatever is left must not be predictable using movements in X, and therefore X and the residuals should not be correlated at all.
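
A small simulation makes the stakes of A4 visible. The setup is hypothetical: a common shock is added to both X and ε to induce correlation between them. When A4 holds the slope estimates center on the true β1; when it fails they do not.

```python
# Illustrative simulation of an A4 failure: a common shock enters both
# X and the error, so cov(X, eps) != 0 and the OLS slope is biased.
import numpy as np

rng = np.random.default_rng(4)
N, reps, beta1 = 200, 2000, 0.5

def mean_slope(load):
    """Average OLS slope when a common shock enters X and eps with weight `load`."""
    b1 = np.empty(reps)
    for r in range(reps):
        common = rng.normal(0, 1, N)             # shock shared by X and eps
        X = rng.uniform(0, 10, N) + load * common
        eps = rng.normal(0, 1, N) + load * common
        Y = 1.0 + beta1 * X + eps
        dX = X - X.mean()
        b1[r] = np.sum(dX * (Y - Y.mean())) / np.sum(dX ** 2)
    return b1.mean()

print(f"A4 holds    (load = 0): mean slope = {mean_slope(0.0):.3f}")  # ~ 0.5
print(f"A4 violated (load = 2): mean slope = {mean_slope(2.0):.3f}")  # biased up
```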

We can summarize the assumptions using the following list:

A1: E[εt] = 0

A2: var(εt) = σ2

A3: cov(εt, εs) = 0 for t ≠ s

A4: cov(εt, Xs) = 0 for all t and s

This list should be memorized, with A1 and A4 deserving special attention.

III.  Simple Regression: Estimation

Now that we have the model, assumptions A1~A4, and our data table, we may proceed to estimation. We use ordinary least squares to estimate the model. Given our assumptions, the same estimator can be derived using the method of moments, which requires no calculus.

Our model is

Yt = β0 + β1Xt + εt

for t = 1,2,…,N. Therefore, it seems perfectly reasonable that the estimated residuals must ultimately be defined as

ε̂t = Yt − β̂0 − β̂1Xt

where β̂0 and β̂1 are our estimates of β0 and β1.
Now, assumption A1 states that E[εt] = 0 for all t = 1,2,…,N. We can therefore apply this to our residuals by forcing

(1/N) Σt ε̂t = 0        (*)

Assumption A4 states that cov(εt, Xs) = 0 for all t and s. We can apply this to our residuals by forcing

(1/N) Σt Xt ε̂t = 0        (**)

Substituting the definition of the residuals into (*) results in

(1/N) Σt (Yt − β̂0 − β̂1Xt) = 0

which can be solved for the intercept estimate: β̂0 = Ȳ − β̂1X̄, where Ȳ and X̄ are the sample means of Y and X.
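
Together the two moment conditions pin down both estimates: (*) gives β̂0 = Ȳ − β̂1X̄, and substituting that into (**) gives β̂1 = Σ(Xt − X̄)(Yt − Ȳ) / Σ(Xt − X̄)². The sketch below (with simulated data, since none are supplied here) implements exactly these formulas and checks that both conditions hold for the fitted residuals.

```python
# A sketch of OLS via the two moment conditions (*) and (**), using
# simulated data for illustration; the formulas are the standard ones.
import numpy as np

rng = np.random.default_rng(5)
N = 100
X = rng.uniform(0, 10, N)
Y = 2.0 + 0.7 * X + rng.normal(0, 1, N)   # true beta0 = 2.0, beta1 = 0.7

# beta1_hat from (**) after substituting (*); beta0_hat from (*)
dX = X - X.mean()
b1 = np.sum(dX * (Y - Y.mean())) / np.sum(dX ** 2)
b0 = Y.mean() - b1 * X.mean()
resid = Y - b0 - b1 * X

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"(*)  mean residual:        {resid.mean():.2e}")       # ~ 0
print(f"(**) mean of X * residual: {(X * resid).mean():.2e}")  # ~ 0
```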