4 Regressions with two explanatory variables.
This chapter contains the following sections:
(4.1) Introduction.
(4.2) The gross versus the partial effect of an explanatory variable.
(4.3) $R^2$ and adjusted $R^2$.
(4.4) Hypothesis testing in multiple regression.
4.1 Introduction
In the preceding sections we have studied in some detail regressions containing only one explanatory variable. In econometrics, with its multitude of dependencies, the simple regression can only be a showcase. As such it is important and powerful, since the methods applied to this case carry over to multiple regression with only small and obvious modifications. However, some important new problems arise, which we point out in this chapter.
As usual we start with the linear form:
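(4.1.1) $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i, \quad i = 1, \dots, N$,
where we again assume that the explanatory variables $X_1$ and $X_2$ are deterministic and also that the random disturbances $\varepsilon_i$ have the standard properties:
(4.1.2) $E(\varepsilon_i) = 0$
(4.1.3) $\mathrm{Var}(\varepsilon_i) = \sigma^2$
(4.1.4) $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
The coefficient $\beta_0$ is the intercept, and $\beta_1$ is the slope coefficient of $X_1$, showing the effect on $Y$ of a unit change in $X_1$, holding $X_2$ constant or controlling for $X_2$. Another phrase frequently used is that $\beta_1$ is the partial effect on $Y$ of $X_1$, holding $X_2$ fixed. The interpretation of $\beta_2$ is similar, except that $X_1$ and $X_2$ change roles.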
Since the random disturbances are homoskedastic (i.e. they have the same variance), the scene is set for least squares regression. Hence, for arbitrary values of the structural parameters $\beta_0$, $\beta_1$ and $\beta_2$, the sum of squared residuals is given by:
(4.1.5) $Q(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{N} (Y_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i})^2$
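Minimizing this sum of squared residuals with respect to $\beta_0$, $\beta_1$ and $\beta_2$ gives the OLS estimators $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$. This is a simple optimization problem and we can state the OLS estimators directly:
(4.1.6) $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X_1 - \hat\beta_2 \bar X_2$
(4.1.7) $\hat\beta_1 = \dfrac{S_{22} S_{1Y} - S_{12} S_{2Y}}{D}$
(4.1.8) $\hat\beta_2 = \dfrac{S_{11} S_{2Y} - S_{12} S_{1Y}}{D}$
(4.1.9) $D = S_{11} S_{22} - S_{12}^2$
In these formulas we have used the common notation:
(4.1.10) $\bar Y = \frac{1}{N} \sum_{i=1}^{N} Y_i$, $\quad \bar X_j = \frac{1}{N} \sum_{i=1}^{N} X_{ji}$
We have also used the useful formulas:
(4.1.11) $S_{jY} = \sum_{i=1}^{N} (X_{ji} - \bar X_j)(Y_i - \bar Y) = \sum_{i=1}^{N} (X_{ji} - \bar X_j) Y_i$
(4.1.12) $S_{jk} = \sum_{i=1}^{N} (X_{ji} - \bar X_j)(X_{ki} - \bar X_k) = \sum_{i=1}^{N} (X_{ji} - \bar X_j) X_{ki}$
where $j, k = 1, 2$.
You should check for yourself that these formulas hold.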
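One way to check the closed-form estimators numerically is sketched below. It is only an illustration under stated assumptions: the data are simulated, the true coefficients (1.0, 2.0, -0.5) are arbitrary, and the helper `S` is a hypothetical name for the sum of cross products of deviations from the sample means. The closed-form solution for two regressors is compared with a generic least-squares solver.

```python
import numpy as np

# Simulated data (an assumption for illustration, not taken from the text).
rng = np.random.default_rng(0)
N = 200
X1 = rng.normal(size=N)
X2 = 0.6 * X1 + rng.normal(size=N)          # correlated with X1 on purpose
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(size=N)

def S(a, b):
    """Sum of cross products of deviations from the sample means."""
    return np.sum((a - a.mean()) * (b - b.mean()))

# Closed-form OLS estimators for the two-regressor model.
D = S(X1, X1) * S(X2, X2) - S(X1, X2) ** 2
b1 = (S(X2, X2) * S(X1, Y) - S(X1, X2) * S(X2, Y)) / D
b2 = (S(X1, X1) * S(X2, Y) - S(X1, X2) * S(X1, Y)) / D
b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()
closed_form = np.array([b0, b1, b2])

# Generic least-squares solution for comparison.
X = np.column_stack([np.ones(N), X1, X2])
lstsq_solution, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(closed_form)
print(lstsq_solution)
```

The two printed vectors should agree up to rounding error, which is a quick sanity check on the algebra rather than a proof.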
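You should also check for yourself that the OLS estimators $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$ are unbiased, i.e.
(4.1.13) $E(\hat\beta_0) = \beta_0$
(4.1.14) $E(\hat\beta_1) = \beta_1$
(4.1.15) $E(\hat\beta_2) = \beta_2$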
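The variances and covariances of these OLS estimators are readily deduced. With $S_{11}$, $S_{22}$ and $S_{12}$ denoting the sums of squares and cross products of the deviations of $X_1$ and $X_2$ from their sample means, we have:
(4.1.16) $\mathrm{Var}(\hat\beta_1) = \dfrac{\sigma^2 S_{22}}{S_{11} S_{22} - S_{12}^2}$
(4.1.17) $\mathrm{Var}(\hat\beta_2) = \dfrac{\sigma^2 S_{11}}{S_{11} S_{22} - S_{12}^2}$
(4.1.18) $\mathrm{Cov}(\hat\beta_1, \hat\beta_2) = -\dfrac{\sigma^2 S_{12}}{S_{11} S_{22} - S_{12}^2}$
Using the squared sample correlation coefficient between $X_1$ and $X_2$:
(4.1.19) $r_{12}^2 = \dfrac{S_{12}^2}{S_{11} S_{22}}$
the formulas for $\mathrm{Var}(\hat\beta_1)$ and $\mathrm{Var}(\hat\beta_2)$ can be rewritten
(4.1.20) $\mathrm{Var}(\hat\beta_1) = \dfrac{\sigma^2}{S_{11}(1 - r_{12}^2)}$
(4.1.21) $\mathrm{Var}(\hat\beta_2) = \dfrac{\sigma^2}{S_{22}(1 - r_{12}^2)}$
Similar calculations show that:
(4.1.22) $\mathrm{Cov}(\hat\beta_1, \hat\beta_2) = -\dfrac{\sigma^2 r_{12}}{(1 - r_{12}^2)\sqrt{S_{11} S_{22}}}$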
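The variance formulas for the slope estimators can be cross-checked against the general matrix expression $\sigma^2 (X'X)^{-1}$. The sketch below assumes simulated regressors and an arbitrary disturbance variance `sigma2`; the helper `S` and all variable names are hypothetical and serve only this illustration.

```python
import numpy as np

# Simulated regressors; sigma2 is an assumed, arbitrary disturbance variance.
rng = np.random.default_rng(1)
N = 100
X1 = rng.normal(size=N)
X2 = 0.8 * X1 + 0.5 * rng.normal(size=N)
sigma2 = 1.5

def S(a, b):
    """Sum of cross products of deviations from the sample means."""
    return np.sum((a - a.mean()) * (b - b.mean()))

S11, S22, S12 = S(X1, X1), S(X2, X2), S(X1, X2)
D = S11 * S22 - S12 ** 2
r12_sq = S12 ** 2 / (S11 * S22)             # squared sample correlation

# Variances and covariance in terms of sums of squares and cross products.
var_b1 = sigma2 * S22 / D
var_b2 = sigma2 * S11 / D
cov_b12 = -sigma2 * S12 / D

# The same variances written with the squared correlation coefficient.
var_b1_r = sigma2 / (S11 * (1.0 - r12_sq))
var_b2_r = sigma2 / (S22 * (1.0 - r12_sq))

# General matrix formula sigma^2 (X'X)^{-1}, with an intercept column.
X = np.column_stack([np.ones(N), X1, X2])
V = sigma2 * np.linalg.inv(X.T @ X)

print(var_b1, var_b1_r, V[1, 1])
print(var_b2, var_b2_r, V[2, 2])
print(cov_b12, V[1, 2])
```

The form with $r_{12}^2$ makes the role of correlated regressors visible: as $r_{12}^2$ approaches 1 the denominators shrink and both variances grow.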
Evidently, many results derived in the simple regression have immediate extensions to multiple regression. But some new points arise of which we have to be aware.
4.2 The gross versus the partial effect of an explanatory variable
In order to gain insight into this problem, it is enough to start with the multiple regression (4.1.1). Hence, we specify:
(4.2.1) $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
To facilitate the discussion we assume, for the time being, that the explanatory variables $X_1$ and $X_2$ are also random variables.
In this situation the above regression is by no means the only regression we might think of. For instance, we might also consider:
(4.2.2) $Y_i = \alpha_0 + \alpha_1 X_{1i} + u_{1i}$
or
(4.2.3) $Y_i = \gamma_0 + \gamma_2 X_{2i} + u_{2i}$
where $u_{1i}$ and $u_{2i}$ are the random disturbances in the regressions of $Y$ on $X_1$ and of $Y$ on $X_2$, respectively.
In the regressions (4.2.1) and (4.2.2), the coefficients of $X_1$ ($\beta_1$ in (4.2.1) and $\alpha_1$ in (4.2.2)) both show the impact of $X_1$ on $Y$, but obviously $\beta_1$ and $\alpha_1$ can be quite different. In (4.2.1), $\beta_1$ shows the effect of $X_1$ when we control for $X_2$. Hence, $\beta_1$ shows the partial or net effect of $X_1$ on $Y$. In contrast, $\alpha_1$ shows the effect of $X_1$ when we do not control for $X_2$. Intuitively, when $X_2$ is excluded from the regression, $X_1$ will absorb some of the effect of $X_2$ on $Y$, since there will normally be a correlation between $X_1$ and $X_2$. Therefore, we call $\alpha_1$ the gross effect of $X_1$. Of course, similar arguments apply to the comparison of $\beta_2$ and $\gamma_2$, the coefficients of $X_2$ in the regressions (4.2.1) and (4.2.3). As econometricians we often wish to evaluate the effect of an explanatory variable on the dependent variable. Since the gross and partial effects can be quite different, we have to be cautious and not too categorical when we interpret the regression parameters. Evidently, this is a serious problem in any statistical application in the social sciences.
A comprehensive Danish investigation studied the relation between mortality and jogging. It compared the mortality of two groups: one consisting of regular joggers and the other of people who did not jog. The researchers found that the mortality rate in the group of regular joggers was considerably lower than in the group of non-joggers. This result may be reasonable and expected, but was the picture that simple? A closer study showed that the joggers were better educated, smoked less, and almost nobody in this group had weight problems. Could these factors help to explain the lower mortality rate for the joggers? A further study of this sample showed that although these factors had systematic influences on the mortality rate, the jogging activity still reduced the mortality rate.
In econometrics, and in the social sciences in general, similar considerations apply to almost any applied work. Therefore, we should like to shed some specific light on the relation between the gross and partial effects. Intuitively, we understand that the root of this problem is the correlation between the explanatory variables, in our model the correlation between $X_1$ and $X_2$. So when $X_2$ is excluded from the regression, some of the influence of $X_2$ on $Y$ is captured by $X_1$.
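In order to be precise, let us assume that
(4.2.4) $X_{2i} = \delta_0 + \delta_1 X_{1i} + v_i$
where $v_i$ denotes the disturbance term in this regression. ((4.2.4) shows why it is convenient to assume that $X_1$ and $X_2$ are random variables in this illustration.)
Using (4.2.4) to substitute for $X_{2i}$ in (4.2.1) we obtain:
(4.2.5) $Y_i = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) X_{1i} + (\varepsilon_i + \beta_2 v_i)$
Comparing this equation with (4.2.2) we obtain:
(4.2.6) $u_{1i} = \varepsilon_i + \beta_2 v_i$
(4.2.7) $\alpha_0 = \beta_0 + \beta_2 \delta_0$
(4.2.8) $\alpha_1 = \beta_1 + \beta_2 \delta_1$
We observe immediately that if there is no linear relation between $X_1$ and $X_2$, i.e. $\delta_1 = 0$, then the gross effect coincides with the partial (net) effect, since in this case $\alpha_1 = \beta_1$ (see (4.2.8)).
The intercept term $\alpha_0$ will be a mixture of the intercepts $\beta_0$ and $\delta_0$ and the partial effect of $X_2$, namely $\beta_2$.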
Equations (4.2.7)-(4.2.8) show the relations between the structural parameters; we still have to show that the OLS estimators satisfy the same relations. They do!
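Let us verify this fact for the slope relation (see (4.2.8)). We know from (4.1.7) and (4.1.8) that:
(4.2.9) $\hat\beta_1 = \dfrac{S_{22} S_{1Y} - S_{12} S_{2Y}}{D}$
(4.2.10) $\hat\beta_2 = \dfrac{S_{11} S_{2Y} - S_{12} S_{1Y}}{D}$
where $D = S_{11} S_{22} - S_{12}^2$, and $S_{jk} = \sum_i (X_{ji} - \bar X_j)(X_{ki} - \bar X_k)$ and $S_{jY} = \sum_i (X_{ji} - \bar X_j)(Y_i - \bar Y)$ denote the sums of cross products of deviations from the sample means.
From the simple regression formulas we know that:
(4.2.11) $\hat\alpha_0 = \bar Y - \hat\alpha_1 \bar X_1$
(4.2.12) $\hat\alpha_1 = \dfrac{S_{1Y}}{S_{11}}$
where $\hat\alpha_0$ and $\hat\alpha_1$ are obtained by regressing $Y$ on $X_1$.
Similarly, by estimating the regression (4.2.4) we obtain:
(4.2.13) $\hat\delta_0 = \bar X_2 - \hat\delta_1 \bar X_1$
(4.2.14) $\hat\delta_1 = \dfrac{S_{12}}{S_{11}}$
Piecing the various equations together we obtain:
(4.2.15) $\hat\beta_1 + \hat\beta_2 \hat\delta_1 = \dfrac{S_{22} S_{1Y} - S_{12} S_{2Y} + \frac{S_{12}}{S_{11}} (S_{11} S_{2Y} - S_{12} S_{1Y})}{D} = \dfrac{S_{1Y} (S_{11} S_{22} - S_{12}^2)}{S_{11} D} = \dfrac{S_{1Y}}{S_{11}}$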
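Thus, we have confirmed that:
(4.2.16) $\hat\alpha_1 = \hat\beta_1 + \hat\beta_2 \hat\delta_1$
so that the OLS estimators satisfy (4.2.8). In a similar way we can show that:
(4.2.17) $\hat\alpha_0 = \hat\beta_0 + \hat\beta_2 \hat\delta_0$
verifying (4.2.7).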
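The relation between the estimated gross and partial effects holds exactly in any sample, which can be illustrated numerically. The sketch below uses simulated data with arbitrary coefficients; the helper `ols` and the names `a`, `b`, `d` for the estimates in the short regression, the long regression and the auxiliary regression of the second regressor on the first are hypothetical choices made for this illustration.

```python
import numpy as np

# Simulated data (assumed for illustration); X2 depends linearly on X1.
rng = np.random.default_rng(2)
N = 500
X1 = rng.normal(size=N)
X2 = 1.0 + 0.7 * X1 + rng.normal(size=N)
Y = 2.0 + 1.0 * X1 + 3.0 * X2 + rng.normal(size=N)

def ols(y, *regressors):
    """OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b0, b1, b2 = ols(Y, X1, X2)   # long regression: partial effects
a0, a1 = ols(Y, X1)           # short regression: gross effect of X1
d0, d1 = ols(X2, X1)          # auxiliary regression of X2 on X1

print("gross slope:", a1, " partial slope + b2*d1:", b1 + b2 * d1)
print("gross intercept:", a0, " b0 + b2*d0:", b0 + b2 * d0)
```

Note that the equalities are algebraic identities of least squares, so they hold up to rounding error whatever the true coefficients are.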
We also observe that a zero sample correlation between $X_1$ and $X_2$, $r_{12} = 0$, implies that $\hat\delta_1 = 0$ (see (4.2.14)). By (4.2.16) we then have $\hat\alpha_1 = \hat\beta_1$.
Therefore, if $X_1$ and $X_2$ are uncorrelated, then the gross and partial effects of $X_1$ coincide.
The specification issue treated in this section is important and interesting, but at the same time challenging, since it is capable of undermining any econometric specification. Many textbooks treat it under the heading “omitted variable bias”. In my opinion, treating this as a bias problem is not the proper approach. In order to substantiate this view, we take (4.1.1) as the starting point, but now assume that $X_1$ and $X_2$ are random. Skipping the details, we simply assume that the conditional expectation of $Y$ given $X_1$ and $X_2$ can be written:
(4.2.18) $E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
If our model is incomplete in that $X_2$ has not been included in the specification, then, evidently, the conditional expectation of $Y$ given $X_1$ can be written:
(4.2.19) $E(Y \mid X_1) = \beta_0 + \beta_1 X_1 + \beta_2 E(X_2 \mid X_1)$
In general $E(X_2 \mid X_1)$ can be an arbitrary function of $X_1$. However, if we stick to the linearity assumption by assuming:
(4.2.20) $E(X_2 \mid X_1) = \delta_0 + \delta_1 X_1$
equation (4.2.19) will lead to the regression function:
(4.2.21) $E(Y \mid X_1) = \alpha_0 + \alpha_1 X_1$
where $\alpha_0 = \beta_0 + \beta_2 \delta_0$ and $\alpha_1 = \beta_1 + \beta_2 \delta_1$ are expressed by (4.2.7) and (4.2.8).
The point of this lesson is that (4.2.18) and (4.2.21) are simply two different regression functions, each perfectly legitimate in its own right. To say that $\alpha_1$ is in any respect biased is simply a misuse of language.
4.3 $R^2$ and the adjusted $R^2$
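In section (2.3) we defined the coefficient of determination $R^2$. We recall:
(4.3.1) $R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{SSR}{TSS}$
where the explained sum of squares is $ESS = \sum_{i=1}^{N} (\hat Y_i - \bar Y)^2$, the total sum of squares is $TSS = \sum_{i=1}^{N} (Y_i - \bar Y)^2$, and the sum of squared residuals is $SSR = \sum_{i=1}^{N} \hat\varepsilon_i^2$.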
Since $R^2$ never decreases when a new variable is added to a regression, an increase in $R^2$ does not imply that adding a new variable actually improves the fit of the model. In this sense $R^2$ gives an inflated estimate of how well the regression fits the data. One way to correct for this is to deflate $R^2$ by a certain factor. The outcome is the so-called adjusted $R^2$, denoted $\bar R^2$.
$\bar R^2$ is a modified version of $R^2$ that does not necessarily increase when a new variable is added to the regression equation. $\bar R^2$ is defined by:
(4.3.2) $\bar R^2 = 1 - \dfrac{N-1}{N-k-1} \cdot \dfrac{SSR}{TSS}$
where $N$ is the number of observations and
$k$ is the number of explanatory variables.
There are a few things to be noted about $\bar R^2$. First, the ratio $(N-1)/(N-k-1)$ is always larger than 1, so that $\bar R^2$ is always less than $R^2$. Second, adding a new variable has two opposite effects on $\bar R^2$. On the one hand, SSR falls, which increases $\bar R^2$. On the other hand, the factor $(N-1)/(N-k-1)$ increases, which reduces $\bar R^2$. Whether $\bar R^2$ increases or decreases depends on which of the two effects is the stronger. Third, an increase in $\bar R^2$ does not necessarily mean that the coefficient of the added variable is statistically significant. To find out whether an added variable is statistically significant, we have to perform a statistical test, for example a t-test. Finally, a high $\bar R^2$ does not necessarily mean that we have specified the most appropriate set of explanatory variables. Specifying econometric models is difficult. We face observability and data problems at every turn, but, in general, we ought to remember that the specified model should have a sound basis in economic theory.
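The behaviour of the two fit measures when a regressor is added can be illustrated with a short computation. The sketch below is based on assumptions: the data are simulated, `irrelevant` is a regressor that is unrelated to the dependent variable by construction, and `r2_and_adjusted` is a hypothetical helper that evaluates the coefficient of determination and its adjusted version from SSR and TSS.

```python
import numpy as np

# Simulated data; 'irrelevant' plays no role in generating Y.
rng = np.random.default_rng(3)
N = 60
X1 = rng.normal(size=N)
X2 = rng.normal(size=N)
irrelevant = rng.normal(size=N)
Y = 1.0 + 0.5 * X1 - 0.7 * X2 + rng.normal(size=N)

def r2_and_adjusted(y, *regressors):
    """Return (R^2, adjusted R^2) for an OLS regression with intercept."""
    n, k = len(y), len(regressors)
    X = np.column_stack([np.ones(n), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ssr = np.sum(resid ** 2)                 # sum of squared residuals
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    r2 = 1.0 - ssr / tss
    adj = 1.0 - (n - 1) / (n - k - 1) * ssr / tss
    return r2, adj

r2_small, adj_small = r2_and_adjusted(Y, X1, X2)
r2_big, adj_big = r2_and_adjusted(Y, X1, X2, irrelevant)

print("k = 2:", r2_small, adj_small)
print("k = 3:", r2_big, adj_big)
```

The unadjusted measure cannot fall when the extra regressor is added, whereas the adjusted measure may move in either direction depending on how much SSR is reduced.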
4.4 Hypothesis testing in multiple regression
We have seen above that adding a second explanatory variable to a regression did not demand any new principle as regards estimation. The OLS estimators could be derived by an immediate extension of the “one explanatory variable” case. Much the same can be said about hypothesis testing. We can, therefore, just as well start with a multiple regression containing $k$ explanatory variables. Hence we specify:
(4.4.1) $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$
where, as usual, $\varepsilon_i$ denotes the random disturbances.
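Suppose we wish to test a simple hypothesis on one of the slope coefficients, for example $\beta_j$. Hence, suppose we wish to test:
(4.4.2) $H_0: \beta_j = \beta_j^0$ against $H_1: \beta_j \neq \beta_j^0$ (or against a one-sided alternative, $\beta_j > \beta_j^0$ or $\beta_j < \beta_j^0$)
In chapter 3.1 we showed in detail the relevant procedures for testing $H_0$ against these alternatives in the simple regression. Similar procedures can be applied in this case. We start with the test statistic
(4.4.3) $T = \dfrac{\hat\beta_j - \beta_j^0}{SE(\hat\beta_j)}$, which is t-distributed with $(N-k-1)$ degrees of freedom when $H_0$ is true.
In (4.4.3) $\hat\beta_j$ denotes the OLS estimator of $\beta_j$ and $SE(\hat\beta_j)$ is an estimator of the standard deviation of $\hat\beta_j$.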
In the general case (4.4.1) we only have to remember that in order to get an unbiased estimator of $\sigma^2$ (the variance of the disturbances) we have to divide the sum of squared residuals by $(N-k-1)$. Note that in the simple regression $k = 1$, so that $(N-k-1)$ reduces to $(N-2)$. By similar arguments we deduce that the test statistic $T$ given by (4.4.3) has a t-distribution with $(N-k-1)$ degrees of freedom when the null hypothesis is true. With this modification we can follow the procedures described in chapter (3.1). Hence, we can apply the simple t-tests, but we have to choose the appropriate number of degrees of freedom in the t-distribution; remember that fact!
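The computation of such a t-statistic can be sketched as follows. This is a minimal illustration on simulated data with two explanatory variables; the function name `t_statistic` and the variable names are hypothetical. The sketch stops at the test statistic and the degrees of freedom, and the comparison with a critical value from a t-table is left out.

```python
import numpy as np

# Simulated data with k = 2 explanatory variables (assumed for illustration).
rng = np.random.default_rng(4)
N, k = 120, 2
X1 = rng.normal(size=N)
X2 = rng.normal(size=N)
Y = 0.5 + 1.2 * X1 + 0.0 * X2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), X1, X2])
XtX_inv = np.linalg.inv(X.T @ X)
coef = XtX_inv @ X.T @ Y                    # OLS estimates
resid = Y - X @ coef
dof = N - k - 1                             # degrees of freedom
sigma2_hat = resid @ resid / dof            # unbiased estimator of sigma^2
se = np.sqrt(np.diag(sigma2_hat * XtX_inv)) # estimated standard errors

def t_statistic(j, beta_j0):
    """T for H0: beta_j = beta_j0, where j = 0 refers to the intercept."""
    return (coef[j] - beta_j0) / se[j]

print("degrees of freedom:", dof)
print("T for H0: beta_1 = 0:", t_statistic(1, 0.0))
print("T for H0: beta_2 = 0:", t_statistic(2, 0.0))
```

Dividing the sum of squared residuals by `dof` rather than by `N` is the only place where the number of explanatory variables enters the calculation.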
The t-tests are not restricted to testing simple hypotheses on the intercept or the various slope parameters. They can also be used to test hypotheses involving linear combinations of the regression coefficients.
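For instance, if we wish to test the null hypothesis:
(4.4.4) $H_0: \beta_1 = \beta_2$ against $H_1: \beta_1 \neq \beta_2$
we realize that this can be done with an ordinary t-test. The point is that these hypotheses are equivalent to the hypotheses:
(4.4.5) $H_0': \beta_1 - \beta_2 = 0$ against $H_1': \beta_1 - \beta_2 \neq 0$
so that if we reject $H_0'$ we should also reject $H_0$, etc.
In order to test $H_0'$ against $H_1'$, we use the test statistic:
(4.4.6) $T = \dfrac{\hat\beta_1 - \hat\beta_2}{\widehat{SE}(\hat\beta_1 - \hat\beta_2)}$
When $H_0'$ is true, $T$ will have a t-distribution with $(N-k-1)$ degrees of freedom. Note that the standard error of $\hat\beta_1 - \hat\beta_2$ can be estimated by the formulas:
(4.4.7) $\widehat{SE}(\hat\beta_1 - \hat\beta_2) = \sqrt{\widehat{\mathrm{Var}}(\hat\beta_1 - \hat\beta_2)}$
where
(4.4.8) $\widehat{\mathrm{Var}}(\hat\beta_1 - \hat\beta_2) = \widehat{\mathrm{Var}}(\hat\beta_1) + \widehat{\mathrm{Var}}(\hat\beta_2) - 2\,\widehat{\mathrm{Cov}}(\hat\beta_1, \hat\beta_2)$
When estimates of $\mathrm{Var}(\hat\beta_1)$, $\mathrm{Var}(\hat\beta_2)$ and $\mathrm{Cov}(\hat\beta_1, \hat\beta_2)$ are available, we can easily compute the value of the test statistic $T$. After that we continue as with the usual t-tests.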
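A test of equality of two slope coefficients can be sketched numerically as below. The data are simulated with equal slopes by construction, and the helper `fit` is a hypothetical function returning the OLS coefficients together with their estimated covariance matrix. As a cross-check, which is an addition for illustration, the same statistic is obtained from a reparametrized regression on the first regressor and the sum of the two regressors, where the coefficient of the first regressor equals the difference of the two slopes.

```python
import numpy as np

# Simulated data (assumed); the two slopes are equal by construction.
rng = np.random.default_rng(5)
N = 150
X1 = rng.normal(size=N)
X2 = 0.5 * X1 + rng.normal(size=N)
Y = 1.0 + 0.8 * X1 + 0.8 * X2 + rng.normal(size=N)

def fit(y, X):
    """OLS coefficients and their estimated covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    # X has k + 1 columns, so the divisor below is N - k - 1.
    sigma2_hat = resid @ resid / (len(y) - X.shape[1])
    return coef, sigma2_hat * XtX_inv

X = np.column_stack([np.ones(N), X1, X2])
coef, V = fit(Y, X)

# Estimated variance of the difference of the two slope estimators.
var_diff = V[1, 1] + V[2, 2] - 2.0 * V[1, 2]
T = (coef[1] - coef[2]) / np.sqrt(var_diff)

# Cross-check: Y = b0 + (b1 - b2) X1 + b2 (X1 + X2) + disturbance.
X_rep = np.column_stack([np.ones(N), X1, X1 + X2])
coef_rep, V_rep = fit(Y, X_rep)
T_rep = coef_rep[1] / np.sqrt(V_rep[1, 1])

print("T from the covariance formula:", T)
print("T from the reparametrized regression:", T_rep)
```

The reparametrization is convenient in practice because standard regression output then reports the required t-statistic directly.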
Although the t-tests are not restricted to the simplest situations, we will quickly face test situations that these tests cannot handle. As an example we consider a model from labor market economics: suppose that wages $W$ depend on workers’ education $\mathit{ED}$ and experience $\mathit{EX}$. In order to investigate the dependency of $W$ on $\mathit{ED}$ and $\mathit{EX}$, we specify the regression:
(4.4.9) $W_i = \beta_0 + \beta_1 \mathit{ED}_i + \beta_2 \mathit{EX}_i + \beta_3 \mathit{EX}_i^2 + \varepsilon_i$
where $\varepsilon_i$ denote the usual disturbance terms.
Note that the presence of the quadratic term $\mathit{EX}_i^2$ does not create problems for estimating the regression coefficients. It is hardly even a small hitch. We only have to define the new variable:
(4.4.10) $Z_i = \mathit{EX}_i^2$
The regression (4.4.9) becomes:
(4.4.11) $W_i = \beta_0 + \beta_1 \mathit{ED}_i + \beta_2 \mathit{EX}_i + \beta_3 Z_i + \varepsilon_i$
Suppose now that we are uncertain whether workers’ experience $\mathit{EX}$ has any effect on the wages $W$. In order to settle this issue we have to test a joint null hypothesis, namely:
(4.4.12) $H_0: \beta_2 = 0$ and $\beta_3 = 0$ against $H_1: \beta_2 \neq 0$ and/or $\beta_3 \neq 0$
In this case the null hypothesis restricts the values of two of the coefficients, so as a matter of terminology we can say that the null hypothesis in (4.4.12) imposes two restrictions on the multiple regression model, namely $\beta_2 = 0$ and $\beta_3 = 0$. In general, a joint hypothesis is a hypothesis which imposes two or more restrictions on the regression coefficients.
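It might be tempting to think that we could test the joint hypothesis (4.4.12) by using the usual t-statistics to test the restrictions one at a time. But this testing procedure is unreliable, because the individual t-statistics are generally correlated and the overall significance level of such a procedure differs from the level chosen for each individual test. Luckily, there exist test procedures which can handle joint hypotheses on the regression coefficients.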
So, how can we proceed to test the joint hypothesis (4.4.12)? If the null hypothesis is true, the regression (4.4.11) becomes:
(4.4.13) $W_i = \beta_0 + \beta_1 \mathit{ED}_i + \varepsilon_i$
Obviously, we have to investigate two regressions, the one given by (4.4.11) and the other given by (4.4.13). Since there are no restrictions on (4.4.11) it is called the unrestricted form, while (4.4.13) is called the restricted form of the regression. It is very natural to base a test of the joint null hypothesis (4.4.12) on the sums of squared residuals resulting from these two regressions. If $SSR_R$ denotes the sum of squared residuals obtained from (4.4.13) and $SSR_U$ denotes that obtained from (4.4.11), we will be doubtful about the truth of $H_0$ if $SSR_R$ is considerably greater than $SSR_U$. If $SSR_R$ is only slightly larger than $SSR_U$, there is no reason to be doubtful about $H_0$.
Since $SSR_R$ stems from the restricted regression (4.4.13), we obviously have:
(4.4.14) $SSR_R \geq SSR_U$
In order to test joint hypotheses on the regression coefficients, the standard approach is to use a so-called F-test. In our present example this test is very intuitive. In the general case it is based on the test statistic:
(4.4.15) $F = \dfrac{(SSR_R - SSR_U)/r}{SSR_U/(N-k-1)}$
where $SSR_R$ and $SSR_U$ are the sums of squared residuals from the restricted and the unrestricted regression, $r$ denotes the number of restrictions and $k$ denotes the number of explanatory variables in the unrestricted regression.
In our example above: $r = 2$ and $k = 3$.
If $H_0$ is true, then $F$ has a so-called Fisher distribution (F-distribution) with $(r, N-k-1)$ degrees of freedom. The numerator has $r$ degrees of freedom, and the denominator $(N-k-1)$.
From (4.4.14) it is obvious that the test statistic $F$ takes only non-negative values. Small values of $F$ indicate that $H_0$ is compatible with the sample data.
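The F-statistic for the wage example can be computed as in the following sketch. Everything numerical here is an assumption made for illustration: the wage data are simulated, the coefficient values are arbitrary, and `W`, `ED`, `EX` and the helper `ssr` are hypothetical names. The sketch stops at the value of the statistic; the comparison with a critical value of the F-distribution with $(r, N-k-1)$ degrees of freedom is not included.

```python
import numpy as np

# Simulated wage data (assumed): education ED, experience EX, wage W.
rng = np.random.default_rng(6)
N = 300
ED = rng.uniform(8, 18, size=N)
EX = rng.uniform(0, 40, size=N)
W = 5.0 + 1.0 * ED + 0.3 * EX - 0.005 * EX ** 2 + rng.normal(scale=2.0, size=N)

def ssr(y, *regressors):
    """Sum of squared OLS residuals of y on an intercept and the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

ssr_u = ssr(W, ED, EX, EX ** 2)   # unrestricted: education, experience, experience^2
ssr_r = ssr(W, ED)                # restricted: both experience terms dropped

r, k = 2, 3                       # number of restrictions, regressors in unrestricted model
F = ((ssr_r - ssr_u) / r) / (ssr_u / (N - k - 1))

print("SSR restricted:", ssr_r)
print("SSR unrestricted:", ssr_u)
print("F statistic:", F)
```

Because the simulated wages do depend on experience, the restricted sum of squared residuals should be clearly larger than the unrestricted one here, giving a large value of the statistic.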