Chapter 6 Multivariate Regression

In this chapter, we extend simple regression analysis to include more than one explanatory (i.e., right-hand side) variable. This type of regression is called multivariate regression, in contrast to simple regression, which has only one explanatory variable. The basic idea behind using more than one explanatory variable in a multivariate regression can be seen through a nesting of simple regressions. Here is how we can show this.

Assume you have the multivariate regression model

yₜ = α + β₁x₁ₜ + β₂x₂ₜ + εₜ

for t = 1, 2, 3, ..., T. We can actually estimate this equation using simple regression in two steps. In the first step, regress yₜ on x₁ₜ and get the residual, say ỹₜ, and then regress x₂ₜ on x₁ₜ and get the residual, say x̃₂ₜ. Next, in the second step, regress ỹₜ on x̃₂ₜ without a constant.[1] This can be written as

ỹₜ = β₂x̃₂ₜ + uₜ

This last simple regression yields an estimate for β₂. The reason for this is that the first-step regressions split each variable into a fitted part plus a residual, yₜ = â + b̂x₁ₜ + ỹₜ and x₂ₜ = ĉ + d̂x₁ₜ + x̃₂ₜ, so the original multivariate regression model above can be written as

â + b̂x₁ₜ + ỹₜ = α + β₁x₁ₜ + β₂(ĉ + d̂x₁ₜ + x̃₂ₜ) + εₜ

which is

ỹₜ = (α + β₂ĉ − â) + (β₁ + β₂d̂ − b̂)x₁ₜ + β₂x̃₂ₜ + εₜ

which, because least-squares residuals have zero mean and are uncorrelated with x₁ₜ (so the constant and the x₁ₜ term contribute nothing to the second-step fit), can be written as

ỹₜ = β₂x̃₂ₜ + uₜ

or, in words: regressing the residual ỹₜ on the residual x̃₂ₜ recovers β₂.
In other words, we can estimate the multivariate model by using simple regression in an iterative and informative way. Multivariate regression (as shown above) is where we first predict, as best we can, both yₜ and x₂ₜ using the variable x₁ₜ. We then predict the remaining unexplained part of yₜ using the remaining unexplained part of x₂ₜ. The coefficient β₂ represents the impact of this remaining part of x₂ₜ on the remaining part of yₜ. This is how we can understand the coefficients in a multiple regression. Each coefficient measures the impact of its variable on yₜ after deducting out (or partialling out) all that can be explained of the two variables using all the other variables in the model. What can be done for two-variable models can be done for three-variable models, four-variable models, and so on.
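This partialling-out equivalence is easy to verify numerically. Below is a minimal sketch in Python/NumPy (the chapter itself uses gretl; the simulated data and the helper function `ols` are mine, for illustration only). The slope from the residual-on-residual regression coincides with the x₂ coefficient from the full multivariate regression.

```python
import numpy as np

# Simulated data: x2 is correlated with x1, so partialling-out matters.
rng = np.random.default_rng(0)
T = 200
x1 = rng.normal(size=T)
x2 = 0.5 * x1 + rng.normal(size=T)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=T)

def ols(y, X):
    # Least-squares coefficients of y on the columns of X
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(T)

# The full multivariate regression: y on (constant, x1, x2)
b_full = ols(y, np.column_stack([const, x1, x2]))

# Step 1: residuals of y and of x2 after regressing each on (constant, x1)
X1 = np.column_stack([const, x1])
y_tilde = y - X1 @ ols(y, X1)
x2_tilde = x2 - X1 @ ols(x2, X1)

# Step 2: regress residual on residual WITHOUT a constant
b2 = ols(y_tilde, x2_tilde.reshape(-1, 1))[0]

print(b_full[2], b2)  # the two estimates of the x2 coefficient coincide
```

The agreement is exact (up to floating-point rounding), which is the Frisch–Waugh–Lovell result described above.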

Of course, gretl does not estimate multivariate models this way. It would be too cumbersome. It will not be necessary to show HOW gretl estimates the coefficients in a multivariate regression. Just as you probably are not able to explain how a car works in detail, you can nevertheless drive a car with ease and confidence. Similarly, you will be able to use gretl to estimate and evaluate multivariate regressions with ease and confidence.

Let’s consider an example in detail using gretl and highlighting theoretical and statistical points along the way.

Example 1: (Cross Sectional Data) In Appendix 1 to this Chapter, I have collected data on the percentage of votes Donald Trump won for each state in the US 2016 presidential election. I also looked up approximate percentages of African-Americans (I will refer to this group as Blacks for simplicity of notation) and Latinos or Hispanics (I will refer to this group as Latinos for simplicity) in each state's population. The percentage of whites or Caucasians (I will refer to this group as whites for simplicity) would be roughly 100% minus the sum of the two other racial percentages (there remain Asians and other groups). Please understand that a legitimate statistical analysis of the election would require vastly more complicated and thorough modeling than this poor little analysis we will consider. Nevertheless, our goal is merely to show how to do multivariate regression and how to evaluate the results. That is, we want to consider estimation and hypothesis testing associated with our regression.

The data set is a cross section. It encompasses all 50 states in America in 2016. Therefore, our number of observations is 50, one for each state. You can see this clearly in the data set shown in Appendix 1. We have one observation on (%Trump, %Black, %Latino) for each state, 50 observations in all.

The question we are asking is whether there is a pattern between racial densities in state populations and the percentage won by Trump. So, we will run a multivariate regression of the dependent variable, %Trump, on the explanatory variables, %Black and %Latino. We can write the regression equation as

%Trumpₜ = α + β₁(%Blackₜ) + β₂(%Latinoₜ) + εₜ

for t = 1, 2, 3, ..., 50.[2] We would expect that both β₁ and β₂ would be negative and statistically significant. That is, greater percentages of Blacks and Latinos in the population should reduce the percentage of votes going to Trump.

Using gretl and running the regression (I will show in class how), we get the following output.

Figure 1 Multivariate Output on 2016 US Presidential Election

The results in Figure 1 show that the percentage of Blacks in a state's population had essentially ZERO effect on the outcome of the election. However, the impact of Latinos was statistically significant and had the expected sign (negative). You might be thinking that Blacks and Latinos do not live in the same states. You may think that in states where Black percentages are high, Latino percentages are low. But you would be wrong. The correlation between the two explanatory variables is low and statistically insignificant (although it is negative). The "sample" or actual mean of the dependent variable is 49.578, but Trump's percentage of votes would have been hypothetically 53.46 if Blacks and Latinos had not been part of the population. That's a hefty rise of nearly 4 percentage points in support due to racial percentages. The regression seems to be saying that race had an effect, but it was mainly concentrated in the Latino vote and not the Black vote. Actually, this runs somewhat contrary to the general consensus that Trump did well with Latinos by garnering 30% of the Latino vote.
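As a cross-check, the point estimates and the regressor correlation can be re-computed outside gretl. Here is a sketch in Python/NumPy using the Appendix 1 data (gretl's Figure 1 remains the authoritative output; this just replicates the OLS algebra).

```python
import numpy as np

# (%Trump, %Black, %Latino), one row per state, in the order of Appendix 1
data = np.array([
    (52.9, 4.27, 6.1), (62.9, 27.38, 4.1), (60.4, 15.76, 6.8), (48.1, 5.16, 30.2),
    (32.8, 6.67, 38.2), (43.3, 4.28, 21.0), (41.2, 10.34, 14.2), (41.9, 20.95, 8.6),
    (48.6, 15.91, 23.2), (50.5, 31.4, 9.2), (30.0, 3.08, 9.5), (51.2, 2.68, 5.3),
    (59.2, 1.95, 11.6), (39.4, 15.88, 16.3), (57.2, 10.07, 6.3), (57.2, 6.15, 11.0),
    (62.5, 8.2, 3.2), (58.1, 32.4, 4.5), (33.5, 8.1, 10.1), (35.3, 30.1, 8.7),
    (45.3, 1.03, 1.4), (47.3, 14.24, 4.6), (45.4, 4.57, 4.9), (57.1, 11.49, 3.7),
    (58.3, 38.3, 2.9), (56.5, 0.67, 3.1), (49.9, 21.6, 8.7), (64.1, 1.08, 2.5),
    (54.6, 4.5, 9.7), (46.5, 1.22, 3.0), (41.8, 14.46, 18.5), (40.0, 2.97, 47.0),
    (45.5, 9.0, 27.3), (37.5, 15.18, 18.2), (51.3, 12.04, 3.3), (65.3, 7.96, 9.3),
    (41.1, 2.01, 12.2), (48.2, 10.79, 6.1), (39.8, 7.5, 13.2), (54.9, 28.48, 5.3),
    (61.5, 1.14, 3.1), (61.1, 16.78, 4.8), (52.6, 11.91, 38.2), (45.9, 1.27, 13.3),
    (44.4, 19.91, 8.4), (32.6, 1.87, 1.6), (38.2, 3.74, 11.7), (47.2, 6.07, 6.2),
    (68.7, 3.58, 1.3), (70.1, 1.29, 9.5),
])
trump, black, latino = data.T

# Multivariate OLS: %Trump on a constant, %Black, %Latino
X = np.column_stack([np.ones(len(trump)), black, latino])
b = np.linalg.lstsq(X, trump, rcond=None)[0]  # [intercept, b_black, b_latino]

# Correlation between the two explanatory variables
corr = np.corrcoef(black, latino)[0, 1]
print("intercept:", b[0], "slopes:", b[1], b[2], "corr(black, latino):", corr)
```

The Latino slope comes out negative, the intercept exceeds the sample mean of %Trump, and the correlation between the two regressors is negative, all in line with the discussion above.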

The R² and adj-R² are not particularly high in this regression. The former was 0.144 while the latter was 0.108. This leaves about 85% or more of the variation in Trump's percentage of support among the 50 states to be explained by other factors. An interesting article in the Guardian (a very left-wing UK paper) discussed how white women drove the Trump phenomenon. Amazingly, a majority of white women voted for Trump, although college-educated white women voted more for Hillary Clinton. Black and Latino women voted overwhelmingly for Clinton, especially Black women at over 90%.

Since the data are cross-sectional, we need not concern ourselves with autocorrelation. Autocorrelation is essentially a problem for regressions using time series data, or more generally for data where the order of the observations matters.

The F-test shown in Figure 1 tests the hypothesis H₀: β₁ = β₂ = 0, or that both slope coefficients are zero. It does not test whether the intercept is zero; it tests only the slopes. But, unlike the t-test for significance, it is testing whether the slopes are all zero, not just a single slope coefficient. The F density under the null hypothesis is shown in Figure 2. Note that the p-value is lower than 0.05 and thus we can reject the null that the two coefficients are both zero. The F-test does not establish which of the two coefficients is non-zero, or whether both are non-zero. It shows whether the model we have put forward is significantly better at predicting the dependent variable than a single constant. This is exactly the function of the R² statistic. In fact, there is a functional relation between the R² goodness-of-fit statistic and the F test statistic for zero slopes. Here is the exact relation between the two:

F = (R²/m) / ((1 − R²)/(T − K))

where m is the number of slope restrictions and K is the total number of estimated coefficients, including the intercept.

The numbers 2 and 47 associated with the test are called the numerator degrees of freedom and denominator degrees of freedom. The numerator degrees of freedom equal the number of restrictions in the null hypothesis (in this case there are two restrictions, namely β₁ = 0 and β₂ = 0). The denominator degrees of freedom equal the total number of observations (T = 50) minus the number of coefficients in the regression (β₁ and β₂ plus the intercept α), which in this case gives 47 = 50 − 3.

Obviously, as R² gets close to 1, the F statistic becomes unbounded. And if R² approaches 0, the statistic becomes very small. Thus, we reject H₀ if R² is sufficiently high, and do not reject the hypothesis that the slopes are all zero if R² is sufficiently low. This seems very reasonable.
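The R²–F relation can be checked with the numbers from the election regression (R² = 0.144, with 2 and 47 degrees of freedom). The sketch below is illustrative Python; F comes out near 3.95, and since the 5% critical value of F(2, 47) is roughly 3.2, the joint null is rejected, matching the p-value below 0.05 reported in Figure 1.

```python
# F statistic for H0: all slope coefficients are zero, computed from R².
def f_from_r2(r2, m, df_denom):
    # m = numerator df (number of restrictions), df_denom = T - K
    return (r2 / m) / ((1.0 - r2) / df_denom)

# Numbers from the election regression: R² = 0.144, m = 2, T - K = 47
F = f_from_r2(0.144, 2, 47)
print(round(F, 2))
```

Note the monotonicity: a higher R² always produces a higher F for fixed degrees of freedom, which is exactly the "reject when R² is high" logic above.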

Figure 2 The F Density and Testing for Zero Slopes (Check Figure 1)

There are many other statistical tests we can implement, but the t-test and F-test will prove to be very important to testing in general.

Example 2: (Time Series Data)

In this example, we want to analyze whether faster growth of the money supply tends to increase the rate of consumer price inflation. Although it seems a simple and intuitive idea, you should be warned that it is not as simple as it seems. There are many factors to consider in determining inflation, and the growth of money is only one of them. In fact, our initial regression will give us quite different results from what our expectations led us to believe. This is not unusual, because there are many competing factors at play.

Let's begin with the simplest model, which says that the rate of consumer price inflation is a linear function of the rate of growth in broad money (M2). We can use the following approximation for the rate of growth in a variable X:

gXₜ = log Xₜ − log Xₜ₋₁ ≈ (Xₜ − Xₜ₋₁)/Xₜ₋₁

Therefore, if we have data on the money supply, M, and the price level, P, we can create the growth rates of money and the price level by using

gMₜ = log Mₜ − log Mₜ₋₁

and

πₜ = log Pₜ − log Pₜ₋₁
The first time series regression we can try is the following:

πₜ = α + βgMₜ + εₜ

This is a simple regression since there is only one right-hand side variable. The output is shown below in Figure 3. Note that the estimated slope coefficient is clearly insignificant. The p-value is equal to 0.8254. The estimated coefficient is negative, but the t-test indicates that the underlying theoretical slope coefficient, β, is zero. This means that over the sample, there does not appear to be any linear relation between inflation and money growth. How could this be possible?
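The construction of the growth rates and this simple regression can be replicated from the Appendix 2 data. Here is a Python/NumPy sketch (gretl's Figure 3 is the authoritative output; the gretl run may differ slightly in sample or growth definition, but the flavor, a tiny R² and an insignificant slope, comes through).

```python
import numpy as np

# Annual CPI and M2 levels, 1981-2016, from Appendix 2
cpi = np.array([90.933, 96.533, 99.583, 103.933, 107.6, 109.692, 113.617, 118.275,
                123.942, 130.658, 136.167, 140.308, 144.475, 148.225, 152.383,
                156.858, 160.525, 163.008, 166.583, 172.192, 177.042, 179.867,
                184.0, 188.908, 195.267, 201.558, 207.344, 215.254, 214.565,
                218.076, 224.923, 229.596, 232.964, 236.715, 236.995, 240.024])
m2 = np.array([1678.7, 1829.8, 2050.5, 2218.1, 2413.9, 2615.1, 2779.2, 2932.8,
               3051.0, 3222.7, 3342.5, 3403.8, 3438.4, 3481.5, 3550.3, 3721.8,
               3910.5, 4188.9, 4497.1, 4767.6, 5178.0, 5565.1, 5953.1, 6233.7,
               6499.8, 6838.9, 7262.0, 7755.8, 8382.0, 8590.0, 9215.0, 10016.0,
               10696.9, 11355.4, 12011.3, 12824.8])

# Growth rates as log differences, per the approximation above
infl = np.diff(np.log(cpi))   # inflation
gm = np.diff(np.log(m2))      # money growth

# Simple regression: infl_t = a + b * gm_t + e_t
X = np.column_stack([np.ones(len(gm)), gm])
a, b = np.linalg.lstsq(X, infl, rcond=None)[0]

resid = infl - X @ np.array([a, b])
r2 = 1.0 - resid.var() / infl.var()
print("slope:", b, "R2:", r2)
```

With 36 annual observations the log-difference yields 35 growth-rate observations, and the fit is extremely weak.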

Figure 3 OLS Regression of πₜ on gMₜ

Some observers might think that there is a delayed reaction. They might think that money growth in the past must have an impact on the level of inflation now. So, we might try regressing the consumer price inflation rate on current and past growth rates of money. Figure 4 shows the results of this analysis, and once again it is surprisingly weak. None of the estimated slope coefficients are significant. What is even more troublesome is that the F-statistic for zero slopes on all the variables is insignificant at F = 0.500895. This means that even lagged money growth rates (going back two years) have no effect on inflation. Again, how is this possible? What is more, the regression coefficients are mainly negative, though insignificant. The R² is incredibly low at 0.001496.

Figure 4 OLS Regression of πₜ on gMₜ, gMₜ₋₁, and gMₜ₋₂

Can we explain these facts? Should this regression revolutionize our ideas about the effect of money on inflation? The answer to the first question is yes, while the answer to the second question is no.

Let's take a different tack to the problem.[3] Suppose that the money supply over a period of a year is well controlled by the government or monetary authority. In addition, suppose that this monetary authority or central bank is charged with the responsibility of maintaining a stable level of inflation.[4] In that case, we would expect that a positive change in inflation would be met by a reduction in the growth of money, and vice versa. In this case, we would expect that the regression would be inverted, with money growth as the dependent variable and the change in inflation as an explanatory variable. This is how it would look:

gMₜ = α + βΔπₜ + γgMₜ₋₁ + εₜ

where we expect that β is negative. This is because government officials would try to raise interest rates and reduce spending if inflation were climbing higher. They would do this by selling government bonds, and this would drive bond prices down and interest rates up. However, selling bonds to banks would mean that money would be retired out of the banking system and the economy. This would reduce the money supply. Here is the output from gretl. Note that I added a lagged dependent variable, gMₜ₋₁, to reduce the autocorrelation in the regression. This was successful, and the coefficients appear to all be statistically significant and of the expected sign. The lagged dependent variable indicates that there is some inertia in implementing policy. That is, the long run effect of policy is greater than the short run effect of policy. This is because policymakers do not know whether the current data is precise enough to make substantial changes in the money supply. They therefore adjust somewhat slowly to changes in inflation, spreading the effect over time. An immediate change in policy (short run effect = long run effect) would require that the coefficient on lagged money be zero. Instead, the estimate is about 0.60.

What should the long run growth in M2 in the US be if there is no change in inflation? According to this model, we would have to say that in the long run (setting Δπₜ = 0 and gMₜ = gMₜ₋₁ = gM) it should be

gM = α̂/(1 − γ̂) ≈ 5.5%

This says that M2 money should grow over the long run at about 5.5%. Here are the recent growth rates of M2 in the US. Obviously, the growth of M2 is well above its long run average of 5.7% in most recent years. It also exceeds our estimate of what it should be to hold inflation down.
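The long-run arithmetic behind the 5.5% figure can be sketched as follows. The lagged coefficient c = 0.60 is the estimate quoted above; the intercept value used here is a HYPOTHETICAL stand-in (the text reports only the lagged coefficient and the implied long-run rate, not the intercept itself), chosen so that the result matches the reported 5.5%.

```python
# Long-run money growth implied by the rule gM_t = a + b*dpi_t + c*gM_{t-1}:
# in a steady state with dpi = 0 and gM_t = gM_{t-1} = gM, we get gM = a / (1 - c).
def long_run_growth(a, c):
    if not 0.0 <= c < 1.0:
        raise ValueError("lagged coefficient must lie in [0, 1) for a stable long run")
    return a / (1.0 - c)

# c = 0.60 from the text; a = 0.022 is a hypothetical intercept chosen only
# so the long-run rate matches the reported 5.5%.
print(long_run_growth(0.022, 0.60))
```

The division by (1 − c) is the "long run greater than short run" effect: with c = 0.60, every short-run impulse is eventually multiplied by 1/(1 − 0.60) = 2.5.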

Figure 5 Taylor-like Rule for Money

Appendix 1: Data Set on US Presidential Election 2016

State / %Trump / %Black / %Latino
Alaska / 52.9 / 4.27 / 6.1
Alabama / 62.9 / 27.38 / 4.1
Arkansas / 60.4 / 15.76 / 6.8
Arizona / 48.1 / 5.16 / 30.2
California / 32.8 / 6.67 / 38.2
Colorado / 43.3 / 4.28 / 21
Connecticut / 41.2 / 10.34 / 14.2
Delaware / 41.9 / 20.95 / 8.6
Florida / 48.6 / 15.91 / 23.2
Georgia / 50.5 / 31.4 / 9.2
Hawaii / 30 / 3.08 / 9.5
Iowa / 51.2 / 2.68 / 5.3
Idaho / 59.2 / 1.95 / 11.6
Illinois / 39.4 / 15.88 / 16.3
Indiana / 57.2 / 10.07 / 6.3
Kansas / 57.2 / 6.15 / 11
Kentucky / 62.5 / 8.2 / 3.2
Louisiana / 58.1 / 32.4 / 4.5
Massachusetts / 33.5 / 8.1 / 10.1
Maryland / 35.3 / 30.1 / 8.7
Maine / 45.3 / 1.03 / 1.4
Michigan / 47.3 / 14.24 / 4.6
Minnesota / 45.4 / 4.57 / 4.9
Missouri / 57.1 / 11.49 / 3.7
Mississippi / 58.3 / 38.3 / 2.9
Montana / 56.5 / 0.67 / 3.1
NorthCarolina / 49.9 / 21.6 / 8.7
NorthDakota / 64.1 / 1.08 / 2.5
Nebraska / 54.6 / 4.5 / 9.7
NewHampshire / 46.5 / 1.22 / 3
NewJersey / 41.8 / 14.46 / 18.5
NewMexico / 40 / 2.97 / 47
Nevada / 45.5 / 9 / 27.3
NewYork / 37.5 / 15.18 / 18.2
Ohio / 51.3 / 12.04 / 3.3
Oklahoma / 65.3 / 7.96 / 9.3
Oregon / 41.1 / 2.01 / 12.2
Pennsylvania / 48.2 / 10.79 / 6.1
RhodeIsland / 39.8 / 7.5 / 13.2
SouthCarolina / 54.9 / 28.48 / 5.3
SouthDakota / 61.5 / 1.14 / 3.1
Tennessee / 61.1 / 16.78 / 4.8
Texas / 52.6 / 11.91 / 38.2
Utah / 45.9 / 1.27 / 13.3
Virginia / 44.4 / 19.91 / 8.4
Vermont / 32.6 / 1.87 / 1.6
Washington / 38.2 / 3.74 / 11.7
Wisconsin / 47.2 / 6.07 / 6.2
WestVirginia / 68.7 / 3.58 / 1.3
Wyoming / 70.1 / 1.29 / 9.5

Appendix 2 Time Series Data on Price Level (CPI) and Money Supply (M2)

DATE / CPI / M2
1/1/1981 / 90.933 / 1678.7
1/1/1982 / 96.533 / 1829.8
1/1/1983 / 99.583 / 2050.5
1/1/1984 / 103.933 / 2218.1
1/1/1985 / 107.6 / 2413.9
1/1/1986 / 109.692 / 2615.1
1/1/1987 / 113.617 / 2779.2
1/1/1988 / 118.275 / 2932.8
1/1/1989 / 123.942 / 3051
1/1/1990 / 130.658 / 3222.7
1/1/1991 / 136.167 / 3342.5
1/1/1992 / 140.308 / 3403.8
1/1/1993 / 144.475 / 3438.4
1/1/1994 / 148.225 / 3481.5
1/1/1995 / 152.383 / 3550.3
1/1/1996 / 156.858 / 3721.8
1/1/1997 / 160.525 / 3910.5
1/1/1998 / 163.008 / 4188.9
1/1/1999 / 166.583 / 4497.1
1/1/2000 / 172.192 / 4767.6
1/1/2001 / 177.042 / 5178
1/1/2002 / 179.867 / 5565.1
1/1/2003 / 184 / 5953.1
1/1/2004 / 188.908 / 6233.7
1/1/2005 / 195.267 / 6499.8
1/1/2006 / 201.558 / 6838.9
1/1/2007 / 207.344 / 7262
1/1/2008 / 215.254 / 7755.8
1/1/2009 / 214.565 / 8382
1/1/2010 / 218.076 / 8590
1/1/2011 / 224.923 / 9215
1/1/2012 / 229.596 / 10016
1/1/2013 / 232.964 / 10696.9
1/1/2014 / 236.715 / 11355.4
1/1/2015 / 236.995 / 12011.3
1/1/2016 / 240.024 / 12824.8

[1] If you inadvertently add a constant, you will find that the estimate of the constant is exactly zero. This is because the dependent and independent variables in this second-step regression are least-squares residuals from regressions that included a constant, so their sample means are both exactly zero by construction.

[2] The regression we are postulating is not valid. The reason is that the dependent variable is bounded between 0 and 100 percent, but the error term ranges between minus infinity and infinity. Nevertheless, it will be instructive to run the regression since it represents a quick and easy approximation to the better model called a logistic regression model. Fortunately, the results will not be too different. Another more serious and subtle problem with this type of analysis is that the dependent variable covers ALL 50 states and therefore cannot be considered a random sample, since there were no states left out. The population mean and the “sample” mean are therefore the same! This destroys all the inferential properties of the estimates. Believe me, there is plenty to complain about in this very elementary regression analysis. This should serve as a warning to you that just because you can run a regression, doesn’t mean it is a sensible regression - statistically speaking.

[3] A "tack" is a course or an approach (the word has nautical origins).

[4] In fact, something like this actually exists and is called the Taylor Rule, although it involves the control of interest rates instead of the money supply and includes deviations from the natural growth of the economy as an explanatory variable as well.