Econometric Analysis of Panel Data

Professor William Greene Phone: 212.998.0876

Office: KMC 7-90 Home page: people.stern.nyu.edu/wgreene

Email:

URL for course web page:

www.stern.nyu.edu/~wgreene/Econometrics/PanelDataEconometrics.htm

Final Examination: Spring 2012

This is a ‘take home’ examination. Today is Tuesday, May 1, 2012. Your answers are due on Monday, May 14, 2012. You may use any resources you wish – textbooks, computer, the web, etc. – but please work alone and submit only your own answers to the questions.

The eight parts of the exam are weighted as follows:

Part I. Literature 10

Part II. The Hausman and Taylor Estimator 10

Part III. Panel Data Regressions 20

Part IV. Fixed Effects 20

Part V. Instrumental Variable and GMM Estimation 10

Part VI. Binary Choice Models 25

Part VII. Sample Selection 25

Part VIII. A Loglinear Model 30

Note, in parts of the exam in which you are asked to report the results of computation, please filter your response so that you present the numerical results as part of an organized discussion of the question. Do not submit long, unannotated pages of computer output.

Part I. Literature

Locate a published study in a field that interests you that uses a panel data based methodology. Describe in a short paragraph (no more than 200 words) the study and the estimation method(s) used by the author(s).

Part II. The Hausman and Taylor Estimator

Write out a statement of the procedure that Hausman and Taylor devised for estimation of the parameters in a panel data model in which some independent variables are correlated with the time invariant part of the disturbance in a random effects model. Show how the Arellano/Bond/Bover estimator uses the Hausman and Taylor result.

Part III. Panel Data Regressions

(This part of the exam requires some computation. Do this exercise with Stata, SAS, Eviews, R, MatLab, LIMDEP (or NLOGIT), or any other software you wish to use.)

The course website contains a panel data set on production by northern Spanish dairy farms, in Excel spreadsheet form,

http://people.stern.nyu.edu/wgreene/Econometrics/dairy.csv

The file is also available in .xls form which can substitute for the .csv form, in .txt form as a plain text file, and in .lpj form as an nlogit project file. The data set is a balanced panel of 247 farms observed for 6 years, a total of 1,482 observations. The variables in the file are the output, MILK, and four physical inputs, F1=cows, F2=land, F3=labor, F4=feed. YIT is log(MILK). The ‘x’ variables are the logs of the factors. The logs are in mean deviation form, so that the overall sample means of X1,X2,X3,X4 are zero. The squares and cross products of the logged variables are denoted X11, X12, etc. The data set also includes a complete set of year dummy variables, YEAR93,…,YEAR98.

The basic model of interest is

Yit = b1X1it + b2X2it + b3X3it + b4X4it + ci + eit

This is a Cobb-Douglas production function with group effect, ci.

a. Fit the ‘pooled’ model and report your results.

b. Fit a random effects model and a fixed effects model. Use your model results to decide which is the preferable model. If you find that neither panel data model is preferred to the pooled model, show how you reached that conclusion. As part of the analysis, test the hypothesis that there are no ‘farm effects.’

c. Assuming that there are ‘latent individual (farm) effects,’ the asymptotic covariance matrix that is computed for the pooled estimator in part a., s²(X′X)⁻¹, is inappropriate. What estimator can be computed for the covariance matrix of the pooled estimator that will give appropriate standard errors? Compute this corrected covariance matrix. Compare the results to the uncorrected estimator.

d. The hypothesis of constant returns to scale would be that

H0: b1 + b2 + b3 + b4 = 1

Test this hypothesis in the context of the model in a. and in the context of your preferred model in part b. Do you reach the same conclusion in both cases?

e. The translog model adds to the Cobb-Douglas model all of the squares and cross products (of the logs) variables. For your four factor model, this will add 10 new variables. (Note, X1*X2 = X2*X1, etc. so not all possible products are added to the model, just the unique ones.) Test the hypothesis of the Cobb-Douglas model against the more general translog model and report your findings.

f. Many authors suggest that it is a good idea to include ‘time’ effects in a common effects model such as the one you have fit above. Fit your preferred model in part b. with time effects as well as group effects, then test the hypothesis that there is no separate time variation apart from the variation in the regressors in the model.
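For part c., one commonly used correction is a cluster-robust (sandwich) covariance matrix that sums the products of regressors and residuals within each farm before forming the outer product. The following is a minimal sketch on synthetic data, not the dairy data. The function name `pooled_ols_with_cluster_cov`, the simulated panel, and the parameter values are assumptions made for illustration, and no finite-sample degrees-of-freedom correction is applied.

```python
import numpy as np

def pooled_ols_with_cluster_cov(X, y, groups):
    """Pooled OLS with the conventional and a cluster-robust covariance matrix.

    The robust matrix is (X'X)^-1 [sum_i X_i' e_i e_i' X_i] (X'X)^-1,
    with no finite-sample degrees-of-freedom correction.
    """
    n_obs, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)
    e = y - X @ b
    conventional = (e @ e) / (n_obs - k) * XtX_inv
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        rows = groups == g
        score = X[rows].T @ e[rows]      # sum over t of x_it * e_it for group g
        meat += np.outer(score, score)
    robust = XtX_inv @ meat @ XtX_inv
    return b, conventional, robust

# Synthetic balanced panel (an assumption for illustration, not the dairy data).
rng = np.random.default_rng(0)
n_farms, n_years = 200, 6
groups = np.repeat(np.arange(n_farms), n_years)
farm_effect = rng.normal(size=n_farms)[groups]
x = rng.normal(size=n_farms)[groups] + rng.normal(size=n_farms * n_years)
y = 1.0 + 0.5 * x + farm_effect + rng.normal(size=n_farms * n_years)
X = np.column_stack([np.ones_like(x), x])

b, conventional, robust = pooled_ols_with_cluster_cov(X, y, groups)
print("coefficients:     ", b)
print("conventional s.e.:", np.sqrt(np.diag(conventional)))
print("cluster s.e.:     ", np.sqrt(np.diag(robust)))
```

With a sizable group effect, the cluster-robust standard errors would typically be noticeably larger than the conventional ones. Packaged routines usually apply a small-sample scaling, so their numbers may differ slightly from this sketch.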

Part IV. Fixed Effects

(This part of the exam requires some computation. Do this exercise with Stata, SAS, Eviews, R, MatLab, LIMDEP (or NLOGIT), or any other software you wish to use.)

We can write the fixed effects linear model as

y = Xb + Da + e

where X is the nT×K matrix of data on the regressors and D is the nT×n matrix of group dummy variables. In this form, it is obvious that the OLS estimates of b can be obtained by least squares regression of y on (X, D). An alternative way to compute the LSDV coefficients is to linearly regress y on (X, X̄), where X̄ is the nT×K matrix that contains the K group means, replicated for each observation in the group.

a. Recompute the ‘pooled model’ that you fit in part IIIa. while adding the group means of X1, X2, X3, X4 to the regression. (Hint: You do not need an overall constant term. Just regress YIT on the 8 variables, i.e., the 4 raw variables and the 4 group means.) Verify (empirically) that this approach gives the same results as the LSDV estimator you computed in part IIIb.

b. Prove this result algebraically. Hint: X̄ = D(D′D)⁻¹D′X = (I − M_D)X. Now, use the Frisch-Waugh theorem.
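The equivalence asked about in this part can be checked numerically before it is proved. The sketch below uses a small synthetic panel (an assumption for illustration, not the dairy data) in which the regressors are correlated with the group effect. It builds the dummy variable matrix D, forms the matrix of replicated group means as D(D′D)⁻¹D′X, and compares the slopes from the two regressions. All variable names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, K = 50, 6, 4
groups = np.repeat(np.arange(n), T)
c = rng.normal(size=n)

# Regressors are correlated with the group effect c_i.
X = rng.normal(size=(n * T, K)) + c[groups, None]
beta = np.array([0.6, 0.1, 0.05, 0.4])
y = X @ beta + c[groups] + 0.1 * rng.normal(size=n * T)

# D is the nT x n matrix of group dummy variables.
D = (groups[:, None] == np.arange(n)[None, :]).astype(float)

# Group means replicated within each group: Xbar = D (D'D)^-1 D'X.
Xbar = D @ np.linalg.solve(D.T @ D, D.T @ X)

# LSDV: regress y on (X, D) and keep the first K coefficients.
b_lsdv = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)[0][:K]

# Alternative: regress y on (X, Xbar), with no overall constant.
b_means = np.linalg.lstsq(np.hstack([X, Xbar]), y, rcond=None)[0][:K]

print("LSDV slopes:       ", b_lsdv)
print("group-means slopes:", b_means)
```

The two slope vectors should agree to numerical precision. Only the coefficients on X coincide; the coefficients on the group means are not the dummy variable coefficients.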

Part V. Instrumental Variable and GMM Estimation

(This part does not require you to do any estimation.)

For the setting in Part III, suppose we theorized that the capital variable, land (X2), was part of a ‘capital stock’ that also included the intellectual capital of the farmer (accumulated wisdom, etc.) and that this stock of capital was persistent through time, though not quite time invariant. We might be convinced to specify the following random effects model,

yit = b1X1it + b3X3it + b4X4it + g yi,t-1 + ui + eit

where ui is a random effect.

a. Is the OLS estimator a consistent estimator of the parameters of this model? Explain.

b. Suppose it were the case that LAND (X2) was time invariant and, moreover, was handed down from generation to generation in northern Spain, so that the level of LAND was exogenous in this panel by any definition. Show how the model could be fit using an instrumental variable estimator.
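Although this part requires no estimation, the mechanics of an instrumental variable estimator may be easier to describe with a concrete computation in view. The sketch below shows generic two-stage least squares, b_IV = (X′P_Z X)⁻¹X′P_Z y, on synthetic cross-section data with one endogenous regressor and one instrument. It does not take a position on which instruments are valid for the dynamic model above. The function name `two_stage_least_squares` and the data generating process are assumptions for illustration.

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """b_IV = (X'P_Z X)^-1 X'P_Z y, where P_Z projects onto the columns of Z."""
    # First stage: fitted values of X from a regression on the instruments.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage, written as (X_hat'X)^-1 X_hat'y.
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Synthetic data: x is correlated with the disturbance u through v.
rng = np.random.default_rng(2)
n_obs = 5000
z = rng.normal(size=n_obs)                   # instrument, independent of u
v = rng.normal(size=n_obs)
u = 0.8 * v + 0.6 * rng.normal(size=n_obs)   # disturbance
x = z + v                                    # endogenous regressor
y = 1.0 * x + u                              # true slope is 1.0

X = x[:, None]
Z = z[:, None]
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_iv = two_stage_least_squares(y, X, Z)
print("OLS slope:", b_ols[0])
print("IV slope: ", b_iv[0])
```

In this design the OLS slope is pushed away from the true value of 1.0 by the correlation between x and u, while the IV slope should be close to it.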


Part VI. Binary Choice Models

(This part of the exam requires some computation. Do this exercise with Stata, SAS, Eviews, R, MatLab, LIMDEP (or NLOGIT), or any other software you wish to use.)

The course website describes the “German Manufacturing Innovation Data.” The actual data are not published on the course website. We will use them for purposes of this exercise, however. You can obtain them by downloading

http://www.stern.nyu.edu/~wgreene/Econometrics/probit-panel.lpj

as well as .csv, .xls and .txt formats. This data set contains 1,270 firms and 5 years of data for 6,350 observations in total – a balanced panel. The variables that you need for this exercise are described on the course home page, in the ‘Panel Data Sets’ area. I am interested in a binary choice model for the innovation variable, IP. You will fit your model using at least three of the independent variables in the data set. With respect to the model you specify,

A. THEORY

(a) If you fit a pooled probit model, there is the possibility that you might be ignoring unobserved heterogeneity (effects). Wooldridge argues that when one fits a probit model while ignoring unobserved heterogeneity, the raw coefficient estimator (MLE) is inconsistent for b, but the quantity of interest, the “Average Partial Effects” might well be estimated appropriately. Explain in detail what he has in mind here.

(b) Suppose we were to estimate a ‘fixed effects’ probit model by “brute force,” just by including the 1,270 dummy variables needed to create the empirical model. What would the properties of the resulting estimator of b likely be?

(c) Describe Chamberlain’s consistent slope estimator for the fixed effects logit model.

(d) Describe in detail how to fit a random effects logit model using quadrature and using simulation for the part of the computations where they would be necessary, under the assumption that the effects are normally distributed and uncorrelated with the other included exogenous variables.

(e) Using the random effects logit model that you described in part (d), describe how you would test the hypothesis that the same logit model applies to the four different sectors in the data set (CONSGOOD, FOOD, RAWMTL, INVGOOD).
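For part (d), the piece of the log likelihood that needs quadrature or simulation is each firm's contribution, the integral over the normally distributed effect of the product of the logit probabilities for that firm's observations. The sketch below evaluates one such contribution both ways, for given values of the index x′b and the standard deviation of the effect. The function names, the example data, and the choices of 32 quadrature nodes and 20,000 draws are assumptions for illustration. A full estimator would sum these contributions over firms and maximize over the parameters.

```python
import numpy as np

def group_loglik_quadrature(y, xb, sigma, n_nodes=32):
    """Log of the integral of prod_t Lambda[q_t (xb_t + sigma*v)] phi(v) dv,
    approximated by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    v = np.sqrt(2.0) * nodes                    # change of variable for N(0,1)
    q = 2.0 * y - 1.0                           # +1 if y = 1, -1 if y = 0
    index = q[:, None] * (xb[:, None] + sigma * v[None, :])
    joint = (1.0 / (1.0 + np.exp(-index))).prod(axis=0)
    return np.log((weights * joint).sum() / np.sqrt(np.pi))

def group_loglik_simulation(y, xb, sigma, draws):
    """Same integral, approximated by averaging over standard normal draws."""
    q = 2.0 * y - 1.0
    index = q[:, None] * (xb[:, None] + sigma * draws[None, :])
    joint = (1.0 / (1.0 + np.exp(-index))).prod(axis=0)
    return np.log(joint.mean())

# One hypothetical firm with T = 5 observations.
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
xb = np.array([0.2, -0.1, 0.4, 0.0, 0.3])
sigma = 1.0

rng = np.random.default_rng(3)
draws = rng.normal(size=20000)

ll_quad = group_loglik_quadrature(y, xb, sigma)
ll_sim = group_loglik_simulation(y, xb, sigma, draws)
print("quadrature:", ll_quad)
print("simulation:", ll_sim)
```

The two values should be close, with the simulated one varying with the draws. In estimation, the same draws are normally reused at every iteration so that the simulated log likelihood is a smooth function of the parameters.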

B. PRACTICE

(a) Fit a pooled probit model using your specification. Use any software you wish for the estimation. Provide all relevant estimation results.

(b) Fit a random effects probit model.

(c) Use the Mundlak approach to approximate a fixed effects probit model. Recall this means adding the group means of the time varying variables to the model, then using a random effects model.

(d) The Hausman test is not usable for binary choice models because of the incidental parameters problem. However, you can do something like a Hausman test by testing the joint significance of the group means you added in part (c). Carry out the test, then suggest which model you prefer (statistically), random or fixed effects.

Part VII. Sample Selection

(Part d of this section of the exam requires some computation. Do this exercise with Stata, SAS, Eviews, R, MatLab, LIMDEP (or NLOGIT), or any other software you wish to use. The healthcare data may be downloaded from the Panel Data Sets area of the course home page, in .csv, .txt or .lpj format.)

This exercise is based on the German health care data that we have used in numerous class examples this semester. We begin with a probit model,

HEALTHYit = 1[zit′a + uit > 0], where z = (one, AGE, HHNINC (income), MARRIED).

HEALTHY is defined as a binary variable that equals 1 if HSAT > 6, and 0 if less than or equal to 6. We follow with an (admittedly, not very well specified) model for the number of doctor visits,

DOCVISit = b1 + b2 AGEit + b3 EDUCit + b4 PUBLICit + eit

Suppose we now seek to examine the behavior of HEALTHY people. (We thus propose to ignore the unhealthy ones.) We consider the case in which the latent heterogeneity variables (uit and eit) that enter the two equations are correlated (bivariate normal).

(a) What problem would arise if we simply use linear regression to estimate the parameters of the DOCVIS equation?

(b) Does the problem go away if we simply change the regression to

DOCVISit = b1 + b2 AGEit + b3 EDUCit + b4 PUBLICit + g HEALTHYit + eit

and use the full data set to run our linear regression for the DOCVIS equation?

(c) For the moment, ignore the discrete nature of DOCVIS, and think of it as the dependent variable in a linear regression. Show how analysis of these data for the observations for which HEALTHY = 1 might translate into a familiar “sample selection” model. Describe in detail the computations one would do to fit such a model.

(d) Use the selection model that you described in (c) to analyze this variable, treating it as continuous. Fit a sample selection model, using several regressors that you select from the data set. (Note, we are ignoring the panel nature of the data set.) For those of you not using Stata or LIMDEP to do the computations, standard textbooks such as Greene describe the computations needed to compute the appropriate asymptotic covariance matrix. If you are able to do the computations, describe what you computed. If not, describe the computations you would program if you could.

(e) Suppose we change the model for DOCVIS to recognize its discrete nature. The model for DOCVIS is assumed to be a Poisson regression with conditional mean function

E[DOCVISit | xit, eit] = exp(b1 + b2 AGEit + b3 EDUCit + b4 PUBLICit + eit)

How does this change the estimation strategy that you will use to fit the parameters of the model? (You need not do the calculations for this part. Just describe them.)

(f) Suppose the two equations of the model are extended to include common random effects,

HEALTHYit = 1[zit′a + uit + wi > 0]

where z = (one, AGE, HHNINC (income), MARRIED).

We follow with an (admittedly, not very well specified) model for the number of doctor visits,

DOCVISit = b1 + b2 AGEit + b3 EDUCit + b4 PUBLICit + eit + vi.

Describe one of the methods that has been proposed to analyze this model. (This part does not require you to do any estimation.)
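For part (c), the familiar two-step treatment of the linear selection model rests on the result that, under bivariate normality, the mean of the regression disturbance among selected observations is proportional to the inverse Mills ratio, λ(a) = φ(a)/Φ(a), evaluated at the probit index. The sketch below computes λ and checks that result by simulation. The correlation, standard deviation, index value, and function names are assumptions for illustration, and the sketch does not fit the model to the health care data.

```python
import math
import numpy as np

def normal_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def normal_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def inverse_mills(a):
    """lambda(a) = phi(a) / Phi(a), the regressor added in the second step."""
    return normal_pdf(a) / normal_cdf(a)

# Simulate (u, e) bivariate normal with correlation rho; Var(u) = 1.
rng = np.random.default_rng(4)
n_draws = 400_000
rho, sigma_e, index = 0.5, 2.0, 0.3      # hypothetical values
u = rng.normal(size=n_draws)
e = sigma_e * (rho * u + math.sqrt(1.0 - rho**2) * rng.normal(size=n_draws))

selected = index + u > 0                 # selection rule: z'a + u > 0
simulated = e[selected].mean()           # mean of e among selected draws
theoretical = rho * sigma_e * inverse_mills(index)
print("simulated mean of e given selection:", simulated)
print("rho * sigma * lambda(index):        ", theoretical)
```

In the two-step procedure, λ is evaluated at the estimated probit index for each selected observation and added to the second-step regression. The conventional least squares standard errors from that step are not appropriate without a correction.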