Econometrics 1
Problem set
Section C
(Multicollinearity & serial correlation)
Qs 1. Using the data in GPA1.RAW, the following equation is obtained where the dummy variable PC equals one if a student owns a personal computer and zero otherwise.
Now let noPC be a dummy variable equal to one if the student does not own a PC, and zero otherwise.
a. If noPC is used in place of PC in the equation, what happens to the intercept in the estimated equation? What will be the coefficient on noPC? (Hint: Write PC = 1 − noPC and plug this into the equation.)
b. What will happen to the R-squared if noPC is used in place of PC?
c. Should PC and noPC both be included as independent variables in the model? Explain.
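One way to check your algebra for parts a to c numerically is sketched below. It uses synthetic data, not GPA1.RAW; the regressor `x`, the coefficient values, and the helper `ols` are assumptions made up for illustration only.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and R-squared; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2

rng = np.random.default_rng(0)
n = 200
pc = rng.integers(0, 2, size=n).astype(float)  # synthetic dummy, NOT the GPA1 data
x = rng.normal(3.0, 0.5, size=n)               # hypothetical continuous regressor
y = 1.0 + 0.2 * pc + 0.5 * x + rng.normal(0.0, 0.3, size=n)
nopc = 1.0 - pc                                # the substitution from the hint
ones = np.ones(n)

# Same model estimated twice: once with PC, once with noPC.
b_pc, r2_pc = ols(np.column_stack([ones, pc, x]), y)
b_no, r2_no = ols(np.column_stack([ones, nopc, x]), y)

# Column rank when the intercept, PC, and noPC are all included (4 columns).
rank_both = np.linalg.matrix_rank(np.column_stack([ones, pc, nopc, x]))

print("intercepts:", b_pc[0], b_no[0])
print("dummy coefficients:", b_pc[1], b_no[1])
print("R-squared:", r2_pc, r2_no)
print("rank with both dummies:", rank_both)
```

Compare the printed intercepts, dummy coefficients, and R-squared values with what the substitution PC = 1 − noPC predicts, and compare the rank with the number of columns.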
Qs 2. In an effort to explain regional wage differentials, you collect wage data from 7,338 unskilled workers, divide the country into four regions (East, South, North, and West), and estimate the following equation (standard errors in parentheses):
where: Yi = the hourly wage (in dollars) of the ith unskilled worker
Ei = a dummy variable equal to 1 if the ith worker lives in the East and 0 otherwise
Si = a dummy variable equal to 1 if the ith worker lives in the South and 0 otherwise
Wi = a dummy variable equal to 1 if the ith worker lives in the West and 0 otherwise
a. What is the omitted condition in this equation?
b. If you add a dummy variable for the omitted condition to the equation without dropping Ei, Si, or Wi, what will happen?
c. If you add a dummy variable for the omitted condition to the equation and drop Ei, what will the sign of the new variable’s estimated coefficient be?
d. Which of the following three statements is most correct? Least correct? Explain your answer.
i. The equation explains 49 percent of the variation of Y around its mean with regional variables alone, so there must be quite a bit of wage variation by region.
ii. The coefficients of the regional variables are virtually identical, so there must not be much wage variation by region.
iii. The coefficients of the regional variables are quite small compared with the average wage, so there must be much wage variation by region.
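For part b, the following sketch checks the column rank of a design matrix built from toy region labels (not the 7,338-worker sample). It assumes the fourth region, North, is the category left out of the estimated equation; the variable names are hypothetical.

```python
import numpy as np

# Toy region labels, NOT the survey data.
regions = np.array(["East", "South", "North", "West"] * 5)
dummies = {r: (regions == r).astype(float) for r in ["East", "South", "West", "North"]}
ones = np.ones(len(regions))

# Intercept plus three regional dummies (4 columns).
X_three = np.column_stack([ones, dummies["East"], dummies["South"], dummies["West"]])
# Same matrix with a dummy for the fourth region added (5 columns).
X_four = np.column_stack([X_three, dummies["North"]])

rank_three = np.linalg.matrix_rank(X_three)
rank_four = np.linalg.matrix_rank(X_four)
dummy_sum = sum(dummies.values())  # element-wise sum of the four dummies

print("columns / rank, three dummies:", X_three.shape[1], rank_three)
print("columns / rank, four dummies: ", X_four.shape[1], rank_four)
```

When the rank is smaller than the number of columns, the normal equations have no unique solution.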
Qs 3. In studying the movement in the production workers’ share in the value added (i.e., labor’s share), the following models were considered by Gujarati:
Model A: Yt = β0 + β1t + ut
Model B: Yt = α0 + α1t + α2t² + ut
where Y = labor’s share and t = time. Based on annual data for 1949–1964, the following results were obtained for the primary metal industry:
where the figures in the parentheses are t ratios.
a. Is there serial correlation in model A? In model B?
b. What accounts for the serial correlation?
c. How would you distinguish between “pure” autocorrelation and specification bias?
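The sketch below computes the Durbin-Watson d from residuals and compares the two functional forms on synthetic data. The series is generated with a quadratic trend and uncorrelated errors; it is not Gujarati's labor-share data, and all coefficient values are invented for illustration.

```python
import numpy as np

def durbin_watson(resid):
    """d = sum of squared first differences of the residuals / sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def fit_resid(X, y):
    """OLS residuals for design matrix X (intercept column included)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(1)
t = np.arange(1, 17, dtype=float)  # 16 annual observations, matching 1949-1964
# Synthetic series: quadratic in t with serially uncorrelated noise.
y = 0.5 - 0.01 * t + 0.001 * t**2 + rng.normal(0.0, 0.002, size=t.size)
ones = np.ones_like(t)

d_a = durbin_watson(fit_resid(np.column_stack([ones, t]), y))        # Model A form
d_b = durbin_watson(fit_resid(np.column_stack([ones, t, t**2]), y))  # Model B form

print("d, linear trend only:", d_a)
print("d, with t squared:   ", d_b)
```

Because the errors here are uncorrelated by construction, any difference between the two d values comes from the functional form alone, which is the distinction part c asks about.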
Qs 4. In a study of the determination of prices of final output at factor cost in the United Kingdom, the following results were obtained on the basis of annual data for the period 1951–1969:
where PF = prices of final output at factor cost, W = wages and salaries per employee, X = gross domestic product per person employed, M = import prices, Mt−1 = import prices lagged 1 year, and PFt−1 = prices of final output at factor cost in the previous year.
“Since for 18 observations and 5 explanatory variables, the 5 percent lower and upper d values are 0.71 and 2.06, the estimated d value of 2.54 indicates that there is no positive autocorrelation.” Comment.
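As an aid for the comment, the standard Durbin-Watson decision regions can be sketched as a small function. The function name and return labels are hypothetical. The test for negative autocorrelation applies the same bounds to 4 − d. The sketch only locates d relative to the bounds; it says nothing about whether the d test is appropriate given the regressors in this equation.

```python
def dw_one_sided(d, d_lower, d_upper):
    """Locate d in the Durbin-Watson regions for positive and negative
    first-order autocorrelation (H0: no autocorrelation)."""
    def region(stat):
        if stat < d_lower:
            return "reject H0"
        if stat > d_upper:
            return "do not reject H0"
        return "inconclusive"
    # The negative-autocorrelation test compares 4 - d with the same bounds.
    return {"positive": region(d), "negative": region(4.0 - d)}

# Values quoted in the statement above.
result = dw_one_sided(2.54, 0.71, 2.06)
print(result)
```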
Qs 5. The residuals from a regression when plotted against time gave the scattergram in Figure 12.12. The encircled “extreme” residual is called an outlier. An outlier is an observation whose value exceeds the values of other observations in the sample by a large amount, perhaps three or four standard deviations away from the mean value of all the observations.
a. What are the reasons for the existence of the outlier(s)?
b. If there are outliers, should those observations be discarded and the regression run on the remaining observations?
c. Is the Durbin–Watson d applicable in the presence of the outlier(s)?
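The "three or four standard deviations" rule of thumb in the definition above can be sketched as follows. The residuals are made up (they are not read from Figure 12.12), and the function name and the default threshold k = 3 are assumptions.

```python
import numpy as np

def flag_outliers(resid, k=3.0):
    """Return indices of residuals more than k sample standard deviations from the mean."""
    resid = np.asarray(resid, dtype=float)
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return np.flatnonzero(np.abs(z) > k)

# Made-up residuals: small alternating values with one extreme observation.
resid = np.where(np.arange(20) % 2 == 0, 0.5, -0.5)
resid[12] = 10.0

flagged = flag_outliers(resid)
print("flagged observation indices:", flagged.tolist())
```

Flagging an observation this way only identifies it; whether to keep or drop it is the judgment that part b asks about.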
Qs 6. W. Bowen and T. Finegan estimated the following regression equation for 78 cities (standard errors in parentheses):
where: Li = percent labor force participation (males ages 25 to 54) in the ith city
Ui = percent unemployment rate in the ith city
Ei = average earnings (hundreds of dollars/year) in the ith city
Ii = average other income (hundreds of dollars/year) in the ith city
Si = average schooling completed (years) in the ith city
Ci = percent of the labor force that is nonwhite in the ith city
Di = a dummy variable equal to 1 if the city is in the South and 0 otherwise
a. Interpret the estimated coefficients of C and D. What do they mean?
b. How likely is perfect collinearity in this equation? Explain your answer.
c. Suppose that you were told that the data for this regression were from one decade and that estimates on data from another decade yielded a much different coefficient of the dummy variable. Would this imply that one of the estimates was biased? Why or why not?