Eco420, Prof. Bill Even

OLS REGRESSION ASSIGNMENT, Fall 2013

The assignment is due Friday 9/20 by 4 p.m. Submit your assignment via email by the deadline. Late assignments will be penalized at the rate of 20 percentage points for every day (or part thereof) that the assignment is overdue. All team members will receive the same grade unless someone convinces me that I should do otherwise.

Provide a type-written response to all the questions. Paste the relevant portion of the Stata log (both the Stata commands and the output) beneath the relevant part of each question in this Word document and then provide a type-written explanation (or leave adequate space for handwritten explanations beneath the relevant Stata code and results). Be sure to include enough Stata code that I can determine exactly how you generated your data, variables, and results. If I am unable to determine what you did, I will assume that it is wrong.

The data set you will use for this exercise is contained in g:\eco\evenwe\marchcps. It is the same data set that you used in our Stata lab during class (cpsmar2011.dta), and you will use the same sample and some of the same variables that you created there.

In addition to the variables you already created for wage, age, sex, race, and education, create the following:

  1. Union: a dummy variable indicating whether a worker is covered by a union (based on a_unioncov in codebook);
  2. 4 dummy variables to describe a person’s marital status (based on a_maritl in codebook):
     • Married: if a_maritl>=1 and a_maritl<=3
     • Widowed: if a_maritl=4
     • Divorced: if a_maritl=5 or 6
     • Single: if a_maritl=7
  3. A weight variable (based on a_fnlwgt in codebook). Drop observations with a weight of zero.
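
As a minimal sketch of the marital-status and weight recodes listed above, the following uses a made-up six-row data set in place of the CPS extract. The new variable names (married, widowed, divorced, single, weight) are illustrative choices. The union dummy is left out because its coding depends on the a_unioncov values in the codebook. The sketch uses Stata's default line delimiter, so there are no trailing semicolons.

```stata
* Toy data standing in for the CPS extract (all values are made up)
clear
input a_maritl a_fnlwgt
1 1500.5
3 0
4 820
5 990.25
6 1210
7 640
end

* Marital-status dummies, following the ranges listed above
gen married  = inrange(a_maritl, 1, 3)
gen widowed  = (a_maritl == 4)
gen divorced = inlist(a_maritl, 5, 6)
gen single   = (a_maritl == 7)

* Weight variable; drop observations with a weight of zero
gen weight = a_fnlwgt
drop if weight == 0

* Sanity check: each remaining person falls in exactly one category
assert married + widowed + divorced + single == 1
```

Ending the recode with an assert is a cheap way to confirm that the four categories are mutually exclusive and exhaustive before they are used in a regression.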

QUESTIONS (relevant Stata routines are provided in italics.)

1. (16 points)

a. Estimate the mean of the wage rate for men and women (see summarize).

b. Estimate a regression of the wage rate on an intercept and the female dummy (see reg).

c. How do the coefficients in your regression relate to the mean wages calculated in 1a? Explain the connection clearly.

d. Using this simple regression, show that the mean of predicted wages for the entire sample, the male sample, and the female sample match the actual means. [See the predict command for reg.]

e. The CPS is a stratified random sample and weights are provided to allow the researcher to generate weighted averages that should match the population. The weight that you saved above is an “inverse probability weight” (i.e. the inverse of the probability that a particular household is sampled). Hence, if a household has probability p of being sampled, it should count as 1/p households (weight = 1/p) when estimating the population mean. If the CPS were a pure random sample, 1/p would be constant across households.

Type “help weight” in Stata to learn about the different types of weights allowed in Stata. Repeat (a)-(d) using weights for each calculation. Note that summarize does not allow for pweights, but does allow for aweights or fweights. If you use fweights, the weight must first be converted to an integer [replace weight=int(weight) ].

In the calculation of simple means, the choice of aweight, pweight, or fweight doesn’t matter. In all cases, the result is a simple weighted mean. In the case of regressions, the choice of weights matters for standard errors and t-statistics, but not coefficient estimates. Since the weights in the CPS are inverse probability weights, you should use pweights for the regression.
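
The weight syntax described above can be sketched on simulated data. Everything here (the variable values, the seed, and the scalar names m_aw and m_fw) is made up for illustration. The weights are generated as integers so that fweights are allowed without the int() conversion.

```stata
* Simulated stand-in data; nothing here comes from the CPS
clear
set seed 7
set obs 500
gen female = runiform() < 0.5
gen wage   = 15 + 10*runiform()
gen weight = 100 + int(900*runiform())   // integer-valued weights

* Weighted means: summarize accepts aweights and fweights, not pweights
summarize wage [aweight = weight]
scalar m_aw = r(mean)
summarize wage [fweight = weight]
scalar m_fw = r(mean)
display m_aw "  " m_fw

* Weighted regression with inverse probability weights
regress wage female [pweight = weight]
predict wagehat
summarize wagehat [aweight = weight]
```

The weight expression goes in square brackets after the variable list and before any options.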

2. (12 points)

a. Without weights, estimate the following two wage equations:

  • Specification 1: include dummy variables for sex, union membership, and education.
  • Specification 2: include same variables as in (1) plus age and age-squared.

Summarize your regression results (coefficients, t-stats) in a single table using outreg2.[1] You may copy/paste the output from outreg2 into this document. [Be sure to use outreg2 for this part of the exercise because I want you to have the ability to summarize regression results for the remainder of the semester.]
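
A minimal sketch of building a two-column table with outreg2 follows. It assumes outreg2 is already installed (see the footnote) and uses the auto data set that ships with Stata in place of the CPS. The file name example_table.doc and the column titles are placeholders.

```stata
* Assumes the third-party command outreg2 is installed
sysuse auto, clear

* First specification: start a new table and report t-statistics
regress price mpg
outreg2 using example_table.doc, replace tstat ctitle(Spec 1)

* Second specification: append a column to the same table
regress price mpg weight
outreg2 using example_table.doc, append tstat ctitle(Spec 2)
```

Using replace on the first call and append on later calls keeps all specifications side by side in one file.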

b. Based upon what we know about omitted variables bias, why does the coefficient on the union dummy change in the observed direction when the age variables are added to the regression? Use a regression of union membership on the omitted variables to show that the relevant conditions required for the observed change exist in this data set. [Note: Whenever you run a regression, a variable called e(sample) is created as an indicator for which observations were used in the regression sample. You will want to be sure that you are using the same sample for all parts of this problem. For example, after you run a regression, gen insamp=e(sample) will create a dummy indicating whether the observation was in that regression. In subsequent regressions, you can restrict it to the same sample with reg y x if insamp==1]
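
The e(sample) note above can be illustrated with simulated data in which one regressor has missing values. The variable names y, x1, and x2 and all values are hypothetical.

```stata
* Simulated data; x2 is missing for the first 10 observations
clear
set seed 2013
set obs 100
gen x1 = rnormal()
gen x2 = rnormal()
replace x2 = . in 1/10
gen y = 1 + x1 + rnormal()

* Stata drops the 10 incomplete rows, leaving 90 observations
regress y x1 x2
gen insamp = e(sample)        // 1 if the row was used, 0 otherwise

* Restrict the shorter regression to the same 90 rows
regress y x1 if insamp == 1
display e(N)
```

Without the if insamp==1 restriction, the second regression would use all 100 observations, so the two sets of estimates would come from different samples.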

3. (12 points)

a. Using the least educated group as the reference group for education and specification 2 from question 2a, test the null hypothesis that the intercept in the earnings equation is identical across all education groups. Interpret the result (i.e. indicate whether the null is rejected and at what critical level). (See the test command in Stata.)
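
As a sketch of the test syntax only, the following runs a joint test on two group dummies in simulated data. The names grp, d2, d3, and y are hypothetical and unrelated to the CPS variables.

```stata
* Simulated data with three groups; group 1 is the reference group
clear
set seed 420
set obs 300
gen grp = 1 + mod(_n, 3)
gen d2  = (grp == 2)
gen d3  = (grp == 3)
gen y   = 10 + 2*d2 + 5*d3 + rnormal()

regress y d2 d3

* Joint null hypothesis: the coefficients on d2 and d3 are both zero
test d2 d3
display r(F) "  " r(p)
```

After test, the F statistic and its p-value are stored in r(F) and r(p), which is convenient when results need to be quoted in a write-up.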

b. Repeat the test in 3a using the most educated group as the reference group.

c. How do the results of the tests in 3a and 3b compare? Be as precise as possible in your comparison. Why should you have expected this?

d. Show how the coefficients from the first specification (3a) could have been used to estimate the coefficient on the high school graduate dummy that you estimated in the second specification (3b). Explain.

4. (8 points)

Re-estimate the complete specification from (2a) using the natural log of wage in place of wage. Compare the coefficient on the female dummy in the wage and ln(wage) equation. How does the interpretation of the coefficient on the female dummy change when you switch the dependent variable from wage to ln(wage)? Do the two coefficients seem to suggest similar quantitative differences between male and female wages? Provide a numerical comparison of the implied effect of sex on wages from the two specifications.

5. (10 points)

Using the complete specification in 2b as the starting point, test the null hypothesis that the effect of union coverage is identical for men and women while allowing for different intercepts by gender but constraining all other coefficients to be equal across gender. Interpret your results. [Note – interactions are useful here.]
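
The interaction hint can be sketched generically. In the simulated data below, g is a hypothetical group dummy, d a hypothetical treatment dummy, and x a control. None of these is a CPS variable, and the sketch shows only the mechanics of building and testing an interaction term.

```stata
* Simulated data with a group dummy, a treatment dummy, and one control
clear
set seed 66
set obs 400
gen g = mod(_n, 2)
gen d = runiform() < 0.3
gen x = rnormal()
gen y = 5 + 1*g + 2*d + 0.5*g*d + x + rnormal()

* Interaction term: lets the effect of d differ between the two groups
gen g_d = g*d

* g shifts the intercept; x has a common coefficient across groups
regress y g d g_d x

* Null hypothesis: the effect of d is the same in both groups
test g_d
```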

6. (7 points) Test the hypothesis that all coefficients including the intercept (using the complete specification in 2b) are equal for men and women using ln(wage) as the dependent variable. Discuss the implications of your test statistic.

7. (12 points)

To illustrate the effect of errors-in-variables, define 2 new variables for age.

gen bage1=age+10*invnorm(uniform()) ;

gen bage2=age+50*invnorm(uniform());

invnorm(uniform()) generates a random error draw from a N(0,1) distribution.

Notice that the variance of the noise in bage2 is 25 times greater than that in bage1.
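
A quick check of the noise scale can be sketched on simulated ages. This sketch uses rnormal(), which draws directly from N(0,1), in place of invnorm(uniform()). It also sets a seed so that the draws are reproducible. The seed value and the simulated age variable are arbitrary assumptions.

```stata
* Simulated ages; the seed makes the random draws reproducible
clear
set seed 12345
set obs 10000
gen age   = 18 + int(47*runiform())
gen bage1 = age + 10*rnormal()
gen bage2 = age + 50*rnormal()

* Recover the added noise and compare its spread
gen noise1 = bage1 - age
gen noise2 = bage2 - age
summarize noise1 noise2
```

The reported standard deviations should be close to 10 and 50, so the variances differ by a factor of about 25.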

a. Estimate a wage equation with your female dummy, education dummies, union coverage, and age only. What happens to the coefficient on age as noise is added (i.e. if actual age is replaced by bage1 or bage2)? Explain why you should have expected this.

b. What happens to the coefficient on union coverage? What does this tell you about the relationship between union coverage and worker age? Use the Stata command “correlate” to examine your prediction.

8. (20 points) Use Stata to perform an Oaxaca-Blinder decomposition of the wage gap between men and women using the complete specification employed in 2b (without the female dummy). Use the results to identify:

a. how much of the wage gap between men and women can be accounted for by gender differences in all of the control variables.

b. how much of the wage gap between men and women can be accounted for by gender differences in

i. the level of education

ii. union membership

Note: While there are canned routines available for doing Oaxaca-Blinder decompositions in Stata, I want you to learn how to use the matrix programming features. So you are required to use matrix (not canned routines) for this problem.

After a regression, you can import the coefficient estimates into a matrix (call it beta1), the variance-covariance matrix into v1, and the matrix of means for a list of variables as follows:

regress hourwage age female;

matrix beta1=get(_b); *puts coefficients in a row vector;

matrix v1=get(VCE); *gets variance-covariance matrix for beta1;

matrix accum xx=age female, means(xbar); *puts means of age & female in a matrix called xbar;

(Note: xbar will automatically include a 1, the mean of the constant, in its last column.)

You can create a vector of means for females only with

matrix accum xx=age female if female==1, means(xbar2);

You can use matrix commands to manipulate the matrices. For example, to create the predicted mean at xbar,

matrix ybar=xbar*beta1';

You can also extract subvectors of a matrix. For example,

matrix xbarj=xbar[1,2];

creates a matrix containing the element in the first row and second column of xbar. Alternatively,

matrix xbarfem=xbar[1,"female"];

creates a matrix containing the element in the column of xbar named female.
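
The matrix steps above can be run end to end on simulated data, as in the following sketch. The data are made up. The sketch uses e(b) and e(V), which return the same matrices as get(_b) and get(VCE), and it is written with the default line delimiter rather than semicolons.

```stata
* Simulated data standing in for the CPS variables
clear
set seed 8
set obs 200
gen female   = mod(_n, 2)
gen age      = 20 + int(40*runiform())
gen hourwage = 5 + 0.3*age - 2*female + rnormal(0, 3)

regress hourwage age female
matrix beta1 = e(b)          // 1 x 3 row vector: age, female, _cons
matrix v1    = e(V)          // 3 x 3 variance-covariance matrix

* Row vectors of means (the last column is the constant's 1)
matrix accum xx  = age female, means(xbar)
matrix accum xx2 = age female if female == 1, means(xbar2)

* Predicted mean at xbar; columns of xbar line up with beta1
matrix ybar = xbar*beta1'
matrix list ybar

* Extract one element by column name
matrix xbarfem = xbar[1, "female"]
matrix list xbarfem

* Compare with the sample mean of the dependent variable
summarize hourwage
display ybar[1,1] "  " r(mean)
```

The product xbar*beta1' is conformable only when the regression and the matrix accum call list the same variables in the same order.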

For other matrix commands, see the chapter on matrix programming in the Stata manual or go to

[1] The downloadable program “outreg2” is very useful for generating tables and t-stats. It requires a little effort now, but saves you a lot of work in the future. To download a third-party program, inside Stata go to “Help”, then “Search”, select “Search all”, and type in the name of the program you’re interested in. You will see a link that allows you to install the software along with a help file. To get help with the program, type “help outreg2” into the command window.