MKTG 7825, Spring 2010
Homework #3, Due on 02/26/2010
Stata Exercises
1. OLS Regression on Truncated Data
In this exercise, we will conduct a Monte Carlo study to examine the finite-sample properties of the OLS estimator in the presence of data truncation, and then find a solution to this problem.
Data truncation is a problem commonly encountered in social science research. Consider a hypothetical example: a researcher is interested in the effect of education on the income level of all women of working age. Ideally, the researcher would like to collect data on (1) educational achievement and (2) income level from a representative sample of women of working age. In practice, however, the sample might fail to contain the following two types of women:
- Type I women. Women who expect to earn less than $30K a year (assume that the expectations are rational, i.e., these women have precise estimates of how much they would earn if they chose to work) may choose to be full-time housewives and/or soccer moms. Because of the prohibitive cost of reaching and interviewing these women, their data are not included in the sample.
- Type II women. Women who earn more than $200K a year might be unwilling to disclose how much they really earn. In this case, data from this type of women are also unavailable.
We assume that the researcher has ready access to the third type:
- Type III women. Women who work and earn less than $200K a year.
Notice that if data from Type I, Type II, or both are missing, the researcher ends up with a truncated sample. For example, if both types are missing, the data will contain only working women whose salaries are between $30K and $200K.
1a) Fabricating Data
Simulate the data sets (with both dependent and independent variables) according to the following specification:
where the independent variable is the amount of education (years spent in school) and the dependent variable is the annual income (in $1,000).
- The sample contains 300 observations (before any truncation).
- The education values are integers drawn from the following values: 6, 9, 10, …, 20.
- The error terms are drawn from i.i.d. normal distributions:
Create one data set for each of the four types: (1) all 300 observations are kept; (2) only Type I observations are dropped; (3) only Type II observations are dropped; and (4) both Type I and Type II observations are dropped. Table 1 summarizes how the four types of sample data are denoted.
Table 1. Types of Fabricated Data
(blank) / Type II included / Type II not included
Type I included / NT / RT
Type I not included / LT / BT
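The construction of the four samples can be sketched in a language-neutral way before writing the Stata version. The following Python sketch is illustrative only: the intercept, slope, and error standard deviation (`BETA0`, `BETA1`, `SIGMA`) are hypothetical placeholders rather than the values in the specification above, and drawing education uniformly from the integers 6 to 20 is likewise an assumption. Only the sample size (300) and the truncation points ($30K and $200K) come from the text.

```python
import random

# Hypothetical parameter values: placeholders, NOT the assignment's specification.
BETA0, BETA1, SIGMA = -20.0, 8.0, 40.0
LOWER, UPPER = 30.0, 200.0   # truncation points in $1,000, taken from the text
N_OBS = 300                  # sample size before truncation, taken from the text

def simulate(rng, n=N_OBS):
    """Return a list of (education, income) pairs before any truncation."""
    data = []
    for _ in range(n):
        x = rng.randint(6, 20)  # assumed: integer years of schooling in [6, 20]
        y = BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA)
        data.append((x, y))
    return data

def make_samples(data):
    """Build the four samples of Table 1 from one untruncated draw."""
    return {
        "NT": list(data),                                        # nothing dropped
        "LT": [(x, y) for x, y in data if y >= LOWER],           # Type I dropped
        "RT": [(x, y) for x, y in data if y <= UPPER],           # Type II dropped
        "BT": [(x, y) for x, y in data if LOWER <= y <= UPPER],  # both dropped
    }

rng = random.Random(7825)  # fixed seed so the draw is reproducible
samples = make_samples(simulate(rng))
for name, rows in samples.items():
    print(name, len(rows))
```

Building all four samples from the same untruncated draw makes them directly comparable: any difference between them is due to truncation alone, not to different random draws.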
1b) Summarizing Data
For each of the four samples (NT, RT, LT and BT) created in step 1a), (1) provide summary statistics (e.g., number of obs., mean, std. dev.) of the dependent variable and (2) produce a histogram that shows the distribution of the dependent variable. Note: Don’t just copy and paste Stata output in step 1.
1c) Producing and Summarizing Sampling Distributions
For each of the four sample types, estimate the OLS model 500 times using different draws of the independent variable and the error term. Save all the estimates of the education coefficient.
(1) Draw a histogram for each of the four sampling distributions.
(2) Provide summary statistics for each of the four sampling distributions.
(3) Compare the means of the four sampling distributions to the true value of the education coefficient. Then conduct a t-test to see whether the difference is significant. Discuss intuitively why each mean is larger (or smaller) than the true value.
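The logic of step 1c) (replicate, store the slope, then test the mean of the stored slopes against the true value) can be sketched as follows. This is a Python illustration of the procedure, not the required Stata solution. As before, `BETA0`, `BETA1`, `SIGMA`, and the uniform draw of education from 6 to 20 are hypothetical assumptions; the 300 observations, 500 replications, and the truncation points are from the text.

```python
import math
import random
import statistics

# Hypothetical parameter values: placeholders, NOT the assignment's specification.
BETA0, BETA1, SIGMA = -20.0, 8.0, 40.0
LOWER, UPPER = 30.0, 200.0   # truncation points in $1,000, taken from the text
N_OBS, N_REPS = 300, 500     # taken from the text

# Selection rule on income for each sample type in Table 1.
KEEP = {
    "NT": lambda y: True,
    "LT": lambda y: y >= LOWER,
    "RT": lambda y: y <= UPPER,
    "BT": lambda y: LOWER <= y <= UPPER,
}

def ols_slope(rows):
    """Slope from a simple OLS regression of y on x with an intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in rows)
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

def one_replication(rng):
    """Draw one untruncated data set; return the OLS slope for each sample type."""
    data = []
    for _ in range(N_OBS):
        x = rng.randint(6, 20)  # assumed distribution of education
        data.append((x, BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA)))
    return {name: ols_slope([row for row in data if keep(row[1])])
            for name, keep in KEEP.items()}

def t_statistic(draws, true_value):
    """One-sample t statistic for H0: mean(draws) == true_value."""
    se = statistics.stdev(draws) / math.sqrt(len(draws))
    return (statistics.fmean(draws) - true_value) / se

rng = random.Random(2010)  # fixed seed so the study is reproducible
slopes = {name: [] for name in KEEP}
for _ in range(N_REPS):
    for name, b in one_replication(rng).items():
        slopes[name].append(b)

for name, draws in slopes.items():
    print(f"{name}: mean={statistics.fmean(draws):.3f} "
          f"sd={statistics.stdev(draws):.3f} "
          f"t={t_statistic(draws, BETA1):.2f}")
```

The t statistic here treats the 500 stored slopes as a sample and tests whether their mean equals the true slope, which is the comparison asked for in item (3).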
1d) Solving the Problem
Find a Stata command that addresses the data truncation problem. Then conduct Monte Carlo studies to generate the sampling distributions (again, by estimating the model 500 times), and use the t-test to verify that the problem due to truncation is solved (i.e., the new parameter estimate is unbiased).
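As background for evaluating candidate commands, one standard approach to truncation is maximum likelihood based on the normal density renormalized over the observable range. The expression below is a sketch under the assumptions of a linear model with normal errors and known truncation points; the symbols (x_i, y_i, β, σ, a, b) are notation introduced here, not the assignment's. With a = 30 and b = 200 (in $1,000) it corresponds to the BT sample; letting a go to minus infinity or b to plus infinity gives the one-sided cases.

```latex
\ell(\beta,\sigma)
  = \sum_{i=1}^{n} \left[
      \log \phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)
      - \log \sigma
      - \log\!\left\{
          \Phi\!\left(\frac{b - x_i'\beta}{\sigma}\right)
          - \Phi\!\left(\frac{a - x_i'\beta}{\sigma}\right)
        \right\}
    \right]
```

Here φ and Φ are the standard normal density and distribution function. The last term is what distinguishes this likelihood from the one implied by OLS: it rescales each observation's density by the probability that its income falls inside the observed range.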