ANALYSIS OF EXPERIMENTAL DATA
INTRODUCTION
In this section of the course, we will use the classical linear regression model and dummy variables to analyze data that comes from a randomized controlled experiment or a natural experiment. Experimental data is often used for program evaluation. Program evaluation involves estimating the effect of a treatment on an outcome. The treatment is typically a program, policy, or some other intervention. A new field of economics, called experimental economics, specializes in conducting small scale randomized controlled experiments to test implications of economic theory.
Examples
To explain the analysis of experimental data, I will refer to two examples. The first example involves estimating the effect of a job training program on a person’s wage. The goal of the program is to provide the person with better job skills to increase his productivity and wage. The second example involves estimating the effect of a drug on a person’s health.
RANDOMIZED CONTROLLED EXPERIMENT
Most experimental data comes from a randomized controlled experiment. A randomized controlled experiment works like this. A random sample of individuals is taken from a population of interest. These individuals are then randomly assigned to two groups: a treatment group and a control group. The treatment group gets the treatment; the control group does not. The two groups are followed for a certain period of time. At the end of this period, the outcome variable is measured for each individual in the two groups. This data is then used to estimate the causal effect of the treatment.
Job training example. We select a random sample of 200 individuals; 100 are randomly assigned to a treatment group and 100 to a control group. The treatment group is assigned to a job training program; the control group is not. We then follow these individuals for two years. At the end of this period, we collect data on the wages for the 200 individuals.
Drug example. We select a random sample of 200 individuals; 100 are randomly assigned to a treatment group and 100 to a control group. The treatment group is given a drug to take each day; the control group is not. We then follow these individuals for two years. At the end of this period, we collect data on some measure of health for the 200 individuals.
The key feature of a randomized controlled experiment is random assignment to treatment and control groups. We know that a number of variables may affect a person’s wage. These include education, work experience, gender, marital status, and innate ability. If individuals are randomly assigned to treatment and control groups, then the distribution of each of these variables will be the same in both groups. On average, the two groups will have the same education, experience, gender, marital status, innate ability, and all other variables. The only difference between the two groups will be that one gets the treatment and the other doesn’t. The treatment is distributed independently of these other variables that affect the wage. Because the treatment is not correlated with these variables, they cannot be confounding variables. Randomization controls for observable and unobservable confounding variables.
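To see how random assignment balances a characteristic across the two groups, here is a minimal simulation sketch. The education distribution (mean 13 years, standard deviation 2), the sample size of 10,000, and the seed are all assumptions made for illustration; they do not come from any real study.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the sketch is reproducible

# Hypothetical characteristic: years of education for 10,000 individuals.
n = 10_000
education = [random.gauss(13, 2) for _ in range(n)]

# Random assignment: shuffle the individuals, first half treated, second half control.
ids = list(range(n))
random.shuffle(ids)
treated = set(ids[: n // 2])

mean_treat = statistics.mean(e for i, e in enumerate(education) if i in treated)
mean_ctrl = statistics.mean(e for i, e in enumerate(education) if i not in treated)
gap = mean_treat - mean_ctrl

print(f"treatment mean: {mean_treat:.2f}  control mean: {mean_ctrl:.2f}  gap: {gap:.3f}")
```

The gap in average education between the two groups should be close to zero, and it is the result of random sampling error only. The same logic applies to variables we cannot observe, such as innate ability.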
SINGLE TREATMENT STATISTICAL MODEL
For this type of randomized controlled experiment, the statistical model that describes the data generation process is a simple classical linear regression model. This is identical to a one-way analysis of variance model.
Yi = β0 + β1Xi + μi for i = 1,…, 200
Yi is the person’s wage at the end of the two-year period. If only one treatment is given to the treatment group, then Xi is a dummy variable: Xi = 1 if the ith person gets the treatment, Xi = 0 if the ith person does not get the treatment. μi is the error term. It contains all of the variables that affect the wage other than the treatment. While these variables affect the wage, they are not correlated with the treatment, so they are not confounding variables. As a result, μi is not correlated with Xi. The OLS estimator will produce an unbiased estimate of β1. This is an estimate of the causal effect of the treatment on the wage. The causal effect is the difference in the expected value of the wage for the two groups. This can be shown as follows.
E(Y|X=0) = β0. This is the average wage for the control group.
E(Y|X=1) = β0 + β1. This is the average wage for the treatment group.
β1 is the difference between the average wage of the treatment and control groups.
Estimation
To estimate the causal effect, we use the OLS estimator. The OLS estimate of β0 is just the sample mean wage for the control group. The OLS estimate of β0 + β1 is the sample mean wage for the treatment group. The estimate of β1 is the difference in sample means. Because of this, the OLS estimator of β1 is called the difference estimator.
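The equivalence between the OLS estimates and the group sample means can be checked numerically. The sketch below simulates wages under assumed values (β0 = 20, β1 = 3, error standard deviation 5); these numbers are made up for illustration.

```python
import random
import statistics

random.seed(1)  # arbitrary seed

# Hypothetical data generation process: Y = 20 + 3*X + error.
beta0, beta1 = 20.0, 3.0
x = [1] * 100 + [0] * 100  # 100 treated, 100 control
y = [beta0 + beta1 * xi + random.gauss(0, 5) for xi in x]

# OLS formulas for a simple regression.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sample mean wage in each group.
mean_treat = statistics.mean(yi for xi, yi in zip(x, y) if xi == 1)
mean_ctrl = statistics.mean(yi for xi, yi in zip(x, y) if xi == 0)

print(f"b0 = {b0:.4f}   control mean = {mean_ctrl:.4f}")
print(f"b1 = {b1:.4f}   difference in means = {mean_treat - mean_ctrl:.4f}")
```

The two printed pairs agree up to rounding error, which is why the OLS estimator of β1 with a single treatment dummy is the difference estimator.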
The OLS estimate of the causal effect is an average causal effect. It is the average of the causal effects of all individuals that get the treatment. If we find evidence of an effect for the treatment, then we would conclude that the treatment has an effect on an average or typical individual. It is possible that all individuals have the same causal effect. In this case, we say that the population is homogeneous. However, it is more likely that some individuals have a larger causal effect, others a smaller causal effect, and yet others no causal effect, while the average of these effects is still positive. In this case, we say that the population is heterogeneous. For example, a drug may have a bigger effect on females and a smaller effect on males that get the treatment. It may have a bigger effect on individuals with an unhealthy lifestyle and a smaller effect on those with a healthy lifestyle.
Hypothesis Testing
The hypothesis of no causal effect is: H0: β1 = 0. This can be tested with a t-test.
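As a sketch of how this t-test can be computed, the code below calculates the t-statistic for β1 using the classical (homoskedastic) OLS standard error and confirms that it matches the pooled-variance two-sample t-statistic. The simulated wages (β0 = 20, β1 = 3, error standard deviation 5) are assumptions for illustration only.

```python
import math
import random
import statistics

random.seed(2)  # arbitrary seed

# Hypothetical data: 100 treated, 100 control.
x = [1] * 100 + [0] * 100
y = [20 + 3 * xi + random.gauss(0, 5) for xi in x]
n = len(y)

# OLS estimates.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Classical standard error of b1 and the t-statistic for H0: beta1 = 0.
ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s2 = ssr / (n - 2)  # estimate of the error variance
se_b1 = math.sqrt(s2 / sxx)
t_stat = b1 / se_b1

# Pooled-variance two-sample t-statistic, computed directly from the groups.
treat = [yi for xi, yi in zip(x, y) if xi == 1]
ctrl = [yi for xi, yi in zip(x, y) if xi == 0]
sp2 = ((len(treat) - 1) * statistics.variance(treat)
       + (len(ctrl) - 1) * statistics.variance(ctrl)) / (n - 2)
t_pooled = (statistics.mean(treat) - statistics.mean(ctrl)) / math.sqrt(
    sp2 * (1 / len(treat) + 1 / len(ctrl)))

print(f"t from OLS: {t_stat:.4f}   pooled two-sample t: {t_pooled:.4f}")
```

With 198 degrees of freedom, the two-sided 5% critical value is roughly 1.97, so we would reject H0 when the absolute value of the t-statistic exceeds that value.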
SINGLE TREATMENT STATISTICAL MODEL WITH HETEROGENEITY
If you have a heterogeneous population and you know the source(s) of heterogeneity, then you can specify a statistical model that accounts for the heterogeneity. For example, suppose that you believe the causal effect of a treatment depends on gender (G). We specify a multiple classical linear regression model that includes an interaction term for treatment and gender. Let Gi = 1 if the ith person is male; Gi = 0 if female.
Yi = β0 + β1Xi + β2XiGi + μi for i = 1,…, 200
The causal effects are,
E(Y|X=0) = β0. This is the average wage for the control group.
E(Y|X=1, G=0) = β0 + β1. This is the average wage for females in the treatment group.
E(Y|X=1, G=1) = β0 + β1 + β2. This is the average wage for males in the treatment group.
β1 is the difference between the average wage of the female treatment and control groups.
β1 + β2 is the difference between the average wage of the male treatment and control groups.
β2 is the difference between the average wage of male and female treatment groups.
Estimation
To estimate the causal effect, we use the OLS estimator. This is the same as the sample means for the control, female treatment, and male treatment groups.
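The following sketch checks this claim on simulated data. The coefficient values (β0 = 20, β1 = 2, β2 = 1.5), the alternating gender pattern, and the helper function ols are all assumptions introduced for illustration. The helper solves the normal equations directly so that the example needs only the standard library.

```python
import random
import statistics

random.seed(4)  # arbitrary seed


def ols(rows, y):
    """Return OLS coefficients by solving the normal equations (X'X)b = X'y."""
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        pivot = a[c][c]
        a[c] = [v / pivot for v in a[c]]
        for r in range(k):
            if r != c:
                factor = a[r][c]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[c])]
    return [a[i][k] for i in range(k)]


n = 200
x = [1] * 100 + [0] * 100      # treatment dummy
g = [i % 2 for i in range(n)]  # gender dummy: 1 = male, 0 = female
y = [20 + 2.0 * xi + 1.5 * xi * gi + random.gauss(0, 5) for xi, gi in zip(x, g)]

# Regressors: constant, X, and the interaction X*G.
rows = [[1, xi, xi * gi] for xi, gi in zip(x, g)]
b0, b1, b2 = ols(rows, y)

mean_ctrl = statistics.mean(yi for yi, xi in zip(y, x) if xi == 0)
mean_treat_f = statistics.mean(
    yi for yi, xi, gi in zip(y, x, g) if xi == 1 and gi == 0)
mean_treat_m = statistics.mean(
    yi for yi, xi, gi in zip(y, x, g) if xi == 1 and gi == 1)

print(f"b0           = {b0:.4f}   control mean        = {mean_ctrl:.4f}")
print(f"b0 + b1      = {b0 + b1:.4f}   treated female mean = {mean_treat_f:.4f}")
print(f"b0 + b1 + b2 = {b0 + b1 + b2:.4f}   treated male mean   = {mean_treat_m:.4f}")
```

The model has three coefficients and three groups, so the fitted values reproduce the three group means exactly.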
Hypothesis Testing
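We can test four hypotheses. No causal effect for either treated females or treated males, H0: β1 = 0 and β2 = 0, is a joint hypothesis tested with an F-test. No causal effect for treated females, H0: β1 = 0, is tested with a t-test. No causal effect for treated males, H0: β1 + β2 = 0, is a single linear restriction on two coefficients and can be tested with an F-test. No difference in the causal effect between treated males and females, H0: β2 = 0, is tested with a t-test.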
MULTIPLE TREATMENT STATISTICAL MODEL
Suppose that individuals in the treatment group get three different amounts of the treatment. For example, individuals in the treatment group get one month, two months, or three months of job training. If we are willing to assume that the treatment has a linear effect on the wage, we can specify the following simple classical linear regression model.
Yi = β0 + β1Xi + μi
where X can take a value of 0, 1, 2, or 3. Individuals in the control group get a value of 0. Individuals in the treatment group get a value of 1, 2, or 3. The causal effects are:
E(Y|X=0) = β0. This is the average wage for the control group.
E(Y|X=1) = β0 + β1. This is the average wage for the treatment group with one month of job training.
E(Y|X=2) = β0 + β1(2). This is the average wage for the treatment group with two months of job training.
E(Y|X=3) = β0 + β1(3). This is the average wage for the treatment group with three months of job training.
To estimate the causal effects, we use the OLS estimator. The hypothesis of no causal effect is tested with a t-test. If we don’t want to impose a functional form on the treatment effect, we can specify dummy variables for four groups: control, one-month treatment, two-month treatment, three-month treatment. We omit the dummy variable for the control group to avoid perfect multicollinearity and include the dummy variables for the three treatment groups as explanatory variables. We specify the following multiple classical linear regression model,
Yi = β0 + β1X1i + β2X2i + β3X3i + μi
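where X1 is the dummy variable for one-month treatment, X2 is the dummy variable for two-month treatment, and X3 is the dummy variable for three-month treatment. We use the OLS estimator. Each coefficient is the difference between the average wage of that treatment group and the average wage of the control group. The hypothesis of no causal effect for a single treatment group is tested with a t-test, and the hypothesis of no causal effect for all three treatment groups is tested with an F-test.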
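A minimal sketch of the dummy variable model on simulated data follows. The group sizes (50 each), the treatment effects (1, 3, and 3.5 for one, two, and three months), and the helper function ols are assumptions for illustration. The effects were chosen to be nonlinear in months so that the dummy specification matters.

```python
import random
import statistics

random.seed(6)  # arbitrary seed


def ols(rows, y):
    """Return OLS coefficients by solving the normal equations (X'X)b = X'y."""
    k = len(rows[0])
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        pivot = a[c][c]
        a[c] = [v / pivot for v in a[c]]
        for r in range(k):
            if r != c:
                factor = a[r][c]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[c])]
    return [a[i][k] for i in range(k)]


# Months of job training: 0 = control, then 1, 2, or 3 months (50 people each).
months = [0] * 50 + [1] * 50 + [2] * 50 + [3] * 50
effect = {0: 0.0, 1: 1.0, 2: 3.0, 3: 3.5}  # hypothetical, not linear in months
y = [20 + effect[m] + random.gauss(0, 5) for m in months]

# Constant plus one dummy per treatment group; the control dummy is omitted.
rows = [[1, int(m == 1), int(m == 2), int(m == 3)] for m in months]
b = ols(rows, y)

means = {m: statistics.mean(yi for yi, mi in zip(y, months) if mi == m)
         for m in range(4)}

for j in (1, 2, 3):
    print(f"b{j} = {b[j]:.4f}   mean({j} month) - mean(control) = "
          f"{means[j] - means[0]:.4f}")
```

Each coefficient equals the corresponding difference in sample means, so no functional form is imposed on how the effect varies with months of training.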
POTENTIAL PROBLEMS WITH RANDOMIZED CONTROLLED EXPERIMENTS
Several potential problems may arise when doing a randomized controlled experiment. These problems result in a different data generation process than a true randomized controlled experiment, and therefore the models we have presented above may not accurately describe this process. This may result in a biased estimate of the causal effect of the treatment. Other potential problems may result in an imprecise estimate of the treatment effect.
Potential Bias
These problems may result in a biased estimate of the causal effect for the population studied. If so, the empirical study is not internally valid. Later in this class, we will learn about estimators that can be used to decrease or eliminate this bias. These are called the difference in differences estimator and the instrumental variables estimator.
Non-Random Assignment
A key feature of a randomized controlled experiment is random assignment. This controls for confounding variables and reverse causation so that the error term is not correlated with the treatment. Nonrandom assignment may result in failure to control for one or more confounding variables, and therefore a biased estimate of the causal effect. For example, suppose that individuals are assigned to treatment and control groups by education. Individuals assigned to the treatment group have a high school education or less, while those assigned to the control group have more than a high school education. The treatment is now correlated with education. If education has an effect on the wage, then the estimate of the causal effect of the treatment will also include the effect of education and be biased. Said another way, the error term is correlated with the treatment. Individuals with more education have higher wages, and therefore bigger errors. Individuals with more education are assigned to the control group and don’t get the treatment (X=0). As a result, the error term has a negative correlation with the treatment and we would expect the OLS estimator to be biased downward. The estimate of the effect of the treatment is too small because those being treated have less education than those not treated.
As a second example, suppose that individuals are assigned to treatment and control groups by the first letter in their last name. Individuals with the letters A through M are assigned to the treatment group. Those with the letters N through Z are assigned to the control group. Suppose that as a result, the two groups differ by ethnicity. Suppose that ethnic groups differ in their work experience, innate ability, and other characteristics that affect the wage. Once again, the error term will be correlated with the treatment, which will result in a biased estimate of the causal effect.
As a third example, suppose that the job training experiment includes working individuals who currently have jobs. Suppose that workers with lower wages are assigned to the treatment group and those with higher wages to the control group. Now, wage has an effect on treatment and there is reverse causation. Workers with bigger errors have unobserved characteristics that result in higher wages. But workers with higher wages are more likely to be assigned to the control group (X=0). As a result, the error term is negatively correlated with treatment, and the OLS estimate of the causal effect will be biased downward.
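The education example can be illustrated with a small simulation. Everything in this sketch is hypothetical: the wage equation, the true treatment effect of 3, the education distribution, and the cutoff of 12 years for a high school education. It compares the difference estimator under random assignment with the same estimator when assignment is by education.

```python
import random
import statistics

random.seed(3)  # arbitrary seed

n = 10_000
true_effect = 3.0
education = [random.gauss(13, 2) for _ in range(n)]  # hypothetical years of schooling


def wage(educ, x):
    """Hypothetical wage equation: education raises the wage by 1 per year."""
    return 5 + 1.0 * educ + true_effect * x + random.gauss(0, 2)


def diff_in_means(y, x):
    """Difference estimator: treated sample mean minus control sample mean."""
    treat = [yi for yi, xi in zip(y, x) if xi == 1]
    ctrl = [yi for yi, xi in zip(y, x) if xi == 0]
    return statistics.mean(treat) - statistics.mean(ctrl)


# Case 1: random assignment.
x_random = [random.randint(0, 1) for _ in range(n)]
y_random = [wage(e, x) for e, x in zip(education, x_random)]
est_random = diff_in_means(y_random, x_random)

# Case 2: assignment by education (12 years or less gets the treatment).
x_by_educ = [1 if e <= 12 else 0 for e in education]
y_by_educ = [wage(e, x) for e, x in zip(education, x_by_educ)]
est_by_educ = diff_in_means(y_by_educ, x_by_educ)

print(f"true effect:                   {true_effect:.2f}")
print(f"estimate, random assignment:   {est_random:.2f}")
print(f"estimate, assigned by education: {est_by_educ:.2f}")
```

Under random assignment the estimate should be close to the true effect. Under assignment by education it is pulled downward, because the control group's higher education raises its average wage.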
Partial Compliance
Suppose that some of the individuals that are randomly assigned to the treatment group don’t comply with the treatment. For example, some do not attend the job training sessions. This is called partial compliance. Suppose that those individuals that comply are those with more motivation. We now have random assignment of individuals but not random assignment of treatment. We have more motivated individuals actually getting the treatment and less motivated individuals not getting the treatment. Suppose that motivation is a factor that affects the wage. The error term is now correlated with the actual treatment. Individuals with more motivation tend to have higher wages and bigger errors. They also are more likely to get the treatment. As a result, the error term is positively correlated with the treatment and we would expect the OLS estimate of the treatment effect to be biased upward. That is, the estimate of the effect of the treatment on the wage will be too high because it also includes the effect of motivation.
There are two possible partial compliance scenarios. 1) The researchers know who doesn’t comply. For example, they know if an individual assigned to the treatment group does not attend job training sessions. 2) The researchers do not know who doesn’t comply. For example, in a drug study they do not know whether an individual assigned to the treatment group actually takes the drug. This results in a measurement error in the treatment actually received, which may bias the OLS estimator.
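The upward bias from partial compliance can be sketched with a simulation of the first scenario, where the researchers know who complied. The compliance rule (only individuals with above-average motivation attend), the wage equation, and the true effect of 3 are assumptions made up for this illustration.

```python
import random
import statistics

random.seed(5)  # arbitrary seed

n = 10_000
true_effect = 3.0

motivation = [random.gauss(0, 1) for _ in range(n)]  # unobserved by the researcher
assigned = [random.randint(0, 1) for _ in range(n)]  # random assignment

# Hypothetical compliance rule: assigned individuals attend only if motivated.
received = [a if m > 0 else 0 for a, m in zip(assigned, motivation)]

# Hypothetical wage equation: motivation raises the wage on its own.
wage = [20 + true_effect * d + 2.0 * m + random.gauss(0, 2)
        for d, m in zip(received, motivation)]

# Difference estimator using the treatment actually received.
treated = [w for w, d in zip(wage, received) if d == 1]
untreated = [w for w, d in zip(wage, received) if d == 0]
estimate = statistics.mean(treated) - statistics.mean(untreated)

print(f"true effect: {true_effect:.2f}   estimate using treatment received: {estimate:.2f}")
```

The estimate exceeds the true effect because those who received the treatment are, by construction, more motivated than those who did not.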
Attrition
Attrition occurs when individuals drop out of the study. If individuals drop out of the study for reasons other than those related to the wage, then the OLS estimator will still be unbiased. However, suppose that individuals in the treatment group with the most ability believe the job training program is a waste of time, and therefore drop out of the study. Economists say that these individuals self-select out of the sample. Now the treatment group will have individuals with lower average ability than the control group. You don’t have a random sample of individuals in both the treatment and control groups; rather, you have a self-selected sample. Because ability affects the wage, the error term will be negatively correlated with treatment, and therefore the OLS estimate of the treatment effect will be biased downward.
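A short simulation sketch of this self-selection story follows. The wage equation, the true effect of 3, and the rule that treated individuals with ability above 1 drop out are all hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(8)  # arbitrary seed

n = 10_000
true_effect = 3.0

ability = [random.gauss(0, 1) for _ in range(n)]  # unobserved by the researcher
x = [random.randint(0, 1) for _ in range(n)]      # random assignment
wage = [20 + true_effect * xi + 2.0 * a + random.gauss(0, 2)
        for xi, a in zip(x, ability)]


def diff_in_means(y, treat_flags):
    """Difference estimator: treated sample mean minus control sample mean."""
    treat = [yi for yi, t in zip(y, treat_flags) if t == 1]
    ctrl = [yi for yi, t in zip(y, treat_flags) if t == 0]
    return statistics.mean(treat) - statistics.mean(ctrl)


# Estimate with everyone still in the study.
est_full = diff_in_means(wage, x)

# Hypothetical attrition: treated individuals with high ability drop out.
stays = [not (xi == 1 and a > 1.0) for xi, a in zip(x, ability)]
wage_left = [w for w, s in zip(wage, stays) if s]
x_left = [xi for xi, s in zip(x, stays) if s]
est_after = diff_in_means(wage_left, x_left)

print(f"true effect: {true_effect:.2f}")
print(f"estimate, no attrition:   {est_full:.2f}")
print(f"estimate, with attrition: {est_after:.2f}")
```

After the high-ability individuals leave the treatment group, the estimate falls below the true effect, which matches the downward bias described above.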
Placebo Effect
The placebo effect occurs when a treatment has a beneficial effect because an individual expects it to, not because it actually does. For example, suppose that individuals in a job training program study know whether they have been assigned to the treatment or control group. Those assigned to the treatment group may change their behavior and try harder to be more productive because they believe that a job training program will increase their productivity and result in a higher wage. If this happens, the job training program does not have an effect on the wage. It is the greater effort put forth that has an effect on the wage. This results in a biased estimate of the causal effect of the program. When possible, the researchers can control for the placebo effect by giving the control group a placebo so these individuals believe they are also getting the treatment. This is typically done in drug studies.
Short-Term vs Long-Term Effects
If individuals in the study are followed for one year, then we get an estimate of the one-year causal effect; if two years, then the two-year causal effect, and so on. The estimate of a short-term effect may be a biased estimate of the long-term effect.
Potential Imprecision
This problem may result in an imprecise estimate of the causal effect, and therefore make it difficult to detect an effect if one exists.
Sample size
Because randomized controlled experiments are expensive, sometimes they are done with a small number of individuals. While this does not result in a biased estimate, it may result in an imprecise estimate. Why? If individuals are randomly assigned, then treatment and control groups will be similar in terms of all characteristics except for the treatment. Any differences in characteristics will be the result of random sampling error. The law of large numbers tells us that the larger the sample size, the smaller the sampling error and the more similar the treatment and control groups. The larger the sample size, the bigger the probability that an observed effect is the result of the treatment and not differences in individual characteristics. Therefore, the larger the sample size, the smaller the standard error of the estimator and the more precise the estimate.
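The link between sample size and precision can be sketched with a small Monte Carlo exercise. For each sample size, the code repeats a hypothetical experiment many times and measures how much the difference estimate varies across repetitions. The error standard deviation (5), the true effect (3), the sample sizes, and the number of repetitions are all assumptions. With equal group sizes and a common error standard deviation σ, the theoretical standard error of the difference estimator is σ times the square root of (1/n1 + 1/n0).

```python
import math
import random
import statistics

random.seed(7)  # arbitrary seed

sigma = 5.0   # hypothetical standard deviation of the error term
beta1 = 3.0   # hypothetical true treatment effect
reps = 300    # simulated experiments per sample size


def one_estimate(n):
    """Run one experiment with n/2 treated and n/2 controls; return the difference estimate."""
    half = n // 2
    treat = [20 + beta1 + random.gauss(0, sigma) for _ in range(half)]
    ctrl = [20 + random.gauss(0, sigma) for _ in range(half)]
    return statistics.fmean(treat) - statistics.fmean(ctrl)


spread = {}
for n in (20, 200, 2000):
    estimates = [one_estimate(n) for _ in range(reps)]
    spread[n] = statistics.stdev(estimates)
    theory = sigma * math.sqrt(2 / (n // 2))  # sigma * sqrt(1/n1 + 1/n0)
    print(f"n = {n:5d}   simulated std. dev. = {spread[n]:.3f}   theoretical = {theory:.3f}")
```

The average estimate is close to the true effect at every sample size, but the spread of the estimates shrinks as the sample grows. That is the sense in which a small sample gives an unbiased but imprecise estimate.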
NATURAL EXPERIMENT
Randomized controlled experiments are rare in economics because they are costly and often unethical to do. However, the methods of a randomized controlled experiment can be applied to a natural experiment. In a natural experiment, it appears as if some or all individuals are randomly assigned a treatment. There are two types of natural experiments.