Abstract Title Page

Title:

Understanding Finite Sample Bias from Instrumental Variables Analysis in Randomized Trials

Authors:

Howard S. Bloom, Ph.D., MDRC

Pei Zhu, Ph.D., MDRC

Fatih Unlu, Ph.D., Abt Associates Inc.


Abstract Body

This paper assesses the promise and potential problems of using instrumental variables analysis in randomized trials to obtain internally valid (unbiased or consistent) and reasonably precise (powerful) estimates of causal relationships between characteristics of settings (mediators) and individual-level outcomes. It unpacks the statistical problems with instrumental variables analysis in ways that are accessible to a broad range of applied researchers. In doing so, the paper focuses on one key aspect of these problems (finite sample bias) and develops an understanding of it based on "first principles." It then uses this deeper understanding to develop intuition about how to assess and address the problem in practice.

Why should we care about causal relationships between mediators and individual outcomes?

In the past six years, education research has taken a quantum leap forward, based on a large and growing number of high-quality randomized field trials and regression discontinuity studies of the effects of educational interventions.[1] Most of this new research, and the existing methodologies for conducting it, focus on the response of student academic outcomes to specific educational interventions.[2] Such information is invaluable and can provide a solid foundation for accumulating much-needed knowledge. However, this information only indicates how well specific interventions (which comprise complex bundles of features) work for specific students in specific settings. Therefore, by itself, the information is not sufficient to ascertain "What works best for whom, when, and why?" And it is this more comprehensive understanding of educational interventions that is needed to guide future policy and practice.

In other words, it is necessary to "unpack the black boxes" being tested by randomized experiments or high-quality quasi-experiments in order to learn how best to improve the education—and thus the life chances—of students in the U.S., especially those who are economically disadvantaged. This unpacking job comprises learning more about the relative effectiveness of the active ingredients of educational interventions (their mediators) and learning more about the factors that influence the effectiveness of these interventions (their moderators).[3] Now that multi-site randomized experiments and rigorous quasi-experiments have been shown to be feasible for educational research,[4] it is an opportune time to begin to explore these subtler and more complex questions.

Why use instrumental variable analysis?

Instrumental variables analysis originated in the work of P. G. Wright (1928), who developed the approach to study the demand for and supply of flaxseed.[5] Since then, the approach has been used for a wide range of applications.[6] Of particular relevance for the present paper is the use of instrumental variables analysis in the context of multi-site randomized experiments or quasi-experiments to study the effects of mediating variables on final outcomes. Early applications of this approach focused mainly on estimating the effect of receiving an intervention rather than just being assigned to it.[7] More recent extensions have begun to use the approach to explore the causal effects of other mediating factors. For example, data from a randomized trial of subsidies for public housing residents to stimulate movement to lower-poverty neighborhoods were used to study the effects of neighborhood poverty on child outcomes (Kling, Liebman, and Katz, 2007).

These studies use the cross-site pattern of observed intervention effects on key mediators (e.g. neighborhood poverty or family income) and observed intervention effects on key final outcomes (e.g. child development or behavior) to study causal relationships between mediators and final outcomes. To the extent that the effects of an intervention on a mediator are correlated across sites or other subgroups with the effects of the intervention on a final outcome, this provides evidence of a causal relationship between the mediator and the final outcome.

Instrumental variables analysis has its own limitations and requires its own assumptions. Nevertheless, evidence derived from this method is likely to be stronger than that provided by more traditional methods based on correlation or regression analysis, which are almost always subject to some combination of "attenuation bias" due to measurement error (e.g., Director, 1979), "omitted variables bias" due to unmeasured factors, and "simultaneity bias" due to reciprocal causality. Hence, there are good reasons to believe that the newly evolving approach might offer a more promising way to unpack the black boxes represented by complex educational initiatives.

What did we study and learn?

Even though instrumental variable analysis is gaining popularity and may have the potential for a broad range of social science applications, it is subject to some important statistical problems that are not widely understood.

One such problem is "finite sample bias," which, as recent research has demonstrated, can distort findings even from exceptionally large samples (Bound, Jaeger, and Baker, 1995). Unfortunately, the existing literature on this problem is highly technical and accessible mainly to econometricians and statisticians, even though the approach is potentially most valuable to applied social scientists. It is with this in mind that we have attempted to "unpack" the problem in ways that promote a broader intuitive understanding of what produces it, how to assess its magnitude and consequences, and thus how to decide when to use the new approach.[8] In doing so, we derive finite sample bias from basic principles and construct derivations that facilitate a practical understanding of how to assess and address this problem.

It is important to note that some (but not all) of our results have already been established in the extant literature. Hence, we focus on the intuition involved in these results and we aim to make this intuition available to the broadest possible audience of applied researchers.

We start with the simple case of a single mediator and a single instrument. Specifically, we consider a series of relationships among a treatment indicator, $T_i$, a mediator, $M_i$, and an outcome, $Y_i$, with treatment status randomly assigned to individual sample members, $i$. The relationships of interest are: (i) the causal effect of treatment on the mediator, $\pi$; (ii) the causal relationship between the mediator and the outcome, $\beta$; and (iii) the cross-sectional relationship between the mediator and the outcome, $\delta$. Note that this final parameter reflects a combination of the causal relationship and extraneous factors such as attenuation bias, omitted variables bias, and simultaneity bias. It is well known (but not fully appreciated) that a conventional ordinary least squares (OLS) regression of the outcome on the mediator yields a biased estimate of the causal relationship; this bias, often called "OLS bias," can be characterized as the difference between $\delta$ and $\beta$ (for example, see Angrist and Krueger, 2001). Instrumental variables analysis is used to estimate the causal effect of the mediator on the outcome via two-stage least squares (TSLS) from the following model:

First Stage: $M_i = \alpha_1 + \pi T_i + \varepsilon_i$ (1)

Second Stage: $Y_i = \alpha_2 + \beta \hat{M}_i + \eta_i$ (2)

where $\varepsilon_i$ and $\eta_i$ are random error terms and $\hat{M}_i$ is the predicted mediator, constructed using the OLS parameter estimates ($\hat{\alpha}_1$, $\hat{\pi}$) from the first-stage equation ($\hat{M}_i = \hat{\alpha}_1 + \hat{\pi} T_i$). $\hat{\beta}_{TSLS}$ is the TSLS estimator of the effect of the mediator on the outcome.
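
As a concrete illustration of equations (1) and (2), the following sketch simulates a single randomized trial with an unobserved confounder and computes the TSLS estimate of $\beta$ by hand, alongside the biased OLS estimate. The simulation design, parameter values, and variable names are our own illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the TSLS procedure in equations (1) and (2); illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 2000                                   # sample size (assumed)
T = rng.integers(0, 2, n)                  # randomized treatment indicator
U = rng.normal(size=n)                     # unobserved confounder
pi, beta = 0.5, 1.0                        # true first-stage and mediator effects (assumed)

M = 1.0 + pi * T + U + rng.normal(size=n)            # first stage: mediator
Y = 2.0 + beta * M + 2.0 * U + rng.normal(size=n)    # outcome, confounded by U

def ols(x, y):
    """Intercept and slope from an OLS regression of y on x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

_, beta_ols = ols(M, Y)                    # naive OLS: picks up the confounder U
a1_hat, pi_hat = ols(T, M)                 # first-stage estimates of alpha_1 and pi
M_hat = a1_hat + pi_hat * T                # predicted mediator
_, beta_tsls = ols(M_hat, Y)               # second stage: TSLS estimate of beta

print(f"OLS estimate of beta:  {beta_ols:.3f}")
print(f"TSLS estimate of beta: {beta_tsls:.3f} (true value {beta})")
```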

Notice that $\hat{\beta}_{TSLS}$ is a function of $\hat{\pi}$, the estimated effect of treatment on the mediator. $\hat{\pi}$ can be represented as a combination of its true value, $\pi$, and the first-stage estimation error, $e$, which reflects the imperfect randomization in a finite sample (i.e., a treatment and comparison group mismatch). Therefore the variation in $\hat{M}_i$ has two parts: (1) the part induced by the true treatment effect on the mediator, $\pi$, which we call "treatment-induced variation (tiv)"; and (2) the part driven by the estimation error, $e$, which we call "error-induced variation (eiv)." Using this framework, we show that the error-induced variation in the predicted mediator causes the finite sample bias in the estimate of $\beta$. We also demonstrate that this finite sample bias ($B_{FS}$) is proportional to OLS bias ($B_{OLS}$). More specifically, we show that the ratio of $B_{FS}$ to $B_{OLS}$ is approximately equal to the ratio of the expected value of the error-induced variation ($E[\mathrm{eiv}]$) to the expected value of the total variation in $\hat{M}_i$ ($E[\mathrm{eiv}] + E[\mathrm{tiv}]$).
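
In this notation, the decomposition and the resulting bias relationship described above can be summarized compactly as:

$$
\hat{\pi} = \pi + e, \qquad \hat{M}_i = \hat{\alpha}_1 + \hat{\pi} T_i, \qquad \frac{B_{FS}}{B_{OLS}} \approx \frac{E[\mathrm{eiv}]}{E[\mathrm{eiv}] + E[\mathrm{tiv}]}
$$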

We then examine the use of the population F statistic for the first-stage regression ($F^*$) to measure the strength of an instrument for a given application in order to assess its finite sample bias (Bound, Jaeger, and Baker, 1995). We show that $F^*$ is approximately equal to the ratio of the expected value of the total variation in $\hat{M}_i$ ($E[\mathrm{eiv}] + E[\mathrm{tiv}]$) to the expected value of the error-induced variation ($E[\mathrm{eiv}]$). We also illustrate that the ratio of $B_{FS}$ to $B_{OLS}$ can be characterized (in approximation) as the inverse of $F^*$. This conclusion is consistent with findings reported in the econometrics literature (for a review, see Hahn and Hausman, 2003).
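
Putting the two approximations together yields a simple rule of thumb; for example, a population first-stage F statistic of 10 implies that the TSLS estimate retains roughly one-tenth of the OLS bias:

$$
F^{*} \approx \frac{E[\mathrm{eiv}] + E[\mathrm{tiv}]}{E[\mathrm{eiv}]} \quad\Longrightarrow\quad \frac{B_{FS}}{B_{OLS}} \approx \frac{1}{F^{*}}, \qquad \text{e.g., } F^{*} = 10 \;\Rightarrow\; B_{FS} \approx 0.1\, B_{OLS}
$$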

Next, we analyze an extension of the single instrument and single mediator case, namely the use of multiple instruments with a single mediator: a situation that researchers often face in a multi-site experimental setting. Specifically, we examine what happens to the expected value of the TSLS estimator and to the corresponding first-stage F statistic when multiple instruments—created by interacting treatment status with site (or some other strata) indicators—are used. We first focus on the special case of a constant treatment effect on the mediator across sites. We demonstrate that, as in the single instrument and single mediator case, the finite sample bias $B_{FS,K}$ (where K stands for the number of sites from now on) is a proportion of $B_{OLS}$, and that this proportion is approximately the ratio of the expected value of the error-induced variation to the expected value of the total variation in $\hat{M}_i$.[9]

Second, we examine the use of multiple instruments for a single mediator when the first-stage impact varies (i.e., the effect of the treatment on the mediator varies by site). In particular, we consider the case of a constant difference ($\Delta$) among the K site-specific first-stage impacts. That is, if sites are ordered by the size of their first-stage impacts, with site 1 having the smallest impact ($\pi_1$) and site K the largest ($\pi_K$), the first-stage impact in the kth site (k = 1, 2, ..., K) is characterized by $\pi_k = \pi_1 + (k-1)\Delta$. We derive corresponding expressions for the finite sample bias ($B_{FS,K}$) and the first-stage F statistic ($F^*_K$) in terms of $\pi_1$ and $\Delta$. We conclude that the ratio of $B_{FS,K}$ to $B_{OLS}$ is again (in approximation) the inverse of $F^*_K$. By contrasting these results with those for the single instrument case, we then derive conditions under which multiple instruments reduce or increase finite sample bias relative to a single instrument. Intuitively, this condition implies that, with regard to finite sample bias, the use of multiple instruments is preferable to the use of a single instrument if the increase (due to using multiple instruments) in the error-induced variation is more than offset by the increase in the treatment-induced variation.
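
The construction of multiple instruments from a multi-site trial can be sketched as follows: treatment status is interacted with site indicators, the first stage includes site main effects and the K interactions, and the second stage regresses the outcome on the predicted mediator and the site effects. The data-generating process, parameter values, and variable names below are illustrative assumptions only.

```python
# Sketch of TSLS with K instruments formed by interacting treatment with site dummies.
import numpy as np

rng = np.random.default_rng(2)
K, n_per_site = 10, 200
site = np.repeat(np.arange(K), n_per_site)          # site id for each student
n = K * n_per_site
T = rng.integers(0, 2, n)                            # treatment randomized within site
U = rng.normal(size=n)                               # unobserved confounder

pi_k = 0.3 + 0.05 * np.arange(K)                     # site-specific first-stage impacts (assumed)
beta = 1.0                                           # true effect of the mediator (assumed)
M = pi_k[site] * T + U + rng.normal(size=n)          # mediator
Y = beta * M + 2.0 * U + rng.normal(size=n)          # outcome, confounded by U

S = (site[:, None] == np.arange(K)).astype(float)    # K site dummies
Z = S * T[:, None]                                   # K instruments: treatment x site
X1 = np.column_stack([S, Z])                         # first stage: site effects + instruments
coef1, *_ = np.linalg.lstsq(X1, M, rcond=None)
M_hat = X1 @ coef1                                   # predicted mediator

X2 = np.column_stack([S, M_hat])                     # second stage: site effects + predicted mediator
coef2, *_ = np.linalg.lstsq(X2, Y, rcond=None)
print(f"TSLS estimate of beta: {coef2[-1]:.3f} (true value {beta})")
```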

For each of the aforementioned scenarios, this paper also considers the finite sample bias problem in the presence of clustering (e.g., the nested structure of students within classrooms, schools, or districts). We derive the corresponding expressions for $B_{FS}$ in terms of $B_{OLS}$ and $F^*$, as well as for $B_{FS,K}$ in terms of $B_{OLS}$ and $F^*_K$. We observe that these expressions parallel those derived in the absence of clustering. We conclude that, other things being equal, clustering reduces the first-stage F statistic ($F^*$) and thereby increases finite sample bias.
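
One heuristic way to see the direction of this result (our own back-of-the-envelope argument, not the paper's derivation) uses the familiar design effect for cluster-level assignment: with clusters of size $m$ and intraclass correlation $\rho$, the variance of the first-stage impact estimate is inflated by roughly $1 + (m-1)\rho$, which deflates the first-stage F statistic and, by the $1/F^{*}$ approximation, inflates the relative bias by the same factor:

$$
F^{*}_{\mathrm{clustered}} \approx \frac{F^{*}_{\mathrm{unclustered}}}{1 + (m-1)\rho} \quad\Longrightarrow\quad \frac{B_{FS}}{B_{OLS}} \approx \frac{1 + (m-1)\rho}{F^{*}_{\mathrm{unclustered}}}
$$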

To sum up, in this paper we study the finite sample bias inherent in instrumental variables estimators in four cases: a single mediator with a single instrument and a single mediator with multiple instruments, each with and without a clustered data structure. For each of these cases, we show that the treatment-control mismatch that occurs randomly in experimental settings is the source of this bias, because it produces error-induced variation in the predicted mediator. We also demonstrate why and how the F statistic from the first stage of a TSLS estimator is useful for gauging the extent of the bias. In addition, from the bias standpoint, we conclude that the shift from a single instrument to multiple instruments is preferable only when there is enough cross-site variation in the first-stage impacts to offset the increase in error-induced variation with an increase in treatment-induced variation. We also demonstrate that a clustered data structure leads to a smaller F value and hence increases the bias.

What lies ahead?

The present paper represents only the first stage of a more comprehensive project to examine the new approach, its analytical problems, and its application potential. Our future work will attempt to generalize these findings to more realistic (and important) applications: analyses of multiple setting features based on multiple instrumental variables. In addition to complicating the issues discussed in this paper, this more general situation raises several new analytic issues that must be considered closely before proceeding with the approach in future applications.


Appendixes

Appendix A. References

Angrist, J. D., G. Imbens, & D. Rubin (1996). Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association, 91, 444-55.

Angrist, J. D., & A. B. Krueger (2001). Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. Princeton University, Working Paper 455, Industrial Relations Section, August.

Bloom, H. S. (1984). Accounting for No-Shows in Experimental Evaluation Designs. Evaluation Review, 8(2), 225-46.

Bloom, H. S. (2005). Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation.

Bloom, H. S., C. J. Hill, & J. Riccio (2005). Modeling Cross-Site Experimental Differences to Find Out Why Program Effectiveness Varies. In Howard S. Bloom (Ed.), Learning More from Social Experiments: Evolving Analytic Approaches (pp. 37-74). New York: Russell Sage Foundation.

Bound, J., D. A. Jaeger, & R. M. Baker (1995). Problems with Instrumental Variables Estimation when the Correlation between the Instruments and the Endogenous Explanatory Variable is Weak. Journal of the American Statistical Association, 90(430), 443-50.

Cook, T. D. (2001). Sciencephobia: Why Education Researchers Reject Randomized Experiments. Education Next, Fall, 63-68.

Gamse, B. C., H. S. Bloom, J. J. Kemple, & R. T. Jacob (2008). Reading First Impact Study: Interim Report (NCEE 2008-4016). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Greenberg, D. H., R. Meyer, C. Michalopoulos, & M. Wiseman (2003). Explaining Variation in the Effects of Welfare-to-Work Programs. Evaluation Review, 27(4), 359-94.

Haegerich, T. M., & E. Metz (under review). The Social and Character Development Research Program: Development, Goals and Opportunities. Journal of Research in Character Education.

Hahn, J., & J. Hausman (2003). Weak Instruments: Diagnosis and Cures in Empirical Econometrics. American Economic Review, 93, 118-125.

Hoxby, C. M. (2000). Does Competition among Public Schools Benefit Students and Taxpayers? American Economic Review, 90(5), 1209-1238.

Jackson, R., A. McCoy, C. Pistorino, A. Wilkinson, J. Burghardt, M. Clark, C. Ross, P. Schochet, & P. Swank (2007). National Evaluation of Early Reading First: Final Report. U.S. Department of Education, Institute of Education Sciences. Washington, DC: U.S. Government Printing Office.

Jones, S. M., J. L. Brown, & J. L. Aber (2008). Classroom Settings as Targets of Intervention and Research. In M. Shinn and H. Yoshikawa (Eds.), Changing Schools and Community Organizations to Foster Positive Youth Development. New York: Oxford University Press.

Levitt, S. (1997). Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime. The American Economic Review, 87(3), 270-90.

Morgan, M. (1990). The History of Econometric Ideas. Cambridge: Cambridge University Press.

Reiersol, O. (1945). Confluence Analysis by Means of Instrumental Sets of Variables. Arkiv for Matematik, Astronomi och Fysik, 32a(4), 1-119.

Stock, J. H., & F. Trebbi (2003). Who Invented Instrumental Variable Regression? Journal of Economic Perspectives, 17(3), 177-94.

Waldman, M., S. Nicholson, & N. Adilov (2006). Does Television Cause Autism? NBER Working Paper No. 12632.

Wright, P. G. (1928). The Tariff on Animal and Vegetable Oils. New York: Macmillan.


[1] Spybrook (2007) identified 55 randomized studies of a broad range of interventions; Gamse et al. (2008) and Jackson et al. (2007) report on regression discontinuity studies of the federal Reading First and Early Reading First programs.

[2] An important exception involves a series of randomized tests of interventions for improving students' social and emotional outcomes (Jones, Brown, and Aber, 2008; Haegerich and Metz, under review).

[3] Cook (2001) speculates about why, until recently, the education research community strenuously resisted randomized experiments.

[4] Greenberg, Meyer, Michalopoulos, and Wiseman (2003) argue for using multi-site experiments to study moderators of program effectiveness.

[5] Stock and Trebbi (2003) and Angrist and Krueger (2001) discuss the history of instrumental variables analysis. According to Angrist and Krueger (2001, p. 1), "The term 'instrumental variables' originated with Olav Reiersol (1945); Morgan (1990) cites an interview in which Reiersol attributed the term to his teacher, Ragnar Frisch."

[6] Instrumental variables analysis has been used to estimate causal effects in many fields. For example, Levitt (1997) used the approach to study the effects of police on crime, Waldman, Nicholson, and Adilov (2006) used it to study the effects of television viewing on autism, and Hoxby (2000) used it to study the effects of school competition on student achievement.

[7] This issue is often referred to as the problem of "compliance" with treatment in medical research (Angrist, Imbens, and Rubin, 1996) or the problem of "no-shows" and "crossovers" in program evaluation research (Bloom, 1984; Gennetian, Morris, Bos, and Bloom, 2005). See Bloom (1984) and Angrist, Imbens, and Rubin (1996) for early discussions of its application.

[8] The finite sample bias problem manifests itself in two ways: (i) biased point estimates and (ii) incorrect statistical inferences. In this paper, we focus on biased point estimates and have little to say yet about incorrect inferences, which we will explore in later work. Note that finite sample bias is also known as the bias due to "weak instruments" (Bound et al., 1995).

[9] Note that in this special case, the expected value of the treatment-induced variation in $\hat{M}_i$ remains constant as the number of instruments increases from 1 to K, but the expected value of the error-induced variation in $\hat{M}_i$ increases by a factor of K; i.e., $E[\mathrm{tiv}_K] = E[\mathrm{tiv}_1]$ and $E[\mathrm{eiv}_K] = K \cdot E[\mathrm{eiv}_1]$.