Abstract Title Page

Title:

Multiple regression discontinuity design: Implementation issues and empirical examples from education

Authors:

Joseph P. Robinson, PhD, University of Illinois at Urbana-Champaign

Sean F. Reardon, EdD, Stanford University

2009 SREE Conference Abstract Template

Abstract Body

Background/context:

Regression discontinuity (RD) designs for inferring causality in the absence of a randomized experiment have a long history in the social sciences (see Cook, 2008). When the mechanism for selection into the treatment/control condition is known and observable, RD can provide close approximations to randomized experiments—far better than other methods termed quasi-experimental (Cook & Wong, in press). Traditional RD utilizes a discontinuity in the receipt of treatment along one continuous measure (referred to as the rating score or running variable), and estimates the treatment effect as the difference in observed outcomes between the groups falling on either side of the discontinuity. Some recent examples include studies of Reading First (Gamse et al., 2008) and remedial education (Jacob & Lefgren, 2004; Matsudaira, 2008).

However, many education policies rely on more than one rating score to determine program eligibility. For instance, state high school exit exams often impose strict passing scores for both mathematics and reading, forcing some modification to the traditional RD design in order to study program effects (e.g., Martorell, 2005, uses a modification for studying this in Texas). Similarly, rigid cutscores on multiple tests are used for determining services for English learners (see Robinson, 2008) and higher education financial aid programs (Kane, 2003).

Despite some very thoughtful recent work on single regression discontinuity (see the special issue of Journal of Econometrics, 2008), the current literature lacks a thorough examination of issues concerning the study of program effects when multiple thresholds are used to determine eligibility or participation. This paper addresses these issues. In particular, we:

(a)  present the general problem of multiple regression discontinuity and give some examples from educational practice and policy where it is applicable;

(b)  elaborate on the situation of multiple-treatment categories that can arise with multiple RD;

(c)  define the estimands one can obtain from multiple RD;

(d)  discuss strategies employed in the existing literature;

(e)  present different strategies for estimation/modeling, including the study of homogeneous and heterogeneous effects;

(f)  address related issues of power; and

(g)  discuss extension of the multiple RD design to cases of imperfect compliance.

Purpose/objective/research question/focus of study:

The general problem of multiple regression discontinuity and some examples

Unlike traditional RD with one rating score (see Thistlethwaite & Campbell, 1960; Trochim, 1984), multiple RD is feasible when minimum cutscores exist on more than one rating score dimension.[1] We draw empirical examples from the California high school exit exam (CAHSEE) and from the reclassification of English learners. Passing the CAHSEE requires a minimum score on both mathematics and reading (i.e., two rating score variables); and, in many districts in California, English learners must attain passing scores on five assessments to be termed “fluent English proficient.”

Multiple treatment categories that can arise with multiple RD

With the familiar traditional RD design, individuals are either above or below the discontinuity, resulting in one treatment group and one control group. However, multiple treatment categories can arise in a multiple RD context. For instance, though it appears that the CAHSEE has only two conditions (passing and getting a diploma, or failing some content area(s) and not getting a diploma), there are really four treatment categories: passing both tests, passing only the mathematics test, passing only the reading test, and passing neither test. The reason these four categories exist (rather than two) is twofold: (1) schools react to students failing each test by providing remedial services for each test failed, resulting in roughly double remediation for failing both content area assessments; and (2) the psychological effects of failing two tests can be greater than failing one test. The four treatment categories are illustrated (in a general framework) in Figure 1.

Figure 1 presents a stylized example of two rating score variables (X and Y) used to determine treatment status, though the ideas discussed can be generalized to higher dimensions. The minimum cutscores on X and Y are indicated by the solid lines. Students scoring above both thresholds receive treatment T1. Students falling short of the minimum X score but passing the minimum Y score receive T2. Missing the minimum score on both X and Y results in treatment T3. And passing the minimum for X but not Y gives students treatment T4.

(Figure 1 here)
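The four regions of Figure 1 amount to a simple classification rule. As a minimal sketch (assuming both rating scores have been recentered so that each cutscore is zero; the function and variable names are illustrative, not part of the paper):

```python
def treatment_category(x, y, cut_x=0.0, cut_y=0.0):
    """Assign one of the four treatment categories from Figure 1.

    Scores are assumed recentered so each cutscore defaults to zero;
    the labels T1-T4 follow the figure's layout.
    """
    passed_x = x >= cut_x
    passed_y = y >= cut_y
    if passed_x and passed_y:
        return "T1"  # above both thresholds
    if not passed_x and passed_y:
        return "T2"  # short of X, passed Y
    if not passed_x and not passed_y:
        return "T3"  # short of both thresholds
    return "T4"      # passed X, short of Y
```

In the CAHSEE application below, the same rule applies with X and Y read as the mathematics and reading assessments.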

For the CAHSEE, T1 is analogous to passing both mathematics and reading, and therefore passing the CAHSEE. T2 would be passing reading, but failing mathematics; and T4 would be the opposite. T3 would be failing both. In essence, there are four treatment scenarios for the CAHSEE because failing one test, rather than both tests, has different consequences for remediation strategies, and may also have different motivational/psychological effects on students.

For reclassification of English learners (assuming only two thresholds, to make this point), there is only one treatment condition and one control: students are either reclassified (T1) or remain English learners (T2, T3, and T4). Services for English learners do not vary based upon the type of assessment failed: that is, English learners falling short on the ELD writing assessment get the same ancillary services as English learners who did not pass the test of academic English. Of course, one could argue there are psychological effects related to failing more than one threshold for reclassification and deem it necessary to break down the control group into separate groups by T2, T3, and T4, much like we described in the CAHSEE example. This is acceptable; however, we reiterate that in the CAHSEE example, the instructional services students receive differ by the number and type of tests failed, whereas services are equivalent for English learners failing any number and any type of reclassification thresholds.

Therefore, the unique situation of the education policy and instructional services, combined with a priori expectations of passing various thresholds, should guide considerations of modeling the number of treatment and control categories.

Estimands obtained from multiple RD

With traditional (single-)RD, the estimand is the treatment effect at the discontinuity point. However, in many cases (including the CAHSEE example), multiple RD allows estimation of many different estimands. For example, one could estimate the effect of treatment T1 versus treatment T2 (see Figure 1), along discontinuity A. Another possible estimand is T1 versus T4, along discontinuity D. Or, for individuals along the y=x line, we could estimate the effect of treatment T1 (e.g., passing both tests) versus treatment T3 (e.g., passing neither test).

Strategies employed in the existing literature and presentation of different approaches

Broadly, the current study deals with methodological issues related to evaluations of policy and practice using an RD design with multiple rating score variables, with and without perfect compliance to attaining the thresholds. In turn, we provide a framework for modeling homogeneous effects and heterogeneous effects (i.e., exploring if treatment effects differ by which threshold is the binding constraint). In this abstract, however, we limit our discussion to homogeneous effects in the case of perfect compliance.

With perfect compliance, the multiple RD case can be studied as an extension to the single RD. But unlike the single RD case, the researcher has a choice of estimation strategies, which can sometimes lead to different estimands.

Approach #1. One option is to construct a new running variable (M) as the minimum of the rating scores (the R_j's), after (if necessary) standardizing the rating scores and recentering each R_j around its cutscore.

$M_i = \min(R_{1i}, R_{2i}, \ldots, R_{Ni})$,  (1)

where N is the number of rating score variables. Then, construct an indicator variable Z signifying that the minimum score is non-negative (i.e., Z_i = 1 if M_i ≥ 0, and Z_i = 0 otherwise). Using M as the running variable, with its discontinuity at zero, and Z as the treatment indicator, run the RD (see Equation (2)). Here, the outcome is performance P of student i. This approach, taken by Martorell (2005), reduces the multidimensionality of the multiple thresholds to a case of only one threshold, because only the minimum score is the binding constraint. The coefficient on Z is the effect of the treatment, weighted by the proportion of minimum scores contributed by each rating score. For example, with two tests determining treatment status, if test X (and not test Y) is more often the binding constraint, then the estimated effect is more a reflection of the effect of just barely attaining the threshold on X (having already attained Y’s threshold), as shown in Figure 2.

$P_i = \beta_0 + \beta_1 Z_i + f(M_i) + \varepsilon_i$  (2)
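The mechanics of Approach #1 can be sketched in a small simulation: construct M as the minimum of two recentered rating scores, define Z from M, and fit the RD regression of Equation (2). The data-generating process, the true effect of 0.5, and the linear f(M) common to both sides of the cutoff are all illustrative simplifications, not the paper's specification:

```python
import random

def ols(X, y):
    """Least squares via the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(1)
rows, outcomes = [], []
for _ in range(5000):
    r1 = random.gauss(0, 1)        # recentered rating score 1
    r2 = random.gauss(0, 1)        # recentered rating score 2
    m = min(r1, r2)                # Equation (1): new running variable
    z = 1.0 if m >= 0 else 0.0     # indicator: minimum score non-negative
    rows.append([1.0, z, m])       # intercept, Z, linear f(M)
    # simulated outcome with a true treatment effect of 0.5 at the cutoff
    outcomes.append(1.0 + 0.5 * z + 0.3 * m + random.gauss(0, 0.2))

_, effect, _ = ols(rows, outcomes)  # coefficient on Z estimates the effect
```

In practice one would allow more flexible functional forms for f(M) (e.g., separate slopes on each side, or local linear fits within a bandwidth).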

Approach #2. A second approach is to model the discontinuity as a displacement in a multidimensional surface, rather than reducing the problem to a single RD. In this approach, taken by Robinson (2008), a new variable (Z) indicates successful passage of all thresholds. Along with Z, functions of each rating score are included in the estimation model, as in Equation (3). The coefficient on Z is the effect of the treatment, weighted by the density of observations along each threshold. The interpretation of this surface RD is the same as above for the reduced RD case.

$P_i = \beta_0 + \beta_1 Z_i + \sum_{j=1}^{N} f_j(R_{ji}) + \varepsilon_i$  (3)
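Under the same illustrative data-generating process, Approach #2 keeps both rating scores in the model and estimates the displacement of the outcome surface at Z, as in Equation (3). Again, the linear f_j's and all numeric values are a simplification for the sketch:

```python
import random

def ols(X, y):
    """Least squares via the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(2)
rows, outcomes = [], []
for _ in range(5000):
    r1 = random.gauss(0, 1)                 # recentered rating score 1
    r2 = random.gauss(0, 1)                 # recentered rating score 2
    z = 1.0 if min(r1, r2) >= 0 else 0.0    # passed all thresholds
    rows.append([1.0, z, r1, r2])           # Z plus a function of each score
    outcomes.append(1.0 + 0.5 * z + 0.2 * r1 + 0.1 * r2
                    + random.gauss(0, 0.2))

_, effect, _, _ = ols(rows, outcomes)       # coefficient on Z
```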

Approach #3. A third approach is to simply subset the data by individuals attaining one cutscore, and then model the discontinuity along the other threshold. This again reduces the problem to a single RD, but the estimand is different from the reduced RD in Martorell (2005). In this case, and again referring to Figure 1, the estimand is the treatment effect of T1, relative to treatment T2 (or T1, relative to T4).
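Approach #3 can be illustrated with a local difference in means: subset to students who attained the Y cutscore, then compare outcomes just above and just below the X cutscore (the T1 vs. T2 contrast along threshold A in Figure 1). The bandwidth, data-generating process, and true effect of 0.5 are assumptions of the sketch:

```python
import random

random.seed(3)
h = 0.2                      # illustrative bandwidth around the X cutscore
above, below = [], []
for _ in range(20000):
    r1 = random.gauss(0, 1)  # recentered score on X
    r2 = random.gauss(0, 1)  # recentered score on Y
    t1 = r1 >= 0 and r2 >= 0
    p = 1.0 + 0.5 * t1 + 0.2 * r1 + random.gauss(0, 0.2)
    if r2 < 0:
        continue             # subset: keep only students who attained Y
    if 0 <= r1 < h:
        above.append(p)      # just attained the X cutscore too -> T1
    elif -h <= r1 < 0:
        below.append(p)      # just missed X -> T2

# local difference in means along threshold A: effect of T1 relative to T2
effect = sum(above) / len(above) - sum(below) / len(below)
```

With a nonzero slope in the running variable, the raw difference in means carries a small bias of order the bandwidth, so in practice one would use local linear fits rather than simple means.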

Approach #4. A related fourth approach is to utilize all data points, and estimate the RD along one rating variable, incorporating instrumental variables in a “fuzzy” RD design. Similar to the third approach, this approach limits inferences, this time to individuals who would receive the treatment if they had attained the threshold. Looking at Figure 1, we would use a two-step procedure, where the first-stage equation predicts treatment status T1 as a function of X and attaining the threshold marked by the vertical line AC (where AC=1 indicates passing the threshold). The first stage is given in Equation (4).

$T1_i = \gamma_0 + \gamma_1 AC_i + g(X_i) + \nu_i$  (4)

The second-stage equation (Equation (5)) uses the predicted value of T1 to estimate the treatment effect. In principle, and assuming proper functional forms, approach #4 should yield the same estimate as the third approach, though there may be less power to estimate the effect precisely (due to the instrumental variable).

$P_i = \beta_0 + \beta_1 \widehat{T1}_i + f(X_i) + \varepsilon_i$  (5)
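The two-step logic of Approach #4 can be sketched as two-stage least squares: the first stage predicts T1 from the AC indicator and X (Equation (4)), and the second stage regresses the outcome on predicted T1 and X (Equation (5)). The data-generating process, linear g and f, and true effect of 0.5 are illustrative assumptions:

```python
import random

def ols(X, y):
    """Least squares via the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(4)
first_X, t1s, second_X, outcomes = [], [], [], []
for _ in range(5000):
    x = random.gauss(0, 1)                       # recentered score on X
    y = random.gauss(0, 1)                       # recentered score on Y
    ac = 1.0 if x >= 0 else 0.0                  # attained threshold AC
    t1 = 1.0 if (x >= 0 and y >= 0) else 0.0     # received treatment T1
    first_X.append([1.0, ac, x])
    t1s.append(t1)
    second_X.append([1.0, x])                    # T1-hat inserted below
    outcomes.append(1.0 + 0.5 * t1 + 0.2 * x + random.gauss(0, 0.2))

g = ols(first_X, t1s)                            # Equation (4): first stage
for row_f, row_s in zip(first_X, second_X):
    t1_hat = sum(a * b for a, b in zip(g, row_f))
    row_s.insert(1, t1_hat)                      # regressors: [1, T1-hat, x]

_, effect, _ = ols(second_X, outcomes)           # Equation (5): coef on T1-hat
```

(A production implementation would also correct the second-stage standard errors for the generated regressor, as standard 2SLS software does.)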

Again, the full paper provides more details on each of the above cases, as well as details regarding estimands and estimation strategies for cases of imperfect compliance and the study of heterogeneous effects. Table 1 provides a brief overview of some sample strategies for various estimands, focusing on approaches #1 (reduced RD) and #2 (surface RD).

(Table 1 here)

Issues of power

Because RD bases its effect estimates largely on observations very near the discontinuity, it is preferable to have a high density of observations near the cutoff. The same is true of multiple RD. Figure 1 shows an increasing concentration of observations as we approach the intersection of the two cutscores. This is a reasonably good situation regarding power (though a better situation would concentrate points even more tightly along the thresholds). If, however, the threshold of the Y rating score were shifted (as in Figure 2), the power to detect effects along thresholds B and D would be severely diminished.

(Figure 2 here)
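The density point above can be made concrete with a small simulation: with independent standard normal rating scores (an illustrative assumption), shifting the Y cutscore from the center of the distribution into the tail leaves far fewer observations within a fixed bandwidth of that threshold, and hence far less power along thresholds B and D:

```python
import random

random.seed(5)
# illustrative sample: two independent standard normal rating scores
scores = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10000)]
h = 0.25  # bandwidth around a threshold

def n_near_y_cut(cut_y):
    """Count observations within h of the Y cutscore (drives power along B and D)."""
    return sum(1 for _, y in scores if abs(y - cut_y) < h)

n_centered = n_near_y_cut(0.0)  # cutscore at the center of the distribution
n_shifted = n_near_y_cut(2.0)   # cutscore shifted into the tail, as in Figure 2
```

With these assumptions the shifted cutscore retains only a small fraction of the observations available at the centered cutscore, which translates directly into larger standard errors for effects estimated along that threshold.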

Extensions to the case of imperfect compliance

Much like the traditional RD case, situations arise when the discontinuity does not determine treatment status, but rather predicts treatment status. With traditional RD, Trochim (1984) refers to the determining case as a “sharp” RD and the predicting case as a “fuzzy” RD. Instrumental variables (IV) methods are combined with RD to estimate treatment effects on compliers in a fuzzy RD.

Similarly, we endorse the use of IV in the context of multiple RD. In the paper, we explain how IV can be incorporated in the study of homogeneous effects and heterogeneous effects. The case of heterogeneous effects involves creating additional instruments using interaction terms, but is a fairly straightforward extension of the homogeneous case. We again refer to Table 1 which briefly illustrates how the use of instrumental variables alters the interpretation of the effect estimates, restricting generalizations to compliers (i.e., local average treatment effect on the treated, or LATT).

Two Empirical Education Examples (CAHSEE and Reclassification):

Both the CAHSEE and reclassification studies use longitudinal data on multiple cohorts from several large, urban school districts in California. The CAHSEE analyses contain longitudinal data on more than 70,000 students, and the reclassification dataset contains longitudinal data on more than 20,000 English learners. Here, we provide additional context on each issue, especially as the contexts relate to heterogeneity of effects, which can be explored using multiple RD. For both studies, we employ various multiple RD approaches and compare estimates, as well as explore heterogeneity of effects.

CAHSEE study

When students fail a CAHSEE assessment, school districts typically react by providing the failing student with remedial services/instruction or additional courses following the end of the normal school day. Consider two students: student 1 just barely failed the mathematics assessment and has a high reading score, while student 2 just barely failed the mathematics assessment and has a low reading score. Are the remedial mathematics services equally beneficial to students with low and high reading scores, or is there a differential effect?

Failing the CAHSEE can occur at multiple time points (e.g., spring of 9th grade, fall of 10th grade), but non-remediable failure occurs in spring of 12th grade. Therefore, the duration of the intervention depends on the time elapsed between the date of failure and the date of passing. In most cases, failure results in the school district providing remedial courses in the content area(s) failed. However, the “treatment” of failing the CAHSEE encompasses more than remediation: the effects are partly psychological and de-motivational. We explore effects on future achievement scores and graduation, and find no evidence of an effect.