
Performance Management Systems
and Evaluation:
Towards a Mutually Reinforcing Relationship

Jacob Alex Klerman

Principal Associate

Abt Associates

Abstract

Rigorous impact evaluation is needed to screen out programs and program innovations that are not effective or cost-effective. However, the current strategy for rigorous impact evaluations is very expensive, leading to samples too small to detect the expected small impacts of incremental changes to ongoing programs. This paper argues that an alternative approach to rigorous impact evaluation—one that is better integrated with program operation and that exploits data collected as part of ongoing performance measurement systems—would be much cheaper and could therefore be used to assess the impact of incremental changes. Doing so is crucial since the kaizen approach to manufacturing suggests that big improvements are the cumulative result of many incremental changes.

Key words: Evaluation, Performance Measurement.

Paper prepared for “Improving the Quality of Public Services: A Multinational Conference”; Moscow, June 27-29, 2011.

Address Correspondence to: Jacob Alex Klerman, Abt Associates, 55 Wheeler St., Cambridge, MA 02138-1168; Jacob_Klerman@abtassoc.com


Rigorous, independent program evaluations can be a key resource in determining whether government programs are achieving their intended outcomes as well as possible and at the lowest possible cost. Evaluations can help policymakers and agency managers strengthen the design and operation of programs. Ultimately, evaluations can help the Administration determine how to spend taxpayer dollars effectively and efficiently -- investing more in what works and less in what does not.

Peter Orszag, Director of Office of Management and Budget (2009;
emphasis added)

This paper considers the relationship between program operations, rigorous impact evaluation, and performance measurement systems.[1] We argue that current evaluations usually ask: Does this program work at all? The implicit policy response to a negative answer is to shut the program down. Given that most large social programs serve clear needs—and that the nature of the political process makes it difficult to kill even ineffective programs—a negative evaluation result for a program as a whole is not particularly useful.

In contrast to this question that current evaluations attempt to answer, program operators seek answers to a sequence of more detailed questions:[2] What incremental changes could we make to the program to make it more effective? Which aspects of the program are working? Are some program models for addressing our target population and our target problem working better than others?

To a great extent because of the high costs of survey data collection, the current evaluation strategy is too expensive to support the large samples and continuous exploration of incremental changes needed to address the program operator’s questions. This is unfortunate because the kaizen approach to manufacturing suggests that important improvements in productivity (for social programs, in program impact) arise as the cumulative effect of many incremental changes.

Building on ideas in earlier broad discussions of evaluation practice (e.g., Orr, 1999, 249-256; Coalition for Evidence Based Policy, 2007; Besharov, 2009), this paper sketches an alternative approach to evaluation that would be complementary to the current strategy. Where the current strategy mounts discrete, special-purpose, but expensive, up or down evaluations, often of new programs, the alternative strategy described here is deliberately nearly the mirror image: a sequence of low-cost evaluations of incremental changes to ongoing large programs. Crucially for the feasibility of this incremental approach, the alternative strategy replaces expensive survey-based collection of information on outcomes with outcome data already collected as part of ongoing performance measurement systems.

The balance of this paper develops this idea. The next section restates the conventional argument for rigorous impact evaluation. The second section describes current conventional rigorous impact evaluation strategy. This section argues that—to a great extent because of very high data collection costs for household surveys—this current strategy is so expensive as to be infeasible for studying the incremental changes that kaizen suggests are crucial and which would be useful to program operators and managers. The third section sketches an alternative approach. Among the key insights is that, in many cases, performance measurement information can replace expensive surveys, substantially driving down evaluation costs. Such lower costs are crucial to enabling the larger samples required to detect the small impacts that should be expected from incremental changes. The fourth section applies these ideas to Head Start. The final section summarizes the argument and discusses when it will work, and when it will not.

I. The Need for Rigorous Impact Evaluation

The argument for rigorous evaluation is straightforward and well-understood. Put bluntly, when evaluated rigorously, most programs are shown to be ineffective. One of the leading lights of the evaluation field summarized the results boldly and slightly tongue in cheek (Rossi, 1987):

The Iron Law of Evaluation: The expected value of any net impact assessment of any large scale social program is zero. The Iron Law arises from the experience that few impact assessments of large scale social programs have found that the programs in question had any net impact. The law also means that, based on the evaluation efforts of the last twenty years, the best a priori estimate of the net impact assessment of any program is zero, i.e., that the program will have no effect.

Rossi is rumored to have recanted these laws in his later years, but his rhetoric remains a crucial counterweight to the infectious enthusiasm and optimism of program developers—and also to the optimism of the public policy community, many of whom would so like to believe the claims of program developers.

Recent reviews of the available rigorous impact evaluation literature are not wildly inconsistent with Rossi’s perspective. Thus, nearly a half century after the start of rigorous impact evaluation of social programs, the Coalition for Evidence Based Policy’s “Top Tier Evidence” initiative has identified only eight “top tier” programs, with only another three deemed “near top tier” and another fifteen deemed “promising”. While one can quibble with the Coalition’s standards (GAO, 2009), few disagree with the basic point: despite hundreds of rigorous evaluations over the last three decades, programs that have passed rigorous evaluation are rare.

Furthermore, as Besharov (2009) emphasizes, this lack of evaluated new programs is not the major issue. The major issue is existing, ongoing programs. Most current programs have—at best—unknown effectiveness. Where there is strong evidence on the effectiveness of existing programs, that evidence is often dismal.

Perhaps the leading evidence of this phenomenon is Head Start, a program that we return to in the penultimate section of the paper. Head Start is intended to provide comprehensive education, health, nutrition, and parent involvement services to nearly a million low-income pre-school children, in over 1,000 sites, at a cost of over $7 billion per year. While the program has some impacts while children are enrolled, the random assignment Head Start Impact Study (Puma, et al., 2010) found only minimal impacts by first grade. Thus, while the program provides child care for many low-income children, it fails at its primary goal—to provide substantial, long-run, educational benefits.

But Head Start is not the exception; instead, it is the rule. Sawhill and Baron (2010) count ten major social policies that have been subject to random assignment trials—only one has shown clear evidence of positive impact (Early Head Start). The evaluations of the other nine (including Upward Bound and Job Corps) found no impact.

Beyond finding negative results, current up or down evaluations appear to be ineffective in a policy sense. Sometimes negative evaluation results affect funding. The JTPA Evaluation (Bloom, et al., 1997) appears to have had a major role in the sharp cuts to the Department of Labor’s youth training programs. But this example is notable mostly because such a relation of negative results to cuts in funding is so rare. Despite the negative evaluation results just reviewed—for Head Start, for Upward Bound, and for Job Corps—those programs continue, with large budgets—budgets that could be allocated to programs that do work and thereby contribute to real progress in ameliorating the nation’s social problems.

We are not the first to have noticed this disjunction. In fact, our rhetoric here is quite similar to that of the Obama Administration. In announcing a new “Increased Emphasis on Program Evaluations”, the Director of the Office of Management and Budget, Peter Orszag (2009), wrote: “Although the Federal government has long invested in evaluations, many important programs have never been formally evaluated -- and the evaluations that have been done have not sufficiently shaped Federal budget priorities or agency management practices.”

If most programs do not work—i.e., they are not effective or cost-effective—then it is prudent to evaluate them before implementing them. A similar practice with respect to incremental program changes also seems prudent.

Such testing is hard. Again, quoting Rossi (1987):

The Stainless Steel Law of Evaluation: The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero. This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero—or no effect. Specifically, this law implies that estimating net impacts through randomized controlled experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches.

While the evaluation community has an ongoing debate about exactly which methods are “good enough”[3], the need for high quality and rigorous impact evaluation is not in dispute. This is true for at least three reasons.

  1. Follow-Up: Proper evaluation of impact requires observing outcomes for the entire treated population. Casual methods are often subject to subtle and crucial biases in who is observed. Success stories come back (or report back) to the provider; failures go elsewhere.
  2. Required Sample Sizes: Inter-person variation is very large and plausible program impacts are often modest. Together these stylized facts imply a need for large to very large samples. For up or down evaluations of training programs, required sample sizes are usually calculated at 500 or even 1,000 (see below for some details). Evaluations of incremental changes—with presumably small impacts—would require samples several times that size—often 5,000 or more; a rough power calculation following this list illustrates why. This is more trainees than most sites see in a year. Inasmuch as these power calculations are correct, no casual observations—or even formal evaluations—at any given site can be informative.
  3. Counterfactuals: Programs (might) observe outcomes for those they treat. However, evaluation seeks to know impact; i.e., outcomes for those treated relative to what would have happened to those same people, in that same period, if they had not been treated. Forming that “counterfactual” is challenging. Selection into a program—by the participants and by the program—will usually lead to a presumption that those who are not treated are systematically different from those who are treated. Furthermore, even if that were not true, an appropriate untreated group still needs to be identified and information on their outcomes needs to be collected.
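
To make the sample size point concrete, the sketch below works through a standard normal-approximation sample size calculation for a two-arm comparison. The effect sizes (0.25 and 0.08 standard deviations), the two-sided 0.05 significance level, and the 0.80 power target are illustrative assumptions, not figures taken from any particular evaluation.

    # A minimal power-calculation sketch (illustrative assumptions only) showing
    # why evaluations of incremental changes need much larger samples than
    # up or down evaluations. Effect sizes are in standard-deviation units.
    from scipy.stats import norm

    def n_per_arm(effect_size, alpha=0.05, power=0.80):
        """Normal-approximation sample size per arm for a two-arm comparison."""
        z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 0.05 test
        z_beta = norm.ppf(power)            # 0.84 for 80 percent power
        return 2 * ((z_alpha + z_beta) / effect_size) ** 2

    print(round(n_per_arm(0.25)))  # ~251 per arm: roughly 500 total (up or down test)
    print(round(n_per_arm(0.08)))  # ~2,453 per arm: roughly 5,000 total (incremental change)

Under these assumptions, an up or down test needs on the order of 500 cases in total, while a test of an incremental change needs on the order of 5,000, consistent with the orders of magnitude cited above.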

These challenges imply that rigorous impact evaluation is needed, but that conducting it is going to be a major effort. Casual observation will often lead to the wrong prescription: choosing the wrong programs and incrementally changing them in the wrong ways.

II. Current Evaluation Strategy

Current evaluation practice emerges from the interaction of the only intermittent focus on evaluation and the technical challenges. Current practice involves a cyclical evaluation scheme. Evaluation focus cycles through programs, major and minor, with recent events and with which party controls the presidency (and, to a lesser extent, Congress). As focus lands on a program, the timeline begins. A lead agency is chosen to oversee the evaluation and planning begins. Experts are consulted; perhaps a design contract is awarded. A year or more later, the lead agency issues an RFP for the evaluation; bids are reviewed; a winner is selected; and a second design phase begins. Two to four years after the initial decision to evaluate, site recruitment begins and the first people enter randomization. Depending on the details of the program, site recruitment and randomization will often take one or even two years. Service delivery will often take a year. Outcomes of interest are observed beginning about a year after the end of program delivery. Given concern about fade-out of effects, outcomes three to five years later are often of substantial interest. (Outcomes ten or more years later are also of interest, but initial findings cannot wait for those truly long-run outcomes.) Once outcome data are collected, analysis, report writing, client review, and revision begin. A five-year timeline is nearly the minimum; ten years is closer to the average; fifteen years is not uncommon (Orr, 1999, 240-241).
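
As a rough illustration of how these phases accumulate into five- to ten-year timelines, the sketch below simply sums assumed phase durations. The specific durations are hypothetical, chosen only to be consistent with the ranges just described.

    # Back-of-the-envelope timeline arithmetic. Phase durations (in years) are
    # assumptions consistent with the ranges in the text, not figures from any
    # particular evaluation.
    minimum = {
        "planning, design, and procurement": 2.0,
        "site recruitment and random assignment": 1.0,
        "service delivery": 1.0,
        "follow-up before outcomes are observed": 0.5,
        "analysis, reporting, and review": 0.5,
    }
    typical = {
        "planning, design, and procurement": 3.0,
        "site recruitment and random assignment": 1.5,
        "service delivery": 1.0,
        "follow-up (allowing for fade-out concerns)": 3.0,
        "analysis, reporting, and review": 1.5,
    }
    print(sum(minimum.values()), "years: near the minimum")    # 5.0 years
    print(sum(typical.values()), "years: closer to typical")   # 10.0 years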

Clearly, evaluation timelines are long. But beyond being long, evaluations are also expensive, very expensive. Current conventional practice is to run a survey to collect outcome data on treatments and controls. Locating control cases at the end of the program (i.e., at the end of treatment for the treatment group) is always a challenge. Locating treatment cases who do not complete treatment is also a challenge. Six to twelve months after treatment, locating anyone, even those who completed the treatment, is likely to be a challenge. Given the cost of locating respondents, once we do locate them, we run long surveys of 30 to 45 minutes. As a result, survey cost per case is likely to be several hundred dollars.

Few single sites will have large enough samples. Instead, we will need to recruit multiple sites and then design randomization for each site. Furthermore, site recruitment is going to be hard. Programs exist to serve clients. Program staff do social program work because they want to serve clients. The current grand evaluation strategy involves randomly assigning some clients to “no service”. Programs won’t like that; it runs against their goals and it hurts their reputation among potential future clients (“that program will make you apply and then turn you away.”). Program staff won’t like that; they want to serve clients. Together, these factors make site recruitment expensive and lead to a sample of sites which is not representative (e.g., the JTPA Evaluation; Bloom, et al., 1997).

Together, the components of this grand strategy imply high costs. A thousand or more cases, times several hundred dollars per survey, times several surveys: data collection alone will run several million dollars. Add in funds for site recruitment and for customizing randomization for each site, plus some funds for analysis and write-up, and the cost of a rigorous impact evaluation will usually start at about $5 million; $10 million or higher will not be uncommon.
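
To see how these components compound, here is a minimal cost sketch. Every unit cost and count in it is an assumption chosen to fall within the ranges quoted above, not an actual budget figure.

    # Back-of-the-envelope evaluation cost sketch. All unit costs and counts are
    # assumptions in line with the ranges discussed in the text, not budget
    # figures from any specific evaluation.
    cases = 1_500            # treatment plus control cases across sites
    cost_per_survey = 350    # dollars per completed survey, including locating
    survey_waves = 3         # e.g., baseline plus two follow-ups

    data_collection = cases * cost_per_survey * survey_waves   # ~$1.6 million
    sites = 10
    site_recruitment_and_randomization = sites * 150_000       # assumed per-site cost
    analysis_and_reporting = 1_500_000                         # assumed

    total = (data_collection
             + site_recruitment_and_randomization
             + analysis_and_reporting)
    print(f"${total:,.0f}")  # roughly $4.6 million, before contingency and overhead

With these assumptions the total already approaches $5 million; more cases, more survey waves, or more sites quickly push it toward $10 million.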

With timelines and costs like this, evaluation must be a cyclical activity. If we wait for one evaluation to end before starting the next one, few programs will be evaluated more than once a decade. With costs this high, even robust evaluation budgets will not support very much evaluation.

III. Current Evaluation Practice

The previous analysis is for a “best case”: the goal is an up or down evaluation. This paper began by arguing that such up or down verdicts are not very useful, especially for ongoing programs with strong political support that address real social problems. “Up” verdicts imply no change; “down” verdicts imply doing something—cancelling the program—which is unlikely to occur.

Instead, we argued that evaluations would be more useful if they answered questions such as these: Which program design works better? What is the optimal staff/client ratio? What is the optimal education level for staff? How much initial training should staff get? How much in-service training should staff get? Which curriculum should we use?

These better questions are amenable to random assignment evaluation, but answering them is even more expensive than the up or down evaluation just described. If we want to test alternative program models within the conventional approach, the program models can share a control group, but each additional model requires additional sample for its own treatment group. Thus, testing not one program model but two will cost approximately half again as much as testing one; testing three program models will cost twice as much as testing one.
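
The cost arithmetic here is simple enough to make explicit. The sketch below assumes, as the text does, that evaluation cost is roughly proportional to the number of randomized groups and that all treatment arms share a single control group; it is an illustration of the scaling, not a costing model.

    # Sketch of how cost scales with the number of program models tested, under
    # the simplifying assumption that cost is roughly proportional to the number
    # of randomized groups and that all treatment arms share one control group.
    def relative_cost(n_models):
        """Cost of testing n_models program models, relative to testing one.

        Testing one model uses two groups (one treatment arm plus the shared
        control group); each additional model adds one treatment arm.
        """
        groups = 1 + n_models   # shared control group plus one arm per model
        baseline_groups = 2     # the one-model, up or down design
        return groups / baseline_groups

    for k in (1, 2, 3):
        print(k, relative_cost(k))  # 1.0, 1.5, 2.0: "half again as much", "twice as much"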

In the language of random assignment evaluation, such a design is a multi-armed experiment. For example, MFIP tested not just changing the welfare benefit structure, but also changing the benefit structure plus increased counseling (Knox, et al., 2000). The NEWWS evaluation tested HCD/Human Capital Development (i.e., education and training) vs. LFA/Labor Force Attachment (i.e., job search assistance) vs. control (Hamilton, 2002). SSA’s BOND evaluation will test a change to the benefit structure alone, and in combination with more intensive counseling (Stapleton, et al., 2010).