
On Linking Formative and Summative Functions

in the Design of Large-Scale Assessment Systems

Richard J. Shavelson[1]

Stanford University

Paul J. Black, Dylan Wiliam

King's College London

Janet Coffey

Stanford University

Submitted to Educational Evaluation and Policy Analysis

Abstract

The potential for mischief caused by well-intended accountability systems is legion when expedient output measures are used as surrogates for valued outcomes. Such misalignment occurs in education when external achievement tests are not well aligned either with curriculum/standards or with high-quality teaching methods. Tests then become outcomes in themselves, driving curriculum, teaching and learning. We present three case studies that show how polities have crafted assessment systems that attempted to align testing for accountability (“summative assessment”) with testing for learning improvement (“formative assessment”). We also describe how two of these systems failed to achieve their intended implementation. From these case studies, we set forth decisions for designing accountability systems that link information useful for teaching and learning with information useful to those who hold education accountable.


On Linking Formative and Summative Functions

in the Design of Large-Scale Assessment Systems

The proposition that democracy requires accountability of citizens and officials is a universal tenet of democratic theory. There is less agreement as to how this objective is to be accomplished—James March and Johan Olsen (1995, p. 162)

Democracy requires that public officials, public and private institutions, and individuals be held accountable for their actions, typically by providing information on their actions and imposing sanctions. This is no less true for education, public and private, than for other institutions; the demand for holding schools accountable is part of the democratic fabric.

The democratic concept of accountability is noble. However, in practice it can fall short of the ideal. For, as March and Olsen (1995, p. 141) put it, “The events of history are frequently ambiguous. Human accounts of those events characteristically are not. Accounts provide interpretations and explanations of experience” that make sense within a cultural-political framework. They (March & Olsen, p. 141) go on to point out that “Formal systems of accounting used in economic, social and political institutions are accounts of political reality” (e.g., students’ test performance represents outcomes in a political reality). As such, accounts have a profound impact on the actors involved. On the one hand, accounts focus social control and make actors responsive to social pressure and standards of appropriate behavior; on the other hand, they tend to reduce risk-taking that might become public, make decision makers cautious about change, and reinforce current courses of action even when those courses appear to have failed (March, 200?).

In this paper our focus is on human educational events—teaching, learning, outcomes—that are by their very nature ambiguous but get accounted for unambiguously in the form of test scores, league tables, and the like, with significant impact on education. For accountability information, embedded in an interpretative, political framework, is, indeed, a very powerful policy instrument. Yet the potential for mischief is legion. Our intent is to see whether we can improve one aspect of accountability—the part that deals with information—so as to strengthen both its validity and its positive impact. Specifically, our intent is to set forth design choices for accountability systems that link, to a greater or lesser extent, information useful to teaching and learning with information useful to those holding education responsible—policy makers, parents, students, and citizens. For the current mismatch between information to improve teaching and learning and information to inform the public of educational quality has substantial negative consequences for valued educational outcomes.

Before turning to the linkage issue, a caveat is in order. To paraphrase March and Olsen, educational events are frequently ambiguous… and messy, complicated, and political. To focus this paper, we are guilty of simplifying our portrayal of educational events; our “human accounts of those events are characteristically not [ambiguous]” though perhaps they should be. Teachers, for example, face conflict every day in gathering information on student performance: they must help students close the gap between what they know/can do and what they need to know/be able to do on the one hand, and evaluate students’ performance for the purpose of grading on the other. This creates considerable conflict and complexity that, due to space, we have reluctantly omitted (see, for example, Atkin, Black & Coffey, 2002; Atkin & Coffey, 2001).

The Linkage Issue

In a simple-minded way, we might think of education as producing an outcome such as highly productive workers, enlightened citizens, life-long learners, or some combination of these. Inputs (resources) are transformed through a process (use of resources) into outputs (products) that are expected to contribute to the realization of valued outcomes. “With respect to schools, for example, inputs include such things as teachers and students, processes include how teachers and students spend their time during the school day, outputs include increased cognitive abilities as measured by tests, and outcomes include the increased capacity of the students to participate effectively in economic, political and social life” (e.g., Gormley & Weimer, 1999, pp. 7-8).

When outputs are closely and accurately linked to inputs and processes on the one hand, and to outcomes on the other, the accounts of actors and their actions may be valid—they may provide valuable information for improving the system (processes given inputs) on the one hand, and for accounting to the public for outcomes on the other. However, great mischief can be done by an accountability system that does not, at least, closely link outputs to outcomes. For outputs (e.g., broad-scope multiple-choice test scores) quickly become valued outcomes from the education system’s perspective, and those outputs may provide information neither closely related to outcomes nor needed to improve processes.

Current educational accountability systems use large-scale assessments for monitoring achievement over time. These assessments can be characterized as outputs (e.g., broad-spectrum multiple-choice or short-answer questions of factual or procedural recall) that are either distal from outcomes and processes or that have become the desired outcomes themselves. Consider, for example, the algebra test item: Simplify, if possible, 5a + 2b (Hart, Brown, Kerslake, Kuchemann, & Ruddock, 1985). Many teachers regard this item as unfair for large-scale testing since students are “tricked” into simplifying the expression because of the prevailing “didactic contract” (Brousseau, 1984) under which students assume that there is “academic work” (Doyle, 1983) to be done; doing nothing cannot count as academic work and, students assume, will bring a low mark. The fact that they are tempted to simplify the expression in the context of a test question when they would not do so in other contexts means that this item may not be a very good question to use in a test for external accountability purposes.[2]

Indeed, testing for external accounting purposes may not align with testing for improving teaching and learning. Accountability requires test standardization, while improvement may involve dynamic assessment of learning; testing for accountability typically employs a uniform testing date, while assessment for improvement is ongoing; testing for accountability leaves the student unassisted, while assessment for improvement might involve assistance; results of accountability tests are delayed, while assessment for improvement must be immediate; and testing for accountability must stress reliability, while clinical judgment plays an important role over time in improvement (Shepard, 2003). Note that this version of accountability reflects current education practice. It is not inevitable; as Jim March reminds us, there is little agreement on how to accomplish democracy’s accountability function, and this is especially true in education.

The potential for mischief and negative consequences is heightened when public accountability reports are accompanied by strong sanctions and rewards, such as student graduation or teacher salary enhancements. Test items may be stolen, teachers may teach the answers to the test, and administrators may change students’ answers on the test (e.g., Assessment Reform Group, 2002; Black, 1993; Shepard, 2003).

In this paper, we set forth design choices for tightening the links among educational outputs, the processes for improving performance on them, and desired outcomes. We begin with a framework for viewing assessment of learning for improving teaching-learning processes and for external accounting purposes. We then turn to large-scale assessment systems that have attempted such alignment, treating them as existence proofs and as sources for identifying design choices. To be sure, alignment has been attempted in the past with varying degrees of success. From these “cases” we extract design choices, choices that are made, intentionally or not, in the construction of large-scale assessment systems. We then show the pattern of design choices that characterizes each of our cases.

Functions of Large-Scale Assessments: Evaluative, Summative and Formative

Large-scale testing systems may serve three functions: evaluative, summative, and formative. The first function is evaluative. The system provides (often longitudinal) information with which to evaluate institutions and curricula. To this end, the focus is on the system; samples of students, teachers, and schools provide information for drawing inferences about institutional or curricular performance. The National Assessment of Educational Progress in the United States and the Third International Mathematics and Science Study are examples of the evaluative function. We are not concerned here with this function[3]; rather, we focus on the summative and formative functions of assessment.

The second function is summative. The system provides direct information on individual students (e.g., teacher grades—not our focus here—and external test scores) and, by aggregation, indirect information on teachers (e.g., for salaries) and schools (e.g., for funding) for the purposes of grading, selecting, and promoting students, certifying their achievement, and accountability. The General Certificate of Secondary Education examination in the United Kingdom is one example of this function, as are statewide assessments of achievement in the United States such as California’s Standardized Testing and Reporting program or the Texas Assessment of Academic Skills.

The third function is formative. The system provides information directly and immediately on individual students’ achievement or potential to both students and teachers. Formative assessment of student learning can be informal, when incidental evidence of achievement is generated in the course of a teacher’s day-to-day activities and the teacher notices that a student has some knowledge or capacity of which she was not previously aware. Or it can be formal, the result of a deliberate teaching act designed to provide evidence about a student’s knowledge or capabilities in a particular area. Formal formative assessment most commonly takes the form of direct questioning (whether oral or written),[4] but it also takes the form of curriculum-embedded assessments (with known reliability and validity) that focus on some aspect of learning (e.g., a mental model of the sun-earth relationship that accounts for day and night or the change in seasons) and that permit the teacher to stand back and observe performance as it evolves. The goal here is to identify the gap between desired performance and a student’s observed performance so as to improve student performance through immediate feedback on how to do so. Formative assessment also provides “feed-forward” to teachers as to where, on what, and possibly how to focus their teaching immediately and in the future.

While most people are aware of summative assessment, few are aware of formative assessment and the evidence of its large, positive impact on student learning (e.g., Black & Wiliam, 1998). Perhaps a couple of examples of formative feedback, then, would be helpful. Consider, for instance, teacher questioning, a ubiquitous classroom event. Many teachers do not plan and conduct classroom questioning in ways that might help students learn. Rowe’s (1974) research showed that when teachers paused to give students an opportunity to answer a question, the level of intellectual exchange increased. Yet teachers typically ask a question and give students about one second to answer. As one teacher came to realize (Black, Harrison, Lee, Marshall, & Wiliam, 2002, p. 5):

Increasing waiting time after asking questions proved difficult to start with—due to my habitual desire to ‘add’ something almost immediately after asking the original question. The pause after asking the question was sometimes ‘painful’. It felt unnatural to have such a seemingly ‘dead’ period, but I persevered. Given more thinking time students seemed to realize that a more thoughtful answer was required. Now, after many months of changing my style of questioning I have noticed that most students will give an answer and an explanation (where necessary) without additional prompting.

As a second example, consider the use of curriculum-embedded assessments. These assessments are embedded in the on-going curriculum; they serve to guide teaching and create opportunities for immediate feedback to students on their developing understandings. In a joint project between the Stanford Education Assessment Laboratory and the Curriculum Research and Development Group at the University of Hawaii, modifications have been made to the Foundational Approaches in Science Teaching (FAST) middle-school curriculum. A set of assessments designed to tap declarative knowledge (“knowing that”), procedural knowledge (“knowing how”) and schematic knowledge (“knowing why”) has been embedded at four natural transitions or “joints” in an 8-week unit on buoyancy—some assessments are repeated to create a time series (e.g., “Why do things sink or float?”) and some (multiple-choice, short-answer, concept-map, performance assessment) focus on the particular concepts, procedures and models that led up to the joints. The assessments serve to focus teaching on different aspects of learning about mass, volume, density and buoyancy. Feedback on performance is immediate and focuses on constructing conceptual understanding based on empirical evidence. For example, assessment items (graphs, “predict-observe-explain,” and short answer) that tap declarative, procedural and schematic knowledge are given to students at a particular joint. Students then debate different explanations of sinking and floating based on evidence in hand.

While the dichotomy of formative and summative assessment seems perfectly unexceptional, it appears to have had one serious consequence. Significant tensions are created when the same assessments are required to serve multiple functions, and few believe that a single system can adequately serve both. At least two coordinated or aligned systems are required: formative and summative. Both functions require that evidence of performance or attainment be elicited and then interpreted, and that, as a result of that interpretation, some action be taken. Such action will then, directly or indirectly, generate further evidence, leading to subsequent interpretation and action, and so on.

Tensions arise between the formative and summative functions in each of three areas: evidence elicited, interpretation of evidence, and actions taken. First, consider evidence. As Shepard (2003) pointed out, issues of reliability and validity are paramount in the summative function because, typically, a “snapshot” of the breadth of students’ achievement is sought at one point in time. The forms of assessment used to elicit evidence are likely to differ from summative to formative. In summative assessment, typical “objective” or “essay” tests are given on a particular occasion. In contrast, with formative assessment, students’ real-time responses are given to one another in group work, to a teacher’s question, to the activity they are engaged in or to a curriculum-embedded test. Moreover, the summative and formative functions differ in the reliability and validity of the scores produced. In summative assessment, each form of a test needs to be internally consistent and scores from these forms need to be consistent from one rater to the next or from one form to the next. The items on the tests need to be a representative sample of items from the broad knowledge domain defined by the curriculum syllabus/standards. In contrast, as formative assessment is iterative, issues of reliability and validity are resolved over time with corrections made as information is collected naturally in everyday student performance. Finally, the same test question might be used for both summative and formative assessment but, as shown with the simplify item (simplify 5a + 2b), interpretation and practical uses will probably differ (e.g., Wiliam & Black, 1996).
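
To make the summative reliability requirements concrete, the following sketch illustrates, with hypothetical data, two of the consistency indices such a program would typically examine: Cronbach’s alpha for the internal consistency of a test form, and a simple correlation between two raters’ scores. The item-score matrix, rater scores, and names used here are illustrative assumptions only, not data from any of the assessment systems discussed in this paper.

```python
# Illustrative sketch only: hypothetical item and rater scores used to show
# the kind of consistency evidence that summative testing demands.
import numpy as np

# Rows = students, columns = items (hypothetical right/wrong scores).
item_scores = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
], dtype=float)

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency: how coherently the items on a form hang together."""
    k = scores.shape[1]                               # number of items
    item_var = scores.var(axis=0, ddof=1).sum()       # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Inter-rater consistency: correlation between two raters' scores on the same
# essays (again, hypothetical numbers for illustration).
rater_a = np.array([4, 3, 5, 2, 4], dtype=float)
rater_b = np.array([4, 2, 5, 3, 4], dtype=float)
inter_rater_r = np.corrcoef(rater_a, rater_b)[0, 1]

print(f"Cronbach's alpha: {cronbach_alpha(item_scores):.2f}")
print(f"Inter-rater correlation: {inter_rater_r:.2f}")
```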

The potential conflict between summative and formative assessment can also be seen in the interpretation of evidence. Typically, the summative function calls for a norm-referenced or cohort-referenced interpretation in which students’ scores come to have meaning relative to their standing among peers. Such comparisons typically combine complex performances into a single number and put the performance of individuals into some kind of rank order. A norm- or cohort-referenced interpretation would indicate how much better an individual needs to do, pointing to the existence of a gap, rather than giving an indication of how that improvement is to be brought about. It tells individuals that they need to do better rather than telling them how to improve.[5]

The alternative to norm-referenced interpretation in large-scale assessment is criterion- or domain-referenced interpretation, with a focus on the amount of knowledge rather than on rank ordering. Summative assessment, in this case, would report on the level of performance of individuals or schools (e.g., the percent of the domain mastered), perhaps with respect to some desired standard of performance (e.g., the proportion of students above the standard). In this case, the summative assessment would, for example, certify competence.[6]
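
The distinction can be illustrated with a small, hypothetical calculation: the same raw score supports a norm-referenced reading (a percentile rank within a cohort) and a criterion- or domain-referenced reading (the proportion of the sampled domain mastered, judged against a cut score). The cohort scores, test length, and cut score below are illustrative assumptions, not values from any operational program.

```python
# Illustrative sketch only: how one hypothetical raw score yields both a
# norm-referenced and a criterion-/domain-referenced interpretation.
import numpy as np

cohort_scores = np.array([18, 22, 25, 27, 29, 30, 31, 32, 35, 38, 40, 44])  # hypothetical
student_score = 32
n_items = 50
cut_score = 0.70  # hypothetical "proficient" standard: 70% of the domain sample

# Norm-referenced: where does the student stand relative to peers?
percentile_rank = 100 * np.mean(cohort_scores < student_score)

# Criterion-/domain-referenced: how much of the sampled domain was mastered?
proportion_mastered = student_score / n_items
meets_standard = proportion_mastered >= cut_score

print(f"Percentile rank in cohort: {percentile_rank:.0f}")          # rank-order meaning
print(f"Proportion of domain mastered: {proportion_mastered:.0%}")  # amount-of-knowledge meaning
print(f"Meets the proficiency standard: {meets_standard}")
```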

Formative assessment, in contrast, provides students and teachers with information on how well someone has done and how to improve, rather than on what they have done and how they rank. For this purpose, a criterion- or domain-referenced interpretation is needed. Such an interpretation focuses on the gap between what a student knows and is able to do and what is expected of a student in that knowledge domain. However, formative assessment goes beyond domain referencing in that it also needs to be interpreted in terms of learning needs—it should be diagnostic (domain-referenced) and remedial (how to improve learning). The essential condition for an assessment to function diagnostically is that it must provide evidence that can be interpreted in a way that suggests what needs to be done next to close the gap.