DRAFT

Issues Related to Judging the Alignment of Curriculum Standards and Assessments

Norman L. Webb

Wisconsin Center for Education Research

University of Wisconsin–Madison

Annual Meeting of the American Educational Research Association,

Montreal, April 11, 2005

This work was supported by a subgrant from the U.S. Department of Education (S368A030011) to the State of Oklahoma and a grant from the National Science Foundation (EHR 0233445) to the University of Wisconsin–Madison. Any opinions, findings, or conclusions are those of the author and do not necessarily reflect the views of the supporting agencies.


Issues Related to Judging the Alignment of Curriculum Standards and Assessments

Norman L. Webb

Introduction

Alignment among policy documents, curriculum materials, and instructional practice has taken on increased importance over the past 10 to 15 years. In the early 1990s, a major tenet of the efforts toward systemic reform was to have the components of the system aligned with one another (Smith & O’Day, 1991). The Improving America’s Schools Act of 1994 (IASA), which reauthorized Title I of the Elementary and Secondary Education Act of 1965 (ESEA), required states to use assessments aligned with curriculum standards, a requirement very much attuned to the theory of systemic reform. Continuing with the same principle, the No Child Left Behind Act of 2001 made the assessment requirements in reading and mathematics more explicit and required states to demonstrate that their assessments in grades 3 through 8 and once during high school are aligned with challenging academic content standards.

Aware of the increasing importance of alignment, Webb (1997) wrote a monograph on the criteria for judging alignment for the National Institute for Science Education, encouraged by Andrew Porter, who was the principal investigator for the institute. This monograph discussed in some detail the methods states and other jurisdictions used to determine alignment and the criteria that can be used to evaluate the alignment of a system. The monograph was written as one document related to the study of the evaluations of systemic reform motivated by the National Science Foundation’s systemic reform program and in close cooperation with the Council of Chief State School Officers (CCSSO). It describes in detail criteria that can be used to judge the alignment between standards and assessments within an educational system. The five major alignment criteria developed by Webb are content focus, pedagogical implications, equity, articulation across grades and ages, and system applicability.

During the mid-1990s, CCSSO devoted significant effort to analyzing state standards and was interested in a process for analyzing agreement between state standards and assessments. In cooperation with CCSSO, Webb then developed a content-analysis process for judging the alignment between standards and assessments. This process used four of the six criteria identified under the major content focus criterion described in the alignment monograph: categorical concurrence, depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of representation.

In 1998, the newly developed alignment process was used for the first time, in cooperation with CCSSO, to analyze the alignment of the curriculum standards and assessments of four states. Four to five reviewers coded the depth-of-knowledge (DOK) levels of the standards and the assessment items using paper-and-pencil forms. These data were hand-entered into an Excel file and then analyzed using procedures developed with the help of John Smithson.

Over the next two years, the alignment process was refined and used to conduct alignment analyses in additional states. The definitions for the depth-of-knowledge (DOK) levels for four content areas (reading and writing, mathematics, science, and social studies) were written and refined after each analysis. Another monograph on the alignment process was published by CCSSO in 2002 (Webb, 2002).

In 2003, the state of Oklahoma, with the cooperation of CCSSO and the Technical Issues in Large-Scale Assessment (TILSA) collaborative, received a grant from the United States Department of Education to develop an electronic tool, distributed on a CD, that could be used to do the alignment analysis. The work on the electronic tool had begun with the support of a 2002 grant from the National Science Foundation for the purpose of providing technical assistance to the initiative to create Mathematics and Science Partnerships among K–12 school districts and institutions of higher education. The major work on the Web Alignment Tool (WAT) began in 2003. The alpha test of the WAT was conducted in Delaware in August 2003 by analyzing standards and assessments from three states: Delaware English Language Arts (grades 3, 5, 8, and 10), mathematics (grades 3 and 8), and science (grades 4, 6, 8, and 11); South Carolina English Language Arts (grade 10) and science (high school biology); and Oklahoma mathematics (grade 8 and Algebra I) and science (high school biology). The on-line beta test of the tool was conducted in Delaware in September 2003 for mathematics grades 5 and 10. The beta test of the CD version of the WAT was conducted in Alabama in January 2004 for mathematics grades 3, 5, 7, and 9.

In 2004, the on-line WAT was used to conduct additional analyses for four states. Currently, the WAT exists both as an on-line tool (r.wisc.edu/WAT) and on a CD. One dissemination conference on how to use the alignment tools was conducted for states west of the Mississippi on February 28 and March 1, 2005, in Phoenix. A second dissemination conference is to be conducted for states east of the Mississippi in Boston on July 25 and 26, 2005.

The Webb alignment process is one of a handful of alignment processes currently in use (Blank, 2002). Porter and Smithson (Porter, 2002) developed a process referred to as the Survey of the Enacted Curriculum (SEC). Central to this process is a content-by-cognitive-level matrix. Reviewers systematically categorize standards, assessments, curriculum, or instructional practices onto the matrix, indicating the degree of emphasis in each cell. Comparisons, or the degree of alignment, are made by considering the amount of overlap between the cell entries produced for any two elements of the analysis (assessment and standards, curriculum and standards, standards and instruction, etc.).
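Such a cell-by-cell comparison is often summarized with a single overlap index. The Python sketch below illustrates one commonly reported form of such an index, in which each matrix is converted to cell proportions and the index is one minus half the total absolute difference; the function name and the example matrices are illustrative assumptions rather than material taken from the SEC documentation.

    import numpy as np

    def cell_overlap_index(emphasis_a, emphasis_b):
        """Overlap between two content-by-cognitive-level emphasis matrices.

        Each matrix is normalized to proportions; identical profiles score 1.0
        and completely disjoint profiles score 0.0.
        """
        a = np.asarray(emphasis_a, dtype=float)
        b = np.asarray(emphasis_b, dtype=float)
        a = a / a.sum()
        b = b / b.sum()
        return 1.0 - 0.5 * np.abs(a - b).sum()

    # Illustrative emphasis counts: three content topics by two cognitive levels.
    standards_matrix = [[4, 2], [3, 1], [0, 2]]
    assessment_matrix = [[5, 1], [2, 2], [1, 1]]
    print(round(cell_overlap_index(standards_matrix, assessment_matrix), 2))  # 0.75

An index near 1.0 indicates that the two documents distribute their emphasis over the matrix in nearly the same way.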

Achieve, Inc., has developed another process that is based on a group of experts reaching consensus on the degree to which the assessment-by-standard mapping conducted by a state or district is valid. This process reports on five criteria: Content Centrality, Performance Centrality, Source of Challenge, Balance, and Range. For Content Centrality and Performance Centrality, reviewers reach a consensus as to whether the item and its intended objective(s) correspond fully, partially, or not at all. Achieve prepares an extensive narrative to describe the results of the review and will include a “policy audit” of the standards and the assessment system if desired.

Webb Alignment Process

Generally, the alignment process is performed during a three-day Alignment Analysis Institute. The length of the institute depends on the number of grades to be analyzed, the length of the standards, the length of the assessments, and the number of assessment forms. Five to eight reviewers generally conduct each analysis; a larger number of reviewers increases the reliability of the results. Reviewers should include content-area experts, district content-area supervisors, and content-area teachers.

To standardize the language, the process employs the convention of standards, goals, and objectives to describe three levels of expectations for what students are to know and do. Standard is used here as the most general level (for instance, Data Analysis and Statistics). A standard, most of the time, comprises a specific number of goals, each of which in turn comprises a specific number of objectives. Generally, but not always, there is an assumption that the objectives are intended to span the content of the goals and standards under which they fall.
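As a purely illustrative sketch, the three-level convention amounts to a nested structure in which each standard holds goals and each goal holds objectives; the class and field names below are hypothetical and are not part of the process documentation.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Objective:
        """Most specific statement of what students are to know and do."""
        text: str
        dok_level: int = 0          # depth-of-knowledge level assigned by reviewer consensus

    @dataclass
    class Goal:
        """Intermediate level; groups related objectives."""
        text: str
        objectives: List[Objective] = field(default_factory=list)

    @dataclass
    class Standard:
        """Most general level, for instance Data Analysis and Statistics."""
        title: str
        goals: List[Goal] = field(default_factory=list)

    example = Standard(
        title="Data Analysis and Statistics",
        goals=[Goal(text="Collect and organize data",
                    objectives=[Objective(text="Construct a frequency table", dok_level=2)])],
    )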

Reviewers are trained to identify the depth-of-knowledge levels of objectives and assessment items. This training includes reviewing the definitions of the four depth-of-knowledge (DOK) levels and then reviewing examples of each. The reviewers then participate in (1) a consensus process to determine the depth-of-knowledge levels of the state’s objectives and (2) individual analyses of the items on each of the assessments. Following the individual analyses of the items, reviewers participate in a debriefing discussion in which they give their overall impressions of the alignment between the assessment and the state’s curriculum standards.

To derive the results on the degree of agreement between the state’s standards and each assessment, the reviewers’ responses are averaged. Any variance among reviewers is considered legitimate, with the true depth-of-knowledge level for the item falling somewhere between the two or more assigned values. Such variation could signify a lack of clarity in how the objectives were written, the robustness of an item that can legitimately correspond to more than one objective, and/or a depth of knowledge that falls between two of the four defined levels. Reviewers are allowed to identify one assessment item as corresponding to up to three objectives: one primary hit (objective) and up to two secondary hits. However, reviewers can code only one depth-of-knowledge level for each assessment item, even if the item corresponds to more than one objective. Finally, in addition to learning the process, reviewers are asked to provide suggestions for improving the process.
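A hypothetical sketch of how such codings might be tabulated follows; the record layout, names, and example data are illustrative assumptions, not the actual WAT implementation. Each reviewer’s coding of an item carries its primary and secondary hits and a single DOK level, and the item-level DOK is then averaged across reviewers, with the standard deviation giving one indication of the variance among them.

    from dataclasses import dataclass, field
    from statistics import mean, pstdev
    from typing import List

    @dataclass
    class ItemCoding:
        reviewer: str
        primary_hit: str                                          # objective code for the primary hit
        secondary_hits: List[str] = field(default_factory=list)   # at most two secondary hits
        dok_level: int = 1                                        # one DOK level per item per reviewer

    def summarize_item(codings: List[ItemCoding]):
        """Average DOK across reviewers and the spread of their judgments."""
        levels = [c.dok_level for c in codings]
        return mean(levels), pstdev(levels)

    codings = [
        ItemCoding("Reviewer 1", "1.2.a", ["1.3.b"], dok_level=2),
        ItemCoding("Reviewer 2", "1.2.a", dok_level=2),
        ItemCoding("Reviewer 3", "1.3.b", ["1.2.a"], dok_level=3),
    ]
    print(summarize_item(codings))   # average DOK of about 2.3 with a spread of about 0.5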

Reviewers are instructed to focus primarily on the alignment between the state standards and the various assessments. However, they are encouraged to offer their opinions on the quality of the standards, or of the assessment activities/items, by writing a note about the items. Reviewers can also indicate whether there is a source-of-challenge issue with an item, that is, a problem with the item that might cause a student who knows the material to give a wrong answer, or enable a student who does not have the knowledge being tested to answer the item correctly. For example, a mathematics item that involves an excessive amount of reading may represent a source-of-challenge issue because the skill required to answer it is more a reading skill than a mathematics skill. Source of challenge can be considered a fifth alignment criterion in the analysis and was originally so defined by Achieve, Inc.

The results produced from the institute pertain only to the issue of agreement between the state standards and the assessment instruments. Thus, the alignment analysis does not serve as external verification of the general quality of a state’s standards or assessments. Rather, only the degree of alignment is discussed in the results. The averages of the reviewers’ codings are used to determine whether the alignment criteria are met. When reviewers do vary in their judgments, the averages lessen the error that might result from any one reviewer’s finding. Standard deviations, which give one indication of the variance among reviewers, are also reported.

The report of an alignment study of a state’s curriculum standards and assessments for different grade levels addresses specific criteria related to the content agreement between the state standards and each grade-level assessment. Four alignment criteria receive major attention in the reports: categorical concurrence, depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of representation.

Alignment Criteria Used for This Analysis

The analysis, which judges the alignment between standards and assessments on the basis of four criteria, also reports on the quality of assessment items by identifying those items with sources of challenge and other issues. For each alignment criterion, an acceptable level is defined by what would be required to assure that a student had met the standards.

Categorical Concurrence

An important aspect of alignment between standards and assessments is whether both address the same content categories. The categorical-concurrence criterion provides a very general indication of alignment: it is met if the same or consistent categories of content appear in both documents. This criterion was judged by determining whether the assessment included items measuring content from each standard. The analysis assumed that the assessment had to have at least six items measuring content from a standard in order for an acceptable level of categorical concurrence to exist between the standard and the assessment.

The number of items, six, is based on estimating the number of items that could produce a reasonably reliable subscale for estimating students’ mastery of the content on that subscale. Of course, many factors have to be considered in determining what a reasonable number is, including the reliability of the subscale, the mean score, and the cutoff score for determining mastery. Using a procedure developed by Subkoviak (1988) and assuming that the cutoff score is the mean and that the reliability of one item is .1, it was estimated that six items would produce an agreement coefficient of at least .63. This indicates that about 63% of the group would be consistently classified as masters or nonmasters if two equivalent test administrations were employed. The agreement coefficient would increase to .77 if the cutoff score were set at one standard deviation above the mean, and to .88 with a cutoff score of 1.5 standard deviations above the mean. Usually, states do not report student results by standard or require students to achieve a specified cutoff score on subscales related to a standard. If a state did do this, then the state would seek a higher agreement coefficient than .63. Six items were assumed as a minimum for an assessment measuring content knowledge related to a standard and as a basis for making some decisions about students’ knowledge of that standard. If the mean for six items is 3 and one standard deviation is one item, then a cutoff score set at 4 would produce an agreement coefficient of .77. Any fewer items, with a mean of one-half of the items, would require a cutoff that would allow a student to miss only one item. This would be a very stringent requirement, considering a reasonable standard error of measurement on the subscale.
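To make the statistical reasoning concrete, the Python sketch below computes a classification agreement coefficient under a bivariate-normal approximation, with the six-item reliability obtained from the single-item reliability of .1 by the Spearman-Brown formula. It is an illustration of the kind of calculation involved, not the table-based procedure of Subkoviak (1988), so its values for nonzero cutoff scores differ slightly from the figures cited above.

    from scipy.stats import multivariate_normal, norm

    def spearman_brown(r_item, n_items):
        """Reliability of an n-item subscale projected from a one-item reliability."""
        return n_items * r_item / (1 + (n_items - 1) * r_item)

    def agreement_coefficient(z_cut, reliability):
        """Probability of the same master/nonmaster classification on two parallel
        forms, treating the two standardized scores as bivariate normal with
        correlation equal to the subscale reliability."""
        cov = [[1.0, reliability], [reliability, 1.0]]
        both_below = multivariate_normal.cdf([z_cut, z_cut], mean=[0.0, 0.0], cov=cov)
        both_above = 1.0 - 2.0 * norm.cdf(z_cut) + both_below
        return both_below + both_above

    rel_6 = spearman_brown(0.1, 6)      # about .40 for a six-item subscale
    for z_cut in (0.0, 1.0, 1.5):       # cutoff at the mean, +1 SD, and +1.5 SD
        print(f"cutoff {z_cut:+.1f} SD: agreement about {agreement_coefficient(z_cut, rel_6):.2f}")

With the cutoff at the mean, the sketch reproduces the agreement coefficient of about .63 noted above; raising the cutoff increases the coefficient, consistent with the pattern described in the text.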

Depth-of-Knowledge Consistency

Standards and assessments can be aligned not only on the category of content covered by each, but also on the basis of the complexity of knowledge required by each. Depth-of-knowledge consistency between standards and assessment indicates alignment if what is elicited from students on the assessment is as demanding cognitively as what students are expected to know and do as stated in the standards. For consistency to exist between the assessment and the standard, as judged in this analysis, at least 50% of the items corresponding to an objective had to be at or above the depth-of-knowledge level of that objective. This 50% cutoff point is conservative; it is based on the assumption that a minimal passing score for any one standard of 50% or higher would require the student to successfully answer at least some items at or above the depth-of-knowledge level of the corresponding objectives. For example, assume an assessment included six items related to one standard and students were required to answer four of those items correctly (67%) to be judged proficient. If three (50%) of the six items were at or above the depth-of-knowledge level of the corresponding objectives, then a student, to achieve a proficient score, would have to answer correctly at least one item at or above the depth-of-knowledge level of one objective. Some leeway was used in the analysis on this criterion. If a standard had between 40% and 50% of its items at or above the depth-of-knowledge levels of the objectives, then the criterion was reported as “weakly” met.
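A hypothetical sketch of this decision rule is given below; the example data, function name, and handling of the 40% boundary are illustrative assumptions rather than the exact coding used in the studies.

    def dok_consistency(hits):
        """Classify depth-of-knowledge consistency for one standard.

        `hits` holds (item_dok, objective_dok) pairs for each item written to an
        objective under the standard. At least 50% of items at or above the
        objective's DOK level meets the criterion; 40% to 50% is weakly met.
        """
        at_or_above = sum(1 for item_dok, obj_dok in hits if item_dok >= obj_dok)
        percent = 100.0 * at_or_above / len(hits)
        if percent >= 50:
            return "met", percent
        if percent >= 40:
            return "weakly met", percent
        return "not met", percent

    # Six illustrative items under one standard: (item DOK, objective DOK)
    hits = [(2, 2), (1, 2), (3, 2), (2, 3), (1, 1), (2, 2)]
    print(dok_consistency(hits))   # criterion met: four of six items are at or above level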