Educational Assessment

Education 276

Winter 2007-08, MW 3:15-5:05

Education ("Cubberley"), room 207

Edward H. Haertel

Education 276 is intended primarily for graduate students at both the master's and doctoral levels. It offers an overview of topics and issues in educational testing and assessment, including an introduction to concepts of reliability, validity, bias, and fairness; educational policy uses of achievement tests; and sources of group differences in test performance.

There are two 90-minute class meetings per week, following a lecture/discussion format. Students are expected to complete assigned readings prior to each class period, and to participate in class discussions. Two brief response papers are required, each 4-5 pages long (about 1000 words), to be turned in within one week of the class meeting devoted to the topic chosen. Each student may choose which two class sessions she/he wishes to write response papers for. A 10- to 15-page research paper, on a topic of the student's choice (subject to instructor approval), will be due at the end of the quarter. There will be no midterm or final. Letter grade, S/NC optional.

To save on printing costs and on charges for use of copyrighted materials, I have not included materials in the reader that are readily available online. These readings are indicated by asterisks. Some are publicly available and others (e.g., journal articles) are available to Stanford users. Please plan to complete readings before each class session.

Tentative course schedule

W 1/ 9 Writing questions, building tests

In-class activity, looking at construction of achievement test items; discussion of question formats.

M 1/14 An overview of educational assessment

Classroom assessment, accountability testing, admissions testing, and testing for individual placement/diagnosis. The business of testing. Sampling of test publishers' websites (list of URLs to be distributed in class).

*The marketplace for educational testing. NBETPP Statements, 2(3), April 2001. (Available online; scroll down to Vol. 2 No. 3.)

*Whiplash From Backlash? The Truth About Public Support for Testing, NCME Newsletter, Vol. 9, No. 3, September 2001. (Available online; scroll down to the September 2001 newsletter. Article begins at the bottom of the 3rd page.)

*Phelps, Richard P. (1997). The extent and character of system-wide student testing in the United States. Educational Assessment, 4(2), 89-121. (Available through E-Journals. This article is long and a bit dated; okay to skim.)

Phelps, Richard P. (1998). The demand for standardized student testing. Educational Measurement: Issues and Practice, 17(3), 5-23. (Available through E-Journals.)

W 1/16 Psychological foundations of assessment

Alternative models from cognitive and behavioral psychology that have served as underpinnings for educational assessment.

*National Research Council. (2001). Advances in the sciences of thinking and learning. In Knowing what students know: The science and design of educational assessment (Ch. 3, pp. 59-110). Committee on the Foundations of Assessment. Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.). Washington, DC: National Academy Press. (Available online; view and/or print one page at a time. Note: This is a lot of material. If you lack time/interest for the entire chapter, pay special attention to pp. 59-64 and the concluding subsections titled "Implications for Assessment" beginning on pages 71, 79, 90, 96, and 101. Also be sure to review the examples of assessments in "Boxes" throughout the chapter. I also strongly recommend the material on brain science, pp. 104-110.)

Baxter, G. P., & Glaser, R. (1998). Investigating the cognitive complexity of science assessments. Educational Measurement: Issues and Practice, 17(3), 37-45.


*Bransford, J. D., & Schwartz, D. L. (1999). Rethinking transfer: A simple proposal with multiple implications. In A. Iran-Nejad & P. D. Pearson (Eds.), Review of Research in Education (vol. 24, pp. 61-101). Washington: American Educational Research Association. (Available through E-Journals.)

M 1/21 Martin Luther King, Jr. Day (University Holiday, no classes)

W 1/23 Measurement concepts

Reliability and validity; definitions and properties of derived score scales; properties of individual-level versus aggregate-level test scores.
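
For students who want this session's concepts made concrete, here is a minimal sketch (not part of the assigned work; the data and code are invented for illustration) of how one common internal-consistency reliability estimate, Cronbach's alpha, is computed from a persons-by-items score matrix:

    # Minimal sketch: Cronbach's alpha from invented 0/1 item scores.
    # alpha = (k/(k-1)) * (1 - sum of item variances / variance of totals)
    import numpy as np

    scores = np.array([  # rows = examinees, columns = items (invented data)
        [1, 1, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 0, 1, 1],
    ])

    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(f"Cronbach's alpha = {alpha:.3f}")

The Traub and Rowley module below develops the theory behind estimates of this kind.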

*Traub, Ross E., & Rowley, Glenn L. (1991). NCME Instructional Module: Understanding reliability. Educational Measurement: Issues and Practice, 10(1), 37-45. (Available online; scroll down to Module 8. Please read pp. 171-175. Okay to stop when you get to "Estimating Reliability," unless you find the remaining material of interest.)

*Green, Bert F. (1981, October). A primer of testing. American Psychologist, 36, 1001-1011. (Available through E-Journals.)

*Messick, Samuel. (1995, September). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749. (Available through E-Journals.)

M 1/28 Bias and fairness in educational testing

Definitions of bias and fairness; item bias versus test bias; how bias is addressed in the creation of published tests; rudiments of statistical approaches to biased item detection; the Golden Rule approach and why it was a bad idea.
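
As a concrete (and deliberately oversimplified) taste of the statistical side of this session, here is a minimal sketch of a Mantel-Haenszel check for differential item functioning (DIF), one of the procedures surveyed in the Clauser and Mazor module below. The data are simulated and the code is illustrative only:

    # Minimal sketch: Mantel-Haenszel DIF check for one dichotomous item.
    # Examinees are stratified on total score; within each stratum a 2x2
    # table compares the item's odds of success for the reference group
    # versus the focal group. A common odds ratio near 1.0 suggests no DIF.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    group = rng.integers(0, 2, n)             # 0 = reference, 1 = focal
    total = rng.integers(0, 21, n)            # score on rest of the test
    p = 1 / (1 + np.exp(-(total - 10) / 3))   # success tied to ability only
    item = (rng.random(n) < p).astype(int)    # 0/1 response to studied item

    num = den = 0.0
    for s in np.unique(total):                # one 2x2 table per stratum
        m = total == s
        a = ((group[m] == 0) & (item[m] == 1)).sum()  # reference correct
        b = ((group[m] == 0) & (item[m] == 0)).sum()  # reference incorrect
        c = ((group[m] == 1) & (item[m] == 1)).sum()  # focal correct
        d = ((group[m] == 1) & (item[m] == 0)).sum()  # focal incorrect
        num += a * d / m.sum()
        den += b * c / m.sum()

    print(f"MH common odds ratio = {num / den:.2f}")

Because group membership in these simulated data has no effect beyond ability, the ratio should come out near 1.0; operational procedures add a significance test and effect-size rules on top of this statistic.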

Introduction to chapter 7 of the Standards for Educational and Psychological Testing, "Fairness in testing and test use," pp. 73-80.

*Clauser, Brian E., & Mazor, Kathleen M. (1998). NCME Instructional Module: Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44. (Available online; scroll down past Module 17. The Clauser and Mazor module is about 4th from the bottom. Please read pp. 31-34 and just skim the remainder, unless more mathematical material is of interest.)

*The impact of score differences on the admission of minority students: An Illustration. NBETPP Statements, 1(5), June 2000. (Available online; scroll down to Vol. 1 No. 5.)

W 1/30 The "standards" in "standards-based educational reform"

Methods of standard setting, roles of different participants in standard-setting process.
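
To make one standard-setting method concrete, here is a minimal sketch (invented judgments, illustrative only) of the arithmetic behind a modified Angoff procedure: each panelist estimates, for each item, the probability that a just-qualified examinee would answer correctly, and the recommended cut score is the sum across items of the panel's mean judgments:

    # Minimal sketch: a modified Angoff cut-score computation.
    import numpy as np

    # rows = panelists, columns = items; entries are judged probabilities
    # that a "just qualified" examinee answers correctly (invented data)
    judgments = np.array([
        [0.6, 0.8, 0.4, 0.7, 0.5],
        [0.5, 0.9, 0.3, 0.6, 0.6],
        [0.7, 0.7, 0.5, 0.8, 0.4],
    ])

    cut = judgments.mean(axis=0).sum()  # mean per item, summed over items
    print(f"Recommended cut score: {cut:.1f} of {judgments.shape[1]} points")

The Haertel and Lorié paper below takes up how score interpretations based on cut scores of this kind can be validated.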

*Haertel, E. H., & Lorié, W. A. (2004). Validating standards-based test score interpretations. Measurement: Interdisciplinary Research and Perspectives, 2(2), 61-103. (Available through E-Journals. Please note: due to a typo in the journal name, it may be best to search using only the keywords "measurement interdisciplinary." Please read pp. 61-82. Okay to stop at "Selected Applications.")

Haertel, E. H. (2002). Standard setting as a participatory process: Implications for validation of standards-based accountability programs. Educational Measurement: Issues and Practice, 21(1), 16-22.

*Holbrook, P. J. (2001, June). When Bad Things Happen To Good Children: A Special Educator's Views of MCAS. Phi Delta Kappan, 82(10), 781-785. (Available through E-Journals.)

M 2/ 4 Models for measurement as a policy tool

Aptitude testing and tracking; testing and program evaluation; measurement-driven instructional management systems; criterion-referenced testing and mastery learning; minimum competency testing; authentic assessment; school-level accountability systems.

Haertel, E. H., & Herman, J. L. (2005). A historical perspective on validity arguments for accountability testing. In J. L. Herman & E. H. Haertel (Eds.), Uses and misuses of data for educational accountability and improvement (The 104th yearbook of the National Society for the Study of Education, Part 2, pp. 1-34). Malden, MA: Blackwell.

Haertel, E. H. (1999). Validity arguments for high-stakes testing: In search of the evidence. Educational Measurement: Issues and Practice, 18(4), 5-9.

*Elmore, Richard. (2002). Testing Trap. Harvard Magazine. (Available online.)

W 2/ 6 Performance Assessment, "Authentic Assessment"

The logic of attempts to harness the power of measurement-driven instruction ("What you test is what you get") for good rather than ill, with tests we would want teachers to teach to.

**Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New Tools for Educational Reform. In B. R. Gifford & M. C. O'Connor (Eds.), Changing assessments: Alternative views of aptitude, achievement and instruction (pp. 37-75). Boston: Kluwer. (On reserve in Cubberley Library.)


*Haertel, E. H. (1999, May). Performance assessment and education reform. Phi Delta Kappan, 80(9), 662-666. (Available through E-Journals.)

*Shavelson, R. J., Baxter, G. P., & Pine, J. (1992, May). Performance assessments: Political rhetoric and measurement reality. Educational Researcher, 21(4), 22-27. (Available through E-Journals.)

M 2/11 Findings on the effectiveness of measurement-driven reform

Empirical research on the effectiveness of accountability testing in raising educational achievement. Taken together, these papers address both testing systems with stakes for individual students and testing systems with stakes for schools.

*Bishop, J. H. (1997). The effect of national standards and curriculum-based exams on achievement. American Economic Review, 87(2), 260-264. (Available through E-Journals.)

*Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305-331. (Available through E-Journals.)

*Grissmer, D., & Flanagan, A. (1998, November). Exploring Rapid Achievement Gains in North Carolina and Texas. Lessons from the States (Paper commissioned by the National Education Goals Panel). (Available online. Okay to skim, but look for the connection [or lack of connection] between data and conclusions.)

*Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What Do Test Scores in Texas Tell Us? (RAND Issue Paper No. IP-202). (Available online.)

W 2/13 The impact of testing on curriculum and instruction

Research and perspectives on the ways external testing may affect the processes of teaching and learning.

Madaus, G. F. (1988). The Influence of Testing on the Curriculum. In L. N. Tanner (Ed.), Critical Issues in Curriculum. Eighty-seventh yearbook of the National Society for the Study of Education. Chicago: University of Chicago Press. (excerpt)

*Shepard, L. A. (1997). Insights Gained from a Classroom-Based Assessment Project (CSE Technical Report 451). Los Angeles: CRESST, UCLA. (Available online; enter 451 and click "Go.")

Green, D.R. (1998). Consequential Aspects of the Validity of Achievement Tests: A Publisher's Point of View. Educational Measurement: Issues and Practice, 17(2), 16-19; 34.

M 2/18 President's Day (University Holiday, no classes)

W 2/20 From Minimum Competency Tests (MCTs) to High School Exit Exams

Educational reforms featuring student-level accountability. Shifts in rhetoric from the 1980s to the present.

Cohen, David K., & Haney, Walt. (1980). Minimum Competency Achievement Testing: Motives, Models, Measures, and Consequences. In R. M. Jaeger and C. K. Tittle (Eds.), Minimum Competency Achievement Testing. Berkeley: McCutchan.

*Jacob, B. A. (2001). Getting Tough? The Impact of High School Graduation Exams. Educational Evaluation and Policy Analysis, 23(2), 99-121. (Available through E-Journals.)

M 2/25 State Testing in California: From CAP to NCLB

History of assessment in California, from the California Assessment Program (CAP) through CLAS, PTIP, STAR, the Public Schools Accountability Act of 1999, and the No Child Left Behind Act of 2001.

*Kirst, M. W., & Mazzeo, C. (1996). The Rise, Fall, and Rise of State Assessment in California: 1993-1996. Phi Delta Kappan, 78(4), 319-323. (Available through E-Journals.)

Some Principles and Beliefs about the Role of Assessment in California's School Reform Plan (n.d.)

W 2/27 National and International Assessments

Origins and evolution of the National Assessment of Educational Progress (NAEP); pressures toward uses of NAEP for accountability. Technical and logical challenges of cross-national comparisons. IEA assessments, TIMSS, OECD Indicators project, PISA, PIRLS. Comparison of testing in the U.S. vs. other countries.

Wattenberg, R. (1995-1996). Helping students in the middle. American Educator, 19(4), 2-18. (This American Educator article includes samples of exam questions from other countries.)

*Linn, Robert L., & Baker, Eva L. (1995). What Do International Assessments Imply for World-Class Standards? Educational Evaluation and Policy Analysis, 17(4), 405-418. (Available through E-Journals.)

The following reading is optional, but if you are interested in quantitative methods, you should skim it, at the very least.

*Rogosa, D. R. (2003). Four-peat: Data Analysis Results from Uncharacteristic Continuity in California Student Testing Programs. (Available online.)

M 3/ 3 Classroom testing and the culture of academic competition

Culture of competition, "grades as wages," the rise of the meritocracy.

Goldman, S. V., & McDermott, R. (1987). The culture of competition in American schools. In G. D. Spindler (Ed.), Education and Cultural Process: Anthropological Approaches (pp. 282-299). Prospect Heights, IL: Waveland Press.

*Pope, D. C., & Simon, R. (2005). Help for stressed students. Educational Leadership, 62(7), 33-37. (Available through E-Journals.)

*Stephens, J. (2004, May). Justice or Just Us? What to do about cheating. Carnegie Perspectives. (Available online.)

*Murdock, T. B., & Anderman, E. M. (2006). Motivational perspectives on student cheating: Toward an integrated model of academic dishonesty. Educational Psychologist, 41(3), 129-145. (Available through E-Journals.)

*McCabe, D., & Treviño, L. K. (2002). Honesty and Honor Codes. Academe, 88(1), 37-41. (Available online.)

W 3/ 5 The SAT, the ACT, college admissions, and affirmative action

Meritocratic aims of the SAT, impact on admissions.

*Geiser, S., & Studley, R. (2001). UC and the SAT: Predictive Validity and Differential Impact of the SAT I and SAT II at the University of California. Report prepared for the Office of the President, University of California. (Available online.)

Jencks, C. (1989). If Not Tests, Then What? Conference remarks. In B. R. Gifford (Ed.), Test Policy and Test Performance: Education, Language, and Culture (pp. 115-122). Boston: Kluwer Academic Publishers.

M 3/10 Nature-Nurture

Longstanding controversies over the origins of pervasive "racial" group differences in test scores.

The Mental Age of Americans, et seq. [Lippmann-Terman exchange from The New Republic], 1922-23.

*Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30, 1-14. (Available through E-Journals.)


*Browne, M. W. (1994, October 16). What Is Intelligence and Who Has It? The New York Times Book Review. (Available through LexisNexis. At the top of the page, click the tab for News. In the "Specify Date" field near the bottom of the page, pick "Date Is" from the menu and enter 10/16/94. Enter the title "What Is Intelligence and Who Has It" [with quotes] in the "required terms" field.)

W 3/12 Stereotype vulnerability; course wrap-up

Claude Steele's influential paper documenting one potentially important factor contributing to group differences in test performance.

*Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52(6), 613-629. (Available through E-Journals.)

*McClelland, David C. (1973). Testing for Competence Rather Than for "Intelligence." American Psychologist, 28, 1-14. (Available through E-Journals.)

______

*Not included in course reader. Available online as indicated.

**Not included in course reader and not available online. On reserve in Cubberley Library.
