Dr. Olson’s Comments on Students’ Posts to Forum 4

Forum 4: Reliability and Validity
Rudner (2001), Brualdi (1999), and Popham (in his text) discuss notions of reliability and validity. In your own words, differentiate between the two concepts. In differentiating the two, be sure to discuss what it is that each concept concerns. Then, discuss how reliability and validity are important (or not important) to classroom assessment. Further, Popham (in his text) discusses the problem of bias. In your discussion of reliability and validity, show which is influenced by bias.
Finally, as a school leader, what would you tell your teachers about reliability and validity? Popham and Brualdi suggest that teachers do not need to worry about low reliability (.4 to .6, say). Why? What about classroom tests with reliabilities lower than .4? What should a teacher do if she or he determines that her or his classroom tests have very low reliabilities?

Although I think most of you did a pretty good job of drawing meaning from the readings, it appears to me that there is still a fair amount of confusion about reliability and validity.

Reliability

Reliability = Assessment Consistency.

Consistency within teachers across students.

Consistency within teachers over multiple occasions for students.

Consistency across teachers for the same students.

Consistency across teachers across students.

Consistency within teachers across students. This means, for example, that teachers grade and assess ALL students using the same standards and criteria. If effort is used when grading some students, then the same standard involving effort is used for ALL students. This doesn’t usually happen in the lower grades, where poorer-performing students are often given higher grades when they seem to be working hard.

Consistency within teachers over multiple occasions for students. An example would involve scoring papers. Reliability, here, would require that students’ papers, when scored a second time, would be scored the same way as the first time. Actually, all that is required is that the rank ordering of students on the second scoring is essentially the same as the first time.
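To make the rank-ordering point concrete, here is a minimal sketch that compares two scorings of the same papers with a Spearman rank correlation. The scores are made up for illustration, and the function names (average_ranks, pearson, spearman) are my own, not anything from the readings.

```python
from math import sqrt
from statistics import mean

def average_ranks(scores):
    """Rank scores from 1 (lowest) upward; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions in the run
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(first, second):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(first), average_ranks(second))

# Hypothetical essay scores for six students, scored twice by the same teacher.
first_scoring = [78, 85, 62, 90, 70, 88]
second_scoring = [74, 80, 60, 92, 69, 83]
rho = spearman(first_scoring, second_scoring)
print(round(rho, 2))
```

In this made-up example no paper receives the same score twice, yet the students fall in exactly the same order both times, so the rank correlation is 1. That is the sense in which the second scoring is consistent with the first.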

Consistency across teachers for the same students. This requires that different teachers scoring or assessing the same students all rank, or essentially rank, those students the same way. This most likely does not occur very often—especially when teachers often do not even score their own students the same way on different occasions.

Consistency across teachers across students. Consistency, here, includes a composite of the previous three.

Another form of reliability, one that is not well understood, is internal consistency reliability. Reliability, here, means that the assessment tends to be uni-dimensional, i.e., all the components (or items) of the assessment are measuring the same construct. An instance where this often does not occur is a math problem-solving test. When the vocabulary used is not accessible to some students or, worse, when the problems are confusing or not well written, then the assessment, at least in part, is measuring students’ ability to comprehend the problem. When this happens, internal consistency reliability suffers.

Now, the question: How concerned do teachers need to be with respect to reliability? The authors of the assigned readings (and many others) don’t think reliability should be a big issue for teachers. That’s a good thing. You may have missed it when reading the articles by Stiggins and his colleagues. Stiggins investigated the reliability of teacher-constructed tests and found that the reliabilities were rarely above .25 (and that was for longer tests). When I was in Dallas I did a large study where we asked some 7,000 teachers to submit one of their multiple-choice tests along with their students’ results. I computed internal consistency reliabilities on all the tests. The average reliability was .17! To be sure, there were some tests with much higher reliabilities (as high as .7), but not many. There were also a few tests with negative reliabilities (-.15). Theoretically, a negative reliability is impossible. So, how did this happen? Further investigation showed that when this happened, the answer key was wrong (or students were given incorrect information or misinformation).
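For anyone curious how an internal consistency coefficient is computed, and how a bad answer key can push it below zero, here is a minimal sketch of K-R20 for items scored 0 or 1. The data are invented, the function name kr20 is my own, and I am assuming true/false items so that a wrong key simply reverses the score on an item (with multiple-choice items the effect is similar but not an exact reversal). The sketch uses the population variance of the total scores, which is one common convention.

```python
from statistics import pvariance

def kr20(item_scores):
    """K-R20 for a table of 0/1 item scores: one row per student, one column per item."""
    n_students = len(item_scores)
    n_items = len(item_scores[0])
    total_variance = pvariance([sum(row) for row in item_scores])
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_students  # proportion scored correct
        sum_pq += p * (1 - p)  # variance of a 0/1 item
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

# Hypothetical true/false responses from five students, scored with the correct key.
correctly_keyed = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

# Same responses, but the key for the last two items is wrong,
# so every score on those two items is reversed.
miskeyed = [row[:2] + [1 - x for x in row[2:]] for row in correctly_keyed]

print(round(kr20(correctly_keyed), 2))
print(round(kr20(miskeyed), 2))
```

With these made-up data the coefficient is .80 when the key is correct and about -.57 when two of the four items are miskeyed. The miskeyed items now run against the rest of the test, which is why the computed value drops below zero.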

Anyway, getting back to why reliability (of individual tests) is not a big concern to classroom teachers, the reason is that teachers collect a large amount of information (data points) about their students. One of the major ways to improve reliability is to collect more data points (usually in the form of longer tests). Now if the “score” we are interested in is the grade students get on their report cards, then all this information (data points) generally reflects the consistency with which teachers grade students. In the final analysis, teachers, within themselves, tend to be quite consistent (reliable) in terms of what they include in the grades they give students. On the other hand, what can be said about the inferences that can be drawn from those grades? This is where validity comes into play.
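The effect of collecting more data points can be illustrated with the Spearman-Brown formula, which projects the reliability of a measure that is lengthened by some factor. This is only a rough sketch: the formula assumes the added data points are comparable (parallel) to the original ones, which quizzes, homework, and projects are not strictly, and the function name spearman_brown is my own.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a measure is made length_factor times as long
    by adding comparable (parallel) parts."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Start from the .17 average reliability of a single classroom test mentioned above.
single_test = 0.17
for n_data_points in (1, 5, 10, 20):
    projected = spearman_brown(single_test, n_data_points)
    print(n_data_points, round(projected, 2))
```

Under these assumptions, five comparable data points project to a reliability of about .51, ten to about .67, and twenty to about .80. A grade built from many weak measures can be far more consistent than any one of them.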

Validity

Even though, in more casual language, we often talk about the validity of a test or assessment, we need to remain clear that we are talking about the validity of the inferences we draw based on the scores individuals attain on the assessment.

Most of you seem to have gotten this right, or at least correctly repeated what was given in the readings. Ever since Messick’s seminal 1989 chapter in Educational Measurement, measurement specialists have viewed validity as addressing the problem of inferences based on scores obtained from tests and assessments. The concept of validity is complex. As Brualdi points out, validity requires an argument. Can you argue, for instance, that scores on an assessment correlate (statistically speaking) with scores obtained on some other assessment or with your own knowledge of the students being assessed (criterion-related)? Can you show, on a career aptitude test, for example, that scores obtained at one time are similar to scores obtained by the same individual, on the same test, a month or two later (stability reliability)? Or, do you have evidence that, on an achievement test in History, say, students who have had a History course perform better than students who have not had a History course (construct-related)?

In other words, validity involves an argument and the weight of the evidence in supporting the inferences drawn from assessment scores (note that I generally count a report card grade as a score). When we draw an inference about a student’s level of achievement (skill, competence) based on some assessment (a test, an observation, a performance), then it behooves us to provide, when asked, the evidence to support that inference. Just as a defense attorney presents evidence in support of his or her theory of events, and a prosecutor presents evidence in support of his or her theory, hoping that the jury will be won over, so does the user of an assessment present evidence in support of his or her position that the inference he or she proffers is the correct inference. Since report card grades carry high-stakes consequences, it is important that the evidence be strong. When a teacher gives a student a “B,” what is the inference that a parent, another teacher, or the student draws from this grade? Shouldn’t the inference concern level of achievement or accomplishment? But, what if that grade includes a large effort component? Or, a behavior component? Then how valid will the inference be? You will read more about this in the readings for next week.

The readings this week talked about various types of evidence that can be used to support the validity of inferences: content, criterion, and construct. For classroom teachers (for classroom assessments), content evidence of validity is probably the most important. (I’m surprised no one picked up on this.) Content evidence of validity involves demonstrating that tests or assessments are well articulated with learning targets. For example, if you want to assess a student’s skill at writing a persuasive essay you have the student write a persuasive essay and then score the essay for its persuasiveness—not for its spelling, or grammar. Similarly, if you want to assess a student’s skill at speaking a foreign language, you have the student speak in that language and assess his or her oral performance.

As most of you pointed out, validity is something teachers should be concerned with. When teachers give a student a score on a test (or on a report card), some inference about the student’s level of competence inevitably follows. The questions every teacher should ask are, “What is the inference?” and “Is the inference supported by evidence?”

Most of you realized that bias affects validity. Whenever a test score is, at least in part, the result of some form of bias, the inferences drawn from the score will lack validity.

Now, I would like to address some of your more specific comments.

Keeley was concerned about the alignment of the alternate forms of the ABC tests (I use ABC tests to include both EOG and EOC tests). We will see later in the course that scores on all the forms of a test over a specific area are equated to a common scale. While it is true that one form may be easier or harder than another form in terms of number of items correct, the fact that the forms are equated and placed on a common scale mitigates any problems this might present. All that is required, across the forms, is that they preserve the rank-ordering of students taking the different forms.
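As a rough illustration of what equating does, here is a minimal sketch of linear equating, which places scores from one form on the scale of another by matching the two forms’ means and standard deviations. This is NOT a description of how the ABC tests are actually equated; it is a simple textbook method, it assumes the two forms were taken by comparable groups of students, and the data and the function name linear_equate are hypothetical.

```python
from statistics import mean, pstdev

def linear_equate(score, form_x_scores, form_y_scores):
    """Map a raw score on form X onto the form Y scale by matching means and SDs."""
    mean_x, mean_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    return mean_y + (sd_y / sd_x) * (score - mean_x)

# Hypothetical number-correct scores from comparable groups; form X is the harder form.
form_x = [20, 25, 30, 35, 40]
form_y = [26, 30, 34, 38, 42]

for raw in form_x:
    print(raw, round(linear_equate(raw, form_x, form_y), 1))
```

In this made-up example a raw score of 30 on the harder form corresponds to 34 on the easier one. Because the conversion is a straight line with a positive slope, the rank-ordering of students within a form is left unchanged.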

Stephanie and Elizabeth raised a question about the embedding of tryout items into operational forms of the ABC tests. The only way, really, to try out new items is to embed the items in operational tests. As Olivia pointed out, students are unlikely to take tests (or items) seriously if they know the tests or items will not count. (For an interesting article related to this, see Brown and Walberg (1993) in the Directory of Articles.)

And Jessica, when you stated, “I was intrigued by his suggestion that classroom teachers should be calculating K-R21 coefficients for classroom exams,” I think you need to re-read that section in Popham. Basically, what he is saying is that it would be absurd to expect teachers to be calculating K-R21s for classroom assessments. Also, thanks for the chart you provided for what to look for when selecting an instrument. This leads me to my final comment.

One thing you should all keep in mind: You may be called upon, having had this course, to help your school or district evaluate various assessment instruments (say, for example, a new or different instrument for screening children for special programs). In this situation, how would you go about the task? I would suggest that you first examine the reliability of the instrument. Of the various ways to look at reliability, you would probably want to examine the internal consistency (coefficient alpha or K-R20). Then, if the assessment is an achievement test, you would want it to have a reliability coefficient of around .90 or higher. If the instrument is an aptitude test, then the reliability should be around .80 or higher.