Lecture 2: Reliability and Validity

Reliability

I. Reliability and Test Design

A. Reliability: in general, it is the consistency or stability of the measurements obtained. It is also the extent to which systematic error is eliminated from the assessment process.

Reliability = Consistency of measures (or scores)

B. ______: Error that is due to the test, the testing situation, or other factors that could be eliminated or minimized but are not. This error affects some students more than others, which makes it especially problematic.

  1. Clues in the test that give away correct answers
  1. Extremes of temperature in the testing room
  1. Loud, distracting noises
  1. Questions on the test that do not measure objectives from the instructional unit
  1. Items on the test that are in a format that is unfamiliar to the students

  1. etc. (There are a plethora of ways to go wrong!)

C. Random error: Error that is due to factors that are not in your control. Because this error is random, it probably affects everyone equally in the long run and, as such, is not considered problematic.

  1. Memory (forgetting)
  1. Attention (distractibility)
  1. Fatigue (Rest/exhaustion from testing)
  1. Anxiety (Performance)
  1. etc. (Again, there are more sources than we can list.)

D. Important things to remember about reliability:

  1. Reliability refers to the RESULTS obtained with an instrument and not the instrument itself.
  1. Reliability is not an all or nothing concept; it’s never completely absent or absolutely perfect.
  1. A reliability estimate, called a reliability coefficient, refers to a specific type of reliability.
  1. Reliability is a necessary but not sufficient condition for validity.

E. Reliability in the classroom

1. ______ (Alternate Choice, Multiple Choice, Matching, & Keyed Response)

  1. *Guessing – the fewer the alternatives, the easier it is to guess.
  1. Cheating – sharing answers, stealing from neighbors
  1. Clues in the item – grammar, number, plausibility
  1. Improperly keyed answer sheet

2. ______ (Product/Performance Assessments)

  a. Written responses

    1. Content vs. form: content is the answer itself; form is the written presentation.
    1. Elementary and middle school students – grade content and form separately.
    1. Don’t lead students down the primrose path!
    1. Don’t be misled by entertaining writers (most teachers are!)
    1. Interrater reliability: the degree of agreement between two people scoring the same written response (or performance). It should not matter who scores the response; the score should be about the same regardless of who scored it.
    1. Intrarater reliability: the degree of consistency you exhibit as you score the responses to an item across a class of students. You should use the same criteria in the same way for each response.
    1. Use sufficiently specific objectives.

For example:

Sample objectives for the writing process:

Poor: 1. Students will write a well-organized paragraph about apples using descriptive adjectives.

Good: 2. Students will write a paragraph, beginning with a clearly defined topic sentence about apples, followed by several supportive statements using information learned in the unit, and concluding with a summary statement.

Sample objectives for the content of the response:

Poor: 3. Students will demonstrate their knowledge of apples in a well-crafted paragraph.

Good: 4. Students will demonstrate their knowledge of apples by listing apple-based products and discussing the economic impact of apples.

When our objectives are sufficiently specific, we are able to write questions more clearly and to score responses more reliably.

    1. Rubric: a detailed plan for scoring a written or performance response. A rubric shows how partial credit is earned. It breaks the response into its component parts and then shows how points are awarded in each part.

  b. Oral reports

    1. Content vs. delivery
    1. Young or inexperienced students may not be able to focus on the content and delivery at the same time.
    1. Score active performances more quickly by using checklists. Remember that active performances leave no physical trace to score later!

For example:

Students will give an oral presentation demonstrating their knowledge of apples by telling about three or more apple-based products and discussing at least two examples of the impact apples have on the economy.

| Lists apple-based products | Economic impact |
|---|---|
| ____ None (0) | ____ None (0) |
| ____ 1–2 products (1 point) | ____ One impact (1) |
| ____ 3 or more products (2) | ____ Two or more (2) |

And for more advanced students:

1. Students will give an oral presentation demonstrating their knowledge of apples by telling about three or more apple-based products and discussing at least two examples of the impact apples have on the economy.

2. Students will maintain eye contact with the audience during the presentation.

3. Students will project their voices during the presentation.

4. Students will not read their presentations; however, note cards may be used sparingly.

|  | Always | Sometimes | Never |
|---|---|---|---|
| Maintained eye contact | 5 | 3 | 1 |
| Voice was well projected | 5 | 3 | 1 |
| Presentation was not read | 5 | 3 | 1 |

  c. Group work – In group work, we want to foster both content mastery and group skills.

    1. Authorship: the person(s) responsible for the work. With group work, authorship is uncertain unless we specifically assess it. When authorship is uncertain, we cannot tell which students have achieved the objectives and which have not.
    1. Students who do not participate in group projects will miss opportunities to develop the interpersonal skills that are so important in the workplace and in life.
    1. By using a variety of assessments, we get a more reliable indication of how much each student has achieved and how well each participated in the group’s activities:

      1. Teacher assessments of individual work
      1. Teacher assessment of the group product/performance
      1. Student assessments of individual participation (their own and one another’s)

F. Types of Reliability

1. ______ (a.k.a. Stability Reliability) – the same test is administered twice, to the same person, with no learning (or forgetting) between administrations. If the measurement is reliable, scores should be very similar.

2. ______ – two versions of the same test, measuring the same objectives at the same cognitive levels, are given to the same person. If the forms are reliable, scores from the two forms should be very similar. It should not matter which form of the test a student takes; one form should be like the other and yield scores that are consistent with the other form.

3. Split-Half Reliability – a reliability estimate calculated by administering a test once, dividing it into two parallel halves, and then calculating a form of internal consistency reliability from the two halves.

4. ______ – a measure of the relationship among the test items. It tells us whether the assessment measures one thing or a variety of things.

KR-20 and KR-21 are two formulae used to estimate the internal consistency of standardized tests. They must be at least .85 on a subtest and .90 on the full test. (They range between 0 and 1.)
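
To make these coefficients concrete, here is a minimal computational sketch with invented scores (not data from the lecture): the test-retest or parallel-forms coefficient is estimated as the correlation between two sets of scores, and KR-20 is computed from right/wrong item scores. The function names and numbers are illustrative assumptions only.

```python
# Illustrative sketch with invented data: estimating two reliability coefficients.

def pearson_r(x, y):
    """Correlation between two score lists, e.g. two administrations of the
    same test (test-retest) or scores from two parallel forms."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def kr20(item_scores):
    """KR-20 internal consistency for right/wrong (1/0) items:
    (k / (k - 1)) * (1 - sum(p*q) / total-score variance)."""
    k = len(item_scores[0])                      # number of items
    n = len(item_scores)                         # number of students
    totals = [sum(row) for row in item_scores]   # each student's total score
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n   # proportion correct on item j
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_t)

# Five students, two administrations of the same test (hypothetical scores).
first = [18, 22, 25, 30, 15]
second = [17, 23, 24, 29, 16]
print(round(pearson_r(first, second), 2))   # close to 1.0 -> stable scores

# Five students, four right/wrong items (hypothetical scores).
items = [[1, 1, 0, 1],
         [1, 0, 0, 1],
         [1, 1, 1, 1],
         [0, 0, 0, 1],
         [1, 1, 1, 0]]
print(round(kr20(items), 2))
```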

G. Test Design Considerations to improve reliability:

  1. Select a representative sample of objectives. If using multiple forms, you want the instruments to be ______; sample the same objectives at the same level of difficulty.
  1. Select enough items to represent each objective. One item may not accurately measure a student’s achievement of that objective, and more items could produce different results.
  1. Select item formats that are familiar to the students and that reduce guessing.
  1. Give ample time to complete all items. Hurrying creates anxiety and carelessness, which decrease scores.
  1. Create positive student attitudes toward testing. Attitudes affect motivation; inform students of the date, time, and purpose. Never test as punishment.

Validity

I. Validity – in general, this refers to the appropriateness of score-based inferences, or decisions made based on students’ assessment results. It is the extent to which an assessment (such as a test) measures what it’s supposed to measure.

Problems arise when tests measure things other than what we intend:

  1. Assessments that measure objectives not taken from the instructional unit
  1. Assessments that are too easy or too difficult for our students
  1. Assessments that measure extraneous content, e.g. when our students’ language achievement confounds our measurement of their math achievement on particular word problems or when their writing ability confounds our measurement of their achievement of science content.
  1. Assessments that are poorly constructed.

Important things to remember about validity

Validity refers to the ______, and not to the assessment itself or to the measurement.

Like reliability, validity is not an all or nothing concept; it is never totally absent or absolutely perfect.

A validity estimate, called a validity coefficient, refers to a specific type of validity. (It ranges between 0 and +1)

Validity can never be finally determined; it is specific to each administration of the assessment.

Types of Validity

*1. ______ – Refers to the relationship between a test and the instructional objectives. It is the extent to which the instructional objectives are measured by the assessment. Think of it as the match, or alignment, between instruction and assessment. Testing what you teach.

  1. The evidence of the content validity of your assessments is found in a Table of Specifications.
  1. This is the most important type of validity to you, as a classroom teacher.
  1. There is no coefficient for content validity. It is determined judgmentally, not empirically.

______: A chart that

  1. Specifies each instructional objective chosen for assessment
  1. Identifies the cognitive, affective, or psychomotor level of each objective
  1. Tells where the objective is measured (test item number or assessment type)
  1. Shows how many points each objective is worth, overall

For example:

A “Preliminary” Table of Specifications Based on Bloom’s Taxonomy:

| Objectives: The ED 317 student will be able to: | Level | # of items* | Total points |
|---|---|---|---|
| 1. Define key assessment terms | Know | 10 | 20 |
| 2. Describe in their own words different kinds of reliability and validity | Comp | 2 | 6 |
| 3. Differentiate between norm- and criterion-referenced assessments | App | 2 | 6 |
| 4. Compute the mean, median, and mode for a set of scores | App | 1 | 4 |
| 5. Compute the range, variance, and standard deviation for a set of scores | App | 1 | 5 |
| 6. Discuss the relationship between reliability and validity | Anal | 1 | 5 |
| *7. Develop an assessment for an original unit | Syn | 1 project | 75 |
| *8. Critique a classmate’s assessment unit | Eval | 1 UFE | 10 |
| Totals |  | 19 | 131 |

*Items 7 and 8 would not be given on the unit test; they would be separate assessments.
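
As a rough illustration (not part of the original handout), a table of specifications can also be kept as simple structured data so that the item counts and point totals can be checked automatically. The structure below mirrors the sample table above; the field layout is an assumption made for demonstration.

```python
# Hypothetical sketch: the sample Table of Specifications above, stored as data
# so its totals can be verified and its emphasis by cognitive level summarized.

table_of_specifications = [
    # (objective, cognitive level, number of items, total points)
    ("Define key assessment terms",                               "Know", 10, 20),
    ("Describe different kinds of reliability and validity",      "Comp",  2,  6),
    ("Differentiate norm- and criterion-referenced assessments",  "App",   2,  6),
    ("Compute the mean, median, and mode",                        "App",   1,  4),
    ("Compute the range, variance, and standard deviation",       "App",   1,  5),
    ("Discuss the relationship of reliability and validity",      "Anal",  1,  5),
    ("Develop an assessment for an original unit",                "Syn",   1, 75),
    ("Critique a classmate's assessment unit",                    "Eval",  1, 10),
]

total_items = sum(items for _, _, items, _ in table_of_specifications)
total_points = sum(points for _, _, _, points in table_of_specifications)
print(total_items, total_points)   # should match the Totals row: 19 and 131

# Points per cognitive level show where the assessment places its weight.
points_by_level = {}
for _, level, _, points in table_of_specifications:
    points_by_level[level] = points_by_level.get(level, 0) + points
print(points_by_level)
```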

2. ______ – the extent to which scores from an assessment relate to theoretically similar measures. For example, Iowa Test of Basic Skills (ITBS) scores should indicate similar levels of performance as California Achievement Test (CAT) scores, and classroom reading grades should indicate similar levels of performance as standardized reading test scores. There are two types of criterion-related validity:
  a. ______: The decisions we make are about the present; the scores are related to present performance and current status.

Follow this logical trail:

  1. An ITBS test tells how much students have achieved to this point in time.
  1. Partly based on those scores, we could make placement decisions about the students.
  1. We could then verify those placement decisions by having the same students take the CAT.
  1. To the extent that the results of these two tests give us similar indications of achievement (to this point in time), we have evidence of concurrent criterion-related validity for the decisions we made based on the ITBS scores.
  b. ______: The decisions we make are about the future; the scores are related to possible future performance. Current performance is used to predict possible future status. (A short numeric sketch appears at the end of this list of validity types.)
  1. College placement exams are routinely used to predict students’ likely success in college. (Scores are related to 1st semester G.P.A.)
  1. Intelligence (aptitude) tests, such as the Wechsler Intelligence Scale for Children-Revised (WISC-R), are routinely used in education to predict children’s likely success in school during the upcoming year.
3. Construct Validity – A construct is a hypothetical and unobservable variable or quality, such as intelligence, math achievement, or performance anxiety. These are characteristics that we assume exist in order to explain some aspect of behavior. Most of what we measure in education falls under this heading. Because we cannot directly observe these characteristics, we must verify that we are measuring what we think we are measuring! This type of validity gets right to the heart of the matter, asking whether or not the test is measuring the construct that we think it is measuring.

There are three basic approaches to collecting empirical evidence for this type of validity.

  1. ______: One group of people is given two tests: the new test and a test that has been shown to be useful for making valid decisions about the construct, meaning that it measures what we are trying to measure with the new test. The scores from the new test are compared to scores from the established test. If the scores are similar, we have evidence for the construct validity of inferences made using the new test.
  1. ______: Two groups of people who we know differ with regard to the construct (e.g., English proficiency) take the new test. If the new test is measuring what we intend it to measure, it should be able to differentiate between the two groups: proficient speakers should score significantly differently from non-proficient speakers. If this happens, we have evidence for the construct validity of inferences made using the new test.
  1. ______: A group of people who show a low level of performance on the construct are given the new test as a pretest. The group is then randomly divided into control and experimental groups. The experimental group is given an intervention, such as instruction related to the construct, and then both groups are re-tested using the new test. If the new test is measuring what we intend it to measure, it should show an increase in performance for the experimental group, the group that was changed by the intervention.
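
A validity coefficient for the criterion-related types above is typically just a correlation: for predictive validity, between current scores (e.g., a placement exam) and a later criterion (e.g., first-semester GPA); the same arithmetic underlies the first construct-validity approach (comparing a new test with an established one). The sketch below uses invented numbers purely for illustration and also shows the least-squares line that would turn a new score into a predicted GPA.

```python
# Hypothetical sketch: a predictive validity coefficient (correlation between
# placement-exam scores and later first-semester GPA), plus the least-squares
# line used to predict GPA from a new score. All numbers are invented.

exam_scores = [450, 520, 580, 610, 700]     # current performance (predictor)
first_sem_gpa = [2.1, 2.6, 2.9, 3.1, 3.6]   # later status (criterion)

n = len(exam_scores)
mean_x = sum(exam_scores) / n
mean_y = sum(first_sem_gpa) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(exam_scores, first_sem_gpa))
var_x = sum((x - mean_x) ** 2 for x in exam_scores)
var_y = sum((y - mean_y) ** 2 for y in first_sem_gpa)

validity_coefficient = cov / (var_x * var_y) ** 0.5
print(round(validity_coefficient, 2))        # closer to +1 -> stronger evidence

# Prediction line: predicted GPA = intercept + slope * exam score
slope = cov / var_x
intercept = mean_y - slope * mean_x
print(round(intercept + slope * 560, 2))     # predicted GPA for a score of 560
```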

Test Design Considerations to improve validity:

1. What is the purpose of the test?

2. How well do the instructional objectives selected for the test represent the instructional goals?

3. Which test item format (type of item) will best measure achievement of each objective?

4. How many test items will be required to measure performance adequately on each objective?

5. When and how will the test be administered?

The Relationship between Reliability and Validity

We have said that Reliability is necessary, but not sufficient, for validity. This means that . . .

Other Factors to Consider When Designing & Administering Assessments:

  1. Unclear directions
  1. Reading vocabulary too difficult
  1. Complicated Syntax
  1. Ambiguous items
  1. Inadequate time limits
  1. Inappropriate level of difficulty of the test items
  1. Poorly constructed test items
  1. Improper arrangement of items
  1. Unintended Clues

"My Summary of Desirable Features for Enhancing Validity and Reliability*

Desired featuresProcedures to follow

1. Clearly specified set of learning1. State intended learning outcomes

outcomes. in performance terms.

2. Representative sample of a2. Prepare a description of the

clearly defined domain of achievement domain to be assessed

learning tasks. and the sample of tasks to be used.

3. Tasks that are relevant to the3. Match assessment tasks to the

learning outcomes to be measured. specified performance stated in the

learning outcomes.

4. Tasks that are at the proper4. Match assessment task difficulty to

level of difficulty. the learning task, the students'

abilities, and the use to be made of

results.

5. Tasks that function effectively5. Follow general guidelines and

in distinguishing between specific rules for preparing

achievers and nonachievers. assessment procedures and be alert

for factors that distort the results.

6. Sufficient number of tasks to6. Where the students' age or available

measure an adequate sample assessment time limit the number of

of achievement, provide tasks, make tentative interpretations,

dependable results, and allow assess more frequently, and verify

for a meaningful interpretation the results with other evidence.

of the results.

7. Procedures that contribute to7. Write clear directions and arrange

efficient preparation and use. procedures for ease of administration scoring or judging, and interpretation"

*Gronlund, Norman E. (2006). Assessment of Student Achievement, p. 26.
