Chapter 5: Reliability Concepts

Definition of Reliability

– Test consistency

Classical Test Theory

– X = T + E

– where X = Obtained Score, T = True Score, E = Random Error

Estimation of Reliability

– Correlation coefficients (r) are used to estimate reliability

– The proportion of variance attributable to individual differences

– Directly interpretable

– Reliability of .90 accounts for what % of the variance? (90%: the coefficient is read directly as a proportion of variance, not squared)

Conceptual True Score Variance

– Solve for true score variance: T = X − E

– Convert the formula to variances: σ²T = σ²X − σ²E

– Divide through by obtained score variance: σ²T/σ²X = σ²X/σ²X − σ²E/σ²X

– Substitute 1 for the ratio of obtained score variance to itself

– Reliability is the ratio of error variance to obtained score variance subtracted from 1: rxx = 1 − σ²E/σ²X
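The same derivation in compact form (a sketch, assuming the standard CTT condition that true scores and errors are uncorrelated):

```latex
% Assuming Cov(T, E) = 0 (the standard CTT assumption):
X = T + E \quad\Rightarrow\quad \sigma^2_X = \sigma^2_T + \sigma^2_E
% Reliability = proportion of obtained-score variance
% that is true-score variance:
r_{xx} \;=\; \frac{\sigma^2_T}{\sigma^2_X}
       \;=\; \frac{\sigma^2_X - \sigma^2_E}{\sigma^2_X}
       \;=\; 1 - \frac{\sigma^2_E}{\sigma^2_X}
```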

Types of Reliability

– Test-retest

– Alternate Forms

– Split-Half

– Inter-item consistency

– Interscorer

Test-Retest Reliability

– Coefficient of stability is the correlation of the two sets of test scores

Alternate (Equivalent) Forms Reliability

– Coefficient of equivalence is the correlation of the two sets of test scores

Split-half Reliability

– Coefficient of internal consistency is the correlation between scores on the two equal halves of the test

– Reliability tends to decrease when test length decreases

Spearman-Brown Formula

– A correction estimate: the S-B formula adjusts a reliability estimate for a change in test length

– It uses the length ratio n:

n = # new items / # original items

– Estimated new reliability: r_new = n · r_old / (1 + (n − 1) · r_old)

Spearman-Brown Example

– Reducing the test length reduces reliability

– What is the new estimated reliability for a 100-item test with a reliability of .90 that is
reduced to 50 items?
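A minimal sketch of that computation in code (the helper name spearman_brown is mine, not from the text):

```python
def spearman_brown(r_original: float, n: float) -> float:
    """Spearman-Brown prophecy formula.

    r_original: reliability of the original test
    n: ratio of new length to original length
       (# new items / # original items)
    """
    return (n * r_original) / (1 + (n - 1) * r_original)

# The slide's example: a 100-item test with r = .90 cut to 50 items,
# so n = 50 / 100 = 0.5.
print(round(spearman_brown(0.90, 50 / 100), 3))  # 0.818
```

So the shortened test's estimated reliability drops from .90 to about .82.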

Inter-item Consistency

– The degree to which test items correlate with each other

– Two special formulas estimate the average of all possible split-half correlations of a test

– a) Kuder-Richardson 20

– b) Coefficient Alpha
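A minimal sketch of coefficient alpha with NumPy, on hypothetical item scores; KR-20 is the same computation specialized to dichotomous (0/1) items:

```python
import numpy as np

def coefficient_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha: (k/(k-1)) * (1 - sum(item variances)/total variance)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: five examinees answering four items scored 0-3.
data = np.array([
    [3, 2, 3, 3],
    [2, 2, 1, 2],
    [1, 0, 1, 1],
    [3, 3, 2, 3],
    [0, 1, 0, 1],
])
print(round(coefficient_alpha(data), 3))  # 0.945 -- highly consistent items
```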

Inter-scorer reliability

– Tests (or performances) are scored by two independent judges and the two sets of scores are correlated (see the sketch below)

– To what are score fluctuations attributed?
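A minimal sketch of that correlation, with hypothetical ratings:

```python
import numpy as np

# Hypothetical data: the same six performances scored by two judges.
judge_a = np.array([85, 78, 92, 70, 88, 64])
judge_b = np.array([83, 80, 90, 72, 85, 66])

# Inter-scorer reliability as the Pearson correlation between judges.
r = np.corrcoef(judge_a, judge_b)[0, 1]
print(round(r, 3))  # about 0.99 -- the judges largely agree
```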

Possible Sources of Error Variance

– Sources of error associated with differences (fluctuations) in test scores

– Time Sampling

– Item Sampling

– Inter-Scorer Differences

– Time sampling

– Conditions associated with administering a test on two different occasions

– Item Sampling

– Conditions associated with item content

– Content heterogeneity v. homogeneity

Inter-scorer differences

– Error associated with differences among raters

Factors affecting the reliability coefficient

– Test length

– The greater the number of reliable test items, the higher the reliability coefficient

– A longer test increases the probability of obtaining reliable items that accurately measure the behavior domain

– Heterogeneity of scores

– Item Difficulty

– Speeded Tests (Timed tests)

– Scores are based on speed of work, not consistency of the test content

– The coefficient therefore reflects consistency of speed, not of performance

– Test situation

– Conditions associated with test administration

– Examinee-related

– Conditions associated with the test taker

– Examiner-related

– Conditions associated with scoring and interpretation

– Stability of the construct

– Dynamic v. stable variables

– Stable variables are measured more reliably

– Homogeneity of the items

– The more homogeneous the items, the higher the reliability

Interpreting Reliability

– A test is never perfectly reliable

– Interpreting an individual test score requires a method that takes random error into account

– We may never obtain a test-taker’s true score

• Standard Error of Measurement (SEM)

• Provides an index of test measurement error

• The SEM is interpreted like a standard deviation on a normal distribution of obtained scores around the true score

• SEM is used to estimate true score by constructing a range (confidence interval) within which the examinee's true score is likely to fall given the obtained score

– SEM Formula

• SEM = St · √(1 − rtt)

• St is the standard deviation of the test scores

• rtt is the reliability coefficient

– SEM Example

– For example, X = 86, St = 10, rtt = .84

– What is the SEM?

– What is the Confidence Interval (CI)?

– Within ±1 SEM of the obtained score, there is a 68% chance that the interval contains the true score

– ±2 SEMs ≈ 95%

– ±3 SEMs ≈ 99%
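A worked sketch of the slide's example in code (the helper name sem is mine):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r_tt)."""
    return sd * math.sqrt(1 - reliability)

# Slide example: X = 86, St = 10, rtt = .84
x, sd, rtt = 86, 10, 0.84
s = sem(sd, rtt)  # 10 * sqrt(0.16) = 4.0
for z, pct in [(1, 68), (2, 95), (3, 99)]:
    print(f"{pct}% CI: {x - z * s:.0f} to {x + z * s:.0f}")
# 68% CI: 82 to 90
# 95% CI: 78 to 94
# 99% CI: 74 to 98
```

So the SEM is 4, and the 68% confidence interval for the true score runs from 82 to 90.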

– Generalizability Theory

– Extension of Classical test theory

– Based on domain sampling theory (Tryon, 1957)

– Classical Theory emphasizes test error

– Generalizability Theory emphasizes test circumstances, conditions, and content

– The test score itself is considered relatively stable

– Estimates sources of error that contribute to test scores

– Score variability is attributed to variables (sources of error) in the test situation

– Importance of Reliability

– Estimates accuracy/consistency of a test

– Recognizes that error plays a role in testing

– Understanding reliability helps a test administrator decide which test to use

– Strong reliability contributes to the validity of a test