Classical Test Theory

Classical Test Theory & Reliability Notes

Psychometrics

Psychometrics is concerned with the measurement of psychological attributes (personality, abilities, attitudes, etc.)

Aim is often to study differences in individuals/groups

Crucial is that measurement is accurate and reliable

Understanding of statistical theory can help us:
construct reliable test measures
assess reliability (and other properties) using formal tests based on statistical theory (e.g. Cronbach’s alpha)

Statistical theory in psychometrics

Classical Test Theory (CTT)

Evolved in the early 1900s from the work of Spearman on the measurement of individual differences in mental abilities

Used in psychometric research for over a century

Basic principle: X = T + E

observed score = true score + error

Item Response Theory (IRT) – not covered in this lecture

Modern approach to reliability assessment (also Rasch model)
Certain advantages of IRT approach over CTT

Also relative advantages of CTT inc. valid with smaller representative samples, relatively straightforward maths & available in SPSS!

Fan (1998) even suggests that there is little empirical difference between CTT & IRT

Basic Facts and Notation

N = number of observations in the sample

$x_{obs}$ = observed value, and $x_i$ = observed value for the ith subject

$\Sigma$ = ‘the sum of’

Population statistics:

$\mu$ = population mean, $\sigma$ = population standard deviation, $\sigma^2$ = population variance

Sample statistics:

$\bar{x}$ = sample mean, $s_x$ = sample standard deviation, $s_x^2$ = sample variance

Note that:

$\mathrm{Var}(x_{obs}) = s^2 = \frac{\sum_i (x_i - \bar{x})^2}{N - 1}$
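
As a quick check of the formula above, the following minimal sketch computes the sample variance by hand and compares it with Python's `statistics.variance`, which also divides by N − 1. The scores are made up purely for illustration.

```python
import statistics

def sample_variance(xs):
    """Sample variance: squared deviations from the mean, summed, divided by N - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

# Made-up scores, purely illustrative
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s2 = sample_variance(scores)
print(s2)                           # hand-rolled version
print(statistics.variance(scores))  # library version, same N - 1 denominator
```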

Distributions

D(Q) denotes the distribution of values of a quantity Q

G (a,b) denotes a normal (Gaussian) distribution with mean of a and a standard deviation of b

For example, D(IQ)=G(100,15)

$G_n(0, 1)$ is the nth value drawn from a standard normal distribution

$\varepsilon$ = a random value from a normally distributed error variable, with a mean of zero and a variance of $\sigma_{err}^2$

Expected values

Exp{Q} = expected value of a quantity Q.

In a large sample of measures of Q, this value will be approximated by the average value of Q
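
A small simulation can illustrate this. The sketch below draws values from G(100, 15) (the IQ example above) and shows the sample average settling near the expected value of 100 as the sample grows. The sample sizes and the seed are arbitrary choices.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

def average_of_draws(n, mean=100.0, sd=15.0):
    """Average of n values drawn from a normal distribution G(mean, sd)."""
    return sum(random.gauss(mean, sd) for _ in range(n)) / n

# Larger samples should give averages closer to the expected value
for n in (10, 1000, 100000):
    print(n, round(average_of_draws(n), 2))
```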

Assumptions of Classical Test Theory

CTT starts with the notion of the true value of a variable, e.g. $x_{true}$. CTT assumes that the true values of variable x in a population of interest follow a normal (or ‘Gaussian’) distribution. Let us denote the population mean by $\mu$ and the population standard deviation by $\sigma_{true}$. Using the notation introduced above, we can therefore write that the distribution of the true value for a population of participants will be:

$D(x_{true}) = G_1(\mu, \sigma_{true})$ … (1)

The population values ($\mu$ and $\sigma_{true}$, also called parameters) are different from those which we measure in a sample ($\bar{x}$ and $s$) due to sampling error. Classical test theory is concerned with how the measured (i.e. observed) values of x will be related to the true values $x_{true}$. CTT proposes that the observed values are a combination of the true values plus a random measurement error component. By stressing random error, CTT is making three assumptions about the error component:

(i) The error component will have zero mean and so the observed mean will not be systematically distorted away from the true value by the error (this contrasts with a systematic bias effect which would distort the observed mean away from its true value).

(ii) The measurement errors are assumed to follow a normal distribution.

(iii) The measurement errors are uncorrelated with the true values.

Therefore, according to CTT, we can write the following expression for the distribution of xobs:

$D(x_{obs}) = D(x_{true}) + G_2(0, \sigma_{err})$ … (2)

where $\sigma_{err}$ is the standard deviation of the normal random error term. For an individual (ith) participant we could also write the following expression for their observed score on variable x:

$x_i = x_{i,true} + \varepsilon_{2i}$ … (3)

where $x_{i,true}$ denotes the value of $x_{true}$ for participant i, drawn from the true value distribution $G_1(\mu, \sigma_{true})$, and $\varepsilon_{2i}$ denotes the error term for the ith participant, drawn from the error distribution $G_2(0, \sigma_{err})$.

It follows from these assumptions that the expected value of the sample mean, $\mathrm{Exp}\{\bar{x}\}$, is $\mu$. Also, the sample standard deviation of observed x is going to be larger than $\sigma_{true}$, as the random error component (with a standard deviation of $\sigma_{err}$) increases the variation in $x_{obs}$.

In fact, it is easy to work out the expected sample variance exactly. Imagine one has two variables a and b, and a variable c which is the sum of a and b (i.e., a + b). The variance of the new variable c is given by:

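$\mathrm{Var}(c) = \mathrm{Var}(a) + \mathrm{Var}(b) + 2 r_{ab} \sqrt{\mathrm{Var}(a) \cdot \mathrm{Var}(b)}$ … (4)

or $\sigma_c^2 = \sigma_a^2 + \sigma_b^2 + 2 r_{ab} \sqrt{\sigma_a^2 \sigma_b^2}$

where $r_{ab}$ is the correlation between a and b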
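
Formula (4) is an exact identity for sample statistics, which can be checked numerically. The sketch below is illustrative only: the variables a and b are simulated, with b deliberately constructed to correlate with a, and the helper functions are written out rather than taken from any library.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance with an N - 1 denominator."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / math.sqrt(variance(xs) * variance(ys))

random.seed(4)  # arbitrary seed
a = [random.gauss(0, 2) for _ in range(1000)]
b = [0.5 * ai + random.gauss(0, 1) for ai in a]  # b is built to correlate with a
c = [ai + bi for ai, bi in zip(a, b)]

lhs = variance(c)
rhs = variance(a) + variance(b) + 2 * correlation(a, b) * math.sqrt(variance(a) * variance(b))
print(lhs, rhs)  # the two values agree up to floating-point rounding
```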

We can use the general formula given in (4) to evaluate the expected sample variance according to CTT, as the observed values are the sum of the true values and random measurement error. Using ‘true’ and ‘error’ instead of ‘a’ and ‘b’, the expected value of the sample variance is just the sum of the variances of the true score and error terms:

$\mathrm{Exp}\{s^2\} = \sigma_{true}^2 + \sigma_{err}^2$ … (5)

The last term in formula 4 disappears because we are assuming that there is zero correlation between true & error values (assumption iii).
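
Expression (5) can be illustrated by simulation. In the sketch below the parameter values (mean 100, true SD 15, error SD 5) and the sample size are assumptions chosen for illustration, not values from these notes.

```python
import random

random.seed(7)  # arbitrary seed
MU, SD_TRUE, SD_ERR = 100.0, 15.0, 5.0  # illustrative values only
N = 100000

def variance(xs):
    """Sample variance with an N - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

true_scores = [random.gauss(MU, SD_TRUE) for _ in range(N)]
# The error is drawn independently of the true score (assumption iii)
observed = [t + random.gauss(0.0, SD_ERR) for t in true_scores]

expected = SD_TRUE ** 2 + SD_ERR ** 2  # expression (5): 225 + 25
print(variance(true_scores))           # close to the true-score variance
print(variance(observed), expected)    # observed variance is inflated by the error
```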

Correlations Between Measures

Assume two measures (x and y) index the same psychological process, p. Following CTT, assume further that p is normally distributed in the population of interest with a mean $\mu_p$ and standard deviation $\sigma_p$. We can write the following expressions for the distributions of $x_{obs}$ and $y_{obs}$:

$D(x) = G_1(\mu_p, \sigma_p) + G_2(0, \sigma_{err,x})$ … (6)

$D(y) = G_1(\mu_p, \sigma_p) + G_3(0, \sigma_{err,y})$

The subscripts on the normal distribution terms (i.e. Gn) are used to indicate that the error distributions affecting x and y are independent of one another (different subscript), whereas the measures contain an identical true value for the same subject (drawn from the same distribution, indicated by a common subscript).

The correlation between x and y is denoted by rxy and is defined thus:

$r_{xy} = \frac{\mathrm{Covar}(x, y)}{\sqrt{\sigma_x^2 \sigma_y^2}}$ … (7)

where Covar(x, y) is the covariance between x and y. $r_{xy}$ expresses the covariance relative to the geometric mean of the two variances (and thus r cannot exceed 1). The more x and y measure p (i.e. the smaller the error), the greater the value of $r_{xy}$.

From equation (5) above, the expected values of sample variance of x and y can be expressed as:

$\mathrm{Exp}\{s_x^2\} = \sigma_p^2 + \sigma_{err,x}^2$ … (8)

$\mathrm{Exp}\{s_y^2\} = \sigma_p^2 + \sigma_{err,y}^2$

The covariance (shared variance) between x and y is $\sigma_p^2$. If we substitute the expected variances of x and y given above (8) into equation 7, we can obtain the following result for the expected value of the correlation:

$\mathrm{Exp}\{r_{xy}\} = \frac{\sigma_p^2}{\sqrt{(\sigma_p^2 + \sigma_{err,x}^2)(\sigma_p^2 + \sigma_{err,y}^2)}}$ … (9)

Reliability of a Measure

The reliability of a measure is defined as the proportion of variance in the measure that is due to the construct being measured (rather than measurement error). Thus, according to CTT reliability is defined as:

Reliability = true2 / (true2 + err2)… (10)

We can show, using CTT, that a test-retest correlation for a particular measure will give us an estimate of reliability. A test-retest correlation uses a measure taken at two separate times (denoted 1 and 2). The test-retest correlation for x can be written as $r_{12}$. We can write the following expressions (from formula 6):

$D(x_{obs1}) = G_1(\mu, \sigma_{true}) + G_2(0, \sigma_{err,1})$ … (11)

$D(x_{obs2}) = G_1(\mu, \sigma_{true}) + G_3(0, \sigma_{err,2})$

where we have assumed that the error at time 2 is independent of the error at time 1. However, we have no reason to suppose that the amount of measurement error variance will differ at each timepoint. Therefore, we assume that $\sigma_{err,1} = \sigma_{err,2} = \sigma_{err}$. Using this assumption, plus the expressions given in (9) and (11), the expected value of the test-retest correlation has the following value:

$\mathrm{Exp}\{r_{12}\} = \frac{\sigma_{true}^2}{\sigma_{true}^2 + \sigma_{err}^2}$ … (12)

which is the same as the CTT definition of reliability (10, above).
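
A simulated test-retest study shows the same thing numerically. In this sketch the true SD of 15 and error SD of 5 are assumed values, chosen so that the reliability from (10) is 225 / 250. The same participants are "measured" twice with independent errors.

```python
import math
import random

def reliability(var_true, var_err):
    """Expression (10): true-score variance as a proportion of total variance."""
    return var_true / (var_true + var_err)

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(21)  # arbitrary seed
SD_TRUE, SD_ERR = 15.0, 5.0  # illustrative values only
N = 100000

true_scores = [random.gauss(100.0, SD_TRUE) for _ in range(N)]
time1 = [t + random.gauss(0.0, SD_ERR) for t in true_scores]  # first testing
time2 = [t + random.gauss(0.0, SD_ERR) for t in true_scores]  # retest, fresh error

r12 = pearson_r(time1, time2)
print(r12, reliability(SD_TRUE ** 2, SD_ERR ** 2))
```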

Summing Two Measures to Increase Reliability

We can use CTT to show that summing measures of the same construct increases reliability. Let’s return to our two measures x and y of process p that we discussed previously. Let’s define our combined measure, $C_{obs} = 0.5(x_{obs} + y_{obs})$. Using the expressions in (6), defined above, we can write:

$D(C_{obs}) = G_1(\mu_p, \sigma_p) + 0.5 \cdot G_2(0, \sigma_{err,x}) + 0.5 \cdot G_3(0, \sigma_{err,y})$ … (13)

For simplicity let us assume that both $x_{obs}$ and $y_{obs}$ are equally reliable measures (i.e., same measurement error). Therefore, we can write $\sigma_{err,x} = \sigma_{err,y} = \sigma_{err}$. Also, it follows from the definition of standard deviation that n*G(m, s) = G(n*m, n*s); that is, if you multiply the entire set of scores by a number n, then the standard deviation (and the mean) become n times bigger (for example, the scores −1, 0, 1 multiplied by 0.5 become −0.5, 0, 0.5). We can therefore rewrite expression (13) thus:

$D(C_{obs}) = G_1(\mu_p, \sigma_p) + G_2(0, 0.5 \sigma_{err}) + G_3(0, 0.5 \sigma_{err})$ … (14)

Using our result for the sum of two variables (expression 4 above), we can combine the two uncorrelated error terms of expression (14) into a single random error term, thus:

Cobs = G1(p, p) + G4(0, √0.5*err) … (15)

From the definition of reliability given earlier (10), we note that the reliability of the combined measure is p2 / (p2 + 0.5*err2). However, from previous expressions we can see that either measure alone has a reliability of p2 / (p2 + err2).It is thus apparent that the combined measure is more reliable. In fact, the absolute amount of error variance in the combinedmeasure is halved.
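
The gain from averaging can be put into numbers. The sketch below evaluates the reliability of an average of k equally reliable measures and checks by simulation that averaging two independent errors halves the error variance. The generalisation to k measures is an extension of the two-measure case above, not something stated in these notes, and all numeric values are illustrative.

```python
import random

def reliability_of_average(var_p, var_err, k):
    """Reliability of the average of k equally reliable measures of p.

    Averaging k independent errors divides the error variance by k;
    k = 2 gives the 0.5 * var_err term derived above.
    """
    return var_p / (var_p + var_err / k)

print(reliability_of_average(1.0, 1.0, 1))  # a single measure
print(reliability_of_average(1.0, 1.0, 2))  # the average of two measures

# Simulation check: the error variance of the two-measure average is halved
random.seed(3)  # arbitrary seed
SD_ERR = 2.0    # illustrative value only
combined_err = [0.5 * (random.gauss(0.0, SD_ERR) + random.gauss(0.0, SD_ERR))
                for _ in range(100000)]
m = sum(combined_err) / len(combined_err)
var_combined = sum((e - m) ** 2 for e in combined_err) / (len(combined_err) - 1)
print(var_combined, 0.5 * SD_ERR ** 2)
```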

Cronbach’s Alpha ($\alpha$) as a measure of internal consistency

In the previous section we combined two variables by averaging them. We can extend this by combining more than 2 variables (either by summing or averaging, it doesn’t matter which). Let us assume we have a scale that is made up of n items and the observed scores on each item are given by $x_j$. Let the scale total score (formed by adding all the items) be described as $x_{total}$ (i.e., $x_{total} = \sum_j x_j$). Cronbach’s $\alpha$ is a measure of reliability for such a scale. It is intended to reflect the internal consistency of the scale, i.e. the extent to which all the items tend to measure the same construct (process p), and it is defined thus:

$\alpha = \frac{n}{n - 1} \left[ 1 - \frac{\sum_j \sigma_{x_j}^2}{\sigma_{x_{total}}^2} \right]$ … (16)

Using CTT we can assume that the distribution of scores on each item can be represented by the following kind of expression:

$D(x_j) = G(\mu_p, \sigma_p) + G_j(0, \sigma_{err,j})$ … (17)

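where $G_j$ relates to the unique error associated with each item. It is easy to show, using CTT and expressions like (17), that the value of $\alpha$ reduces to the following expression (the total score has expected variance $n^2 \sigma_p^2 + \sum_j \sigma_{err,j}^2$, while the item variances sum to $n \sigma_p^2 + \sum_j \sigma_{err,j}^2$):

$\alpha = \frac{n^2 \sigma_p^2}{n^2 \sigma_p^2 + \sum_j \sigma_{err,j}^2}$ … (18)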

Expression (18) reveals why $\alpha$ measures internal consistency. For example, if the scale has only two items (n = 2) and the error variance for each item is the same (i.e., $\sigma_{err,1} = \sigma_{err,2} = \sigma_{err}$) then expression (18) reduces to $\sigma_p^2 / (\sigma_p^2 + 0.5 \sigma_{err}^2)$. This was shown earlier to be the reliability of a combined (average) score derived from two variables (see expression 15).

Note also that $\alpha$ increases with the number of items in the scale: the true-score term in expression (18) grows with $n^2$, whereas the summed error variance grows only in proportion to n. Imagine that, for every item on the scale, the true and error variances are the same (i.e., $\sigma_p^2 = \sigma_{err}^2$). The reliability of each item is thus 0.5. If you have two items (n = 2) then the internal consistency, $\alpha$, is 0.667. If you have 10 items (n = 10) then $\alpha$ is 0.909. So the 2-item scale and the 10-item scale differ numerically in the value of $\alpha$ even though both scales are made up of individual items of identical reliability. Thus, an impressive-seeming value of $\alpha$ (e.g., 0.95) may mainly reflect the fact that you have a scale with many items.
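
The figures quoted above follow directly from expression (18). The sketch below evaluates it for scales of different lengths in which every item has equal true and error variance; `alpha_from_ctt` is a helper name introduced here for illustration.

```python
def alpha_from_ctt(var_p, err_vars):
    """Expression (18): alpha from the true variance and the per-item error variances."""
    n = len(err_vars)
    return (n ** 2 * var_p) / (n ** 2 * var_p + sum(err_vars))

# Every item has true variance 1 and error variance 1 (item reliability 0.5)
for n in (2, 5, 10, 20, 50):
    print(n, round(alpha_from_ctt(1.0, [1.0] * n), 3))
```

The item-level reliability is 0.5 in every row; only the number of items changes.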

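An alternative, more intuitive expression of formula 18 (algebraically reworked, and assuming every item has the same error variance) is:

$\alpha = \frac{n \bar{r}}{1 + (n - 1) \bar{r}}$

where $\bar{r}$ is the correlation between any two items, which from (9) equals $\sigma_p^2 / (\sigma_p^2 + \sigma_{err}^2)$.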

This expression further underlines that there are two important factors that increase reliability (alpha): the number of items in the test (or number of tests) and the correlation between the items. The more items and the higher their correlation, the greater the reliability.
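
The effect of these two factors can be tabulated. The sketch below assumes that the reworked expression is the standardized form alpha = n·r / (1 + (n − 1)·r), where r is the mean inter-item correlation. That form is an assumption here: it agrees with expression (18) when all items have equal error variance. The function name is illustrative.

```python
def standardized_alpha(n, mean_r):
    """alpha = n * r / (1 + (n - 1) * r), with r the mean inter-item correlation.

    Assumed form; it matches expression (18) when all items share one error variance.
    """
    return n * mean_r / (1 + (n - 1) * mean_r)

# Alpha rises with both the number of items and the inter-item correlation
for n in (2, 10, 20):
    for r in (0.2, 0.5):
        print(n, r, round(standardized_alpha(n, r), 3))
```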

Note: when calculating Cronbach’s alpha, make sure the reverse-coded items are reflected (i.e. scored in the same direction as the other items).
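
To make the last point concrete, the following minimal sketch computes alpha from raw item scores using expression (16), before and after reflecting a reverse-coded item. The data are made up, the 1 to 5 response scale is an assumption, and `cronbach_alpha` and `reflect` are helper names introduced here rather than functions from any package.

```python
def variance(xs):
    """Sample variance with an N - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(rows):
    """Expression (16). rows[i][j] is respondent i's score on item j."""
    n_items = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(n_items)]
    total_var = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

def reflect(rows, item, low, high):
    """Reverse-score one item on a low..high response scale."""
    return [[(low + high - v) if j == item else v for j, v in enumerate(row)]
            for row in rows]

# Made-up responses on a 1-5 scale; the third item is reverse-coded
raw = [[1, 1, 5], [2, 2, 4], [3, 3, 3], [4, 5, 2], [5, 4, 1]]

print(cronbach_alpha(raw))                    # reverse-coded item left as is
print(cronbach_alpha(reflect(raw, 2, 1, 5)))  # after reflecting the item
```

Leaving the reverse-coded item unreflected makes it correlate negatively with the others, which shrinks the total-score variance and drags alpha down (here to a negative value).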