Scale validation: Analysis of the quality of summated index scales

Svend Kreiner

Dept. of biostatistics, Univ. of Copenhagen

Index scales

Provide indirect measurement of unobservable phenomena

Defined by functions summarizing responses to a number of items

S = f(Y1,…,Yk)

Examples:

Educational tests

Psychological or psychiatric tests

Measurement of socioeconomic status

Attitude measurements

BMI

Health-related quality of life measurement scales

Most measurement instruments are summated scales, S = Σi Yi.

BMI and Social class are two examples of scales with other types of scale functions.

Example: CHIPS items – a cognitive test

Figure 1. Four CHIPS items.

The physical functioning (PF) subscale of SF-36

Does your health now limit you in these activities? If so, how much?

PF1) Vigorous activities

PF2) Moderate activities

PF3) Lifting or carrying groceries

PF4) Climbing several flights of stairs

PF5) Climbing one flight of stairs

PF6) Bending, kneeling, or stooping

PF7) Walking more than a mile

PF8) Walking several blocks

PF9) Walking one block

PF10) Bathing or dressing yourself

Three response categories

0: Not limited

1: Limited a little

2: Limited a lot

The PADL (Physical Activities of Daily Living) measure of functional ability of healthy elderly.

Mobility function / Lower limb function / Upper limb function
A: Are you able to walk indoors? / G: Are you able to wash the lower part of the body? / L: Are you able to wash the upper part of the body?
B: Are you able to walk out of doors in nice weather? / H: Are you able to cut your toenails? / M: Are you able to cut your fingernails?
C: Are you able to walk out of doors in bad weather? / I: Are you able to go to the toilet yourself? / N: Are you able to comb your hair?
D: Are you able to manage stairs? / J: Are you able to dress the lower part of the body? / O: Are you able to wash your hair?
E: Are you able to get outdoors? / K: Are you able to take shoes/stockings on/off? / P: Are you able to dress the upper part of the body?
F: Are you able to get up from a chair or bed? / /

0 = “Cannot do it at all, or cannot do it without getting tired”

1 = “can do it without getting tired”

Do CHIPS, SF-36 and PADL provide good measurements?

What can we require of high-quality scales?

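Type of requirement / Type of considerations / Requirement
Validity / Substantive / Face validity
Validity / Substantive / Content validity
Validity / Statistical / Criterion validity
Validity / Statistical / Construct validity
Technical / Statistical / Sufficiency
Technical / Statistical / Objectivity
Technical / Statistical / No DIF
Technical / Statistical / Reliability
Technical / Statistical / Sensitivity/specificity
Technical / Statistical / Ability to discriminate
Practical / / Simplicity
Practical / / Feasibility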

How many of these requirements do CHIPS, SF-36 and PADL meet?

Face validity

Requires subject-matter arguments

Some general requirements related to face validity:

Existence of a latent quantitative or ordinal variable (the construct)

A causal relation with the latent variable as the cause and item responses as effects

Monotone relationships: E(Yi | Θ = θ) is an increasing function of θ

Some thinking about response behaviour

What construct does the PF subscale measure?

Is it different from the construct measured by the PADL items?

Content validity

A question of item coverage:

A summated scale should include items relating to all relevant aspects of the construct.

The PADL “item bank”

How should items for a short version of the PADL scale be selected?

No content validity

Content validity

Psychometrics

Three different (but related) traditions:

Classical psychometrics (not to be discussed here)

Item response theory for categorical items

Factor analysis (not to be discussed here)

Criterion validity

The score must correlate with all variables known in advance to be correlated with the latent variable

The SF-36 PF score must be correlated with self-reported health:

PF score / Self-reported health: very good / good / fair / bad / very bad
0 / 44,8% / 48,4% / 6,6% / ,2%
1 / 21,4% / 61,6% / 16,3% / ,7%
2 / 19,8% / 57,1% / 20,7% / 1,8% / ,5%
3 / 10,5% / 55,3% / 31,6% / 2,6%
4 / 12,5% / 46,2% / 35,6% / 5,8%
5 / 9,5% / 40,5% / 44,6% / 4,1% / 1,4%
6 / 10,5% / 50,9% / 35,1% / 1,8% / 1,8%
7 / 25,7% / 20,0% / 40,0% / 14,3%
8 / 7,1% / 21,4% / 57,1% / 14,3%
9 / 9,4% / 18,8% / 62,5% / 9,4%
10 / 8,6% / 14,3% / 68,6% / 5,7% / 2,9%
11 / 29,6% / 48,1% / 18,5% / 3,7%
12 / 5,0% / 15,0% / 65,0% / 15,0%
13 / 3,7% / 3,7% / 70,4% / 18,5% / 3,7%
14 / 21,4% / 50,0% / 28,6%
15 / 18,2% / 54,5% / 9,1% / 18,2%
16 / 9,1% / 45,5% / 27,3% / 18,2%
17 / 9,1% / 36,4% / 45,5% / 9,1%
18 / 14,3% / 71,4% / 14,3%
19 / 9,1% / 27,3% / 36,4% / 27,3%
20 / 12,5% / 50,0% / 37,5%
Total / 29,4% / 47,8% / 19,0% / 3,0% / ,8%

Goodman & Kruskal’s γ = 0.607 (p < 0.001)

Criterion validity

The PF score must be correlated with SRH

Goodman & Kruskal’s γ = 0.607 (p < 0.001)

PADL

Goodman & Kruskal’s γ = -0.544 (p < 0.001)
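
The gamma coefficients quoted above compare concordant and discordant pairs in a two-way table of ordinal variables: γ = (C - D)/(C + D). A minimal sketch of that calculation follows, assuming the table is given as a list of rows of counts with rows and columns in increasing order. The function name and the example counts are hypothetical and are not taken from the SF-36 or PADL data.

```python
def goodman_kruskal_gamma(table):
    """Gamma for a two-way table of counts with ordered rows and columns."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair cell (i, j) with every cell in a later row.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:      # both variables increase: concordant
                        concordant += table[i][j] * table[k][l]
                    elif l < j:    # one increases, one decreases: discordant
                        discordant += table[i][j] * table[k][l]
    # Note: undefined (division by zero) if there are no untied pairs.
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical 3 x 3 table of counts (score group by health rating).
example = [[20, 10, 5],
           [10, 20, 10],
           [5, 10, 20]]
gamma = goodman_kruskal_gamma(example)
```

Tied pairs are ignored, which is why gamma tends to be larger in absolute value than other rank correlations computed from the same table.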

Construct validity

Items have to be face valid and item responses must not depend directly on anything but the latent variable

Illustrated in the IRT graph

Psychometric models are graphical models defined by assumptions concerning conditional independence

Notation

Four types of variables:

Items: Y = (Y1,…,Yk)

The total score: S = Σi Yi

The latent trait variable: Θ

Exogenous variables: X = (X1,…,Xm)

Conditional independence:

X ⊥ Y | Z ⇔ P(X | Y, Z) = P(X | Z)

Construct validity

Construct validity requires

1) Unidimensionality

2) Monotonicity and causality

3) Local independence (Yi ⊥ Yj | Θ)

4) No differential item functioning (DIF): (Yi ⊥ Xj | Θ)

Consequences of construct validity

1) Items must be positively correlated

2) Items must be positively correlated with rest scores

3) If the score correlates with an exogenous variable, X, then all items must correlate with X in the same way.

4) If the latent variable is monotonically related to an exogenous variable, X, then the same is required for the score (criterion validity)

1) – 3) are requirements of consistency

4) is a requirement of criterion validity

Association among PF items (gamma coefficients)

Item          A      B      C      D      E      F      G      H      I      J      Rest score
A Vig.act.    -      0.915  0.889  0.861  0.873  0.835  0.858  0.839  0.821  0.845  0.860
B Mod.act     0.915  -      0.960  0.886  0.935  0.875  0.910  0.929  0.925  0.918  0.919
C Liftgroc    0.889  0.960  -      0.886  0.943  0.868  0.900  0.928  0.934  0.922  0.909
D Stair2+     0.861  0.886  0.886  -      0.982  0.896  0.937  0.948  0.949  0.902  0.892
E Stair1      0.873  0.935  0.943  0.982  -      0.945  0.948  0.962  0.966  0.942  0.958
F Bending     0.835  0.875  0.868  0.896  0.945  -      0.898  0.925  0.930  0.922  0.870
G Walk 1m     0.858  0.910  0.900  0.937  0.948  0.898  -      0.980  0.968  0.914  0.918
H Walk 2+b    0.839  0.929  0.928  0.948  0.962  0.925  0.980  -      0.997  0.941  0.959
I Walk 1bl    0.821  0.925  0.934  0.949  0.966  0.930  0.968  0.997  -      0.955  0.959
J Bathing     0.845  0.918  0.922  0.902  0.942  0.922  0.914  0.941  0.955  -      0.914

p = 0.000 for every coefficient in the table.

Items and exogenous variables

Variable           A       B       C       D       E       F       G       H       I       J       Score
K srh      Gamma   -0.627  -0.729  -0.672  -0.711  -0.748  -0.683  -0.752  -0.831  -0.852  -0.814  -0.607
           p        0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
L BMI      Gamma   -0.221  -0.164  -0.186  -0.299  -0.294  -0.328  -0.250  -0.294  -0.241  -0.282  -0.224
           p        0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
M Smoking  Gamma    0.095   0.013  -0.011   0.051  -0.018  -0.046   0.004  -0.055  -0.073  -0.020   0.071
           p        0.001   0.388   0.403   0.114   0.385   0.145   0.464   0.198   0.162   0.395   0.006
N Sex      Gamma   -0.201  -0.365  -0.379  -0.264  -0.212  -0.173  -0.197  -0.155  -0.183  -0.008  -0.203
           p        0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.010   0.009   0.457   0.000
O Age      Gamma   -0.542  -0.628  -0.690  -0.617  -0.765  -0.590  -0.676  -0.760  -0.741  -0.679  -0.516
           p        0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000

Item responses are consistent:

No evidence against construct validity

Differential item functioning (DIF)

The unobserved latent variable is the only variable that influences item responses under the assumption of construct validity

The observed item responses can therefore provide indirect information on (measurement of) the value of the latent variable.

DIF means that item responses are influenced by other variables. Measurement will therefore be confounded.

Confounding of measurement due to DIF

To avoid confounding of measurement we have to identify and remove differentially functioning items

Analysis of differential item functioning

This is almost, but not quite, the same as the requirement of no DIF found in connection with the definition of construct validity.

To distinguish between the two types of DIF we refer to structure-related and score-related DIF.

Mantel-Haenszel procedures and partial gamma coefficients test the hypothesis of no score-related DIF.

Are there any psychometric models satisfying both requirements of no DIF?

Answer: The so-called Rasch models are the only known models with this property.

Analysis of DIF

Three different types of analysis:

Mantel-Haenszel analysis in three-way tables

Analysis by partial Gamma coefficients

Logistic regression analysis

All three types of analysis test the same hypothesis of conditional independence:

Item ⊥ Covariate | Score

DIF analyses of PADL

Is “Wash hair” biased relative to sex?

padl score / Sex / washhair: tired or worse / washhair: not tired / Total
,00 / Sex / Male / 5 / 5
Female / 7 / 7
Total / 12 / 12
1,00 / Sex / Male / 1 / 1
Female / 1 / 1
Total / 2 / 2
2,00 / Sex / Male / 1 / 1
Female / 3 / 3
Total / 4 / 4
3,00 / Sex / Male / 2 / 1 / 3
Female / 2 / 0 / 2
Total / 4 / 1 / 5
4,00 / Sex / Male / 1 / 1 / 2
Female / 3 / 0 / 3
Total / 4 / 1 / 5
. / . / . / . / . / .
12,00 / Sex / Male / 0 / 21 / 21
Female / 3 / 25 / 28
Total / 3 / 46 / 49
13,00 / Sex / Male / 0 / 45 / 45
Female / 4 / 46 / 50
Total / 4 / 91 / 95
14,00 / Sex / Male / 35 / 35
Female / 26 / 26
Total / 61 / 61
15,00 / Sex / Male / 0 / 24 / 24
Female / 1 / 43 / 44
Total / 1 / 67 / 68
16,00 / Sex / Male / 162 / 162
Female / 133 / 133
Total / 295 / 295

The Mantel-Haenszel analysis:

1) calculates odds ratios in each stratum

2) tests that all odds ratios are the same

3) tests that the odds ratios are equal to one (conditional independence)

Strata / Odds ratio
3 and 4 / 0.0
5 / 0.05
7 / 2.0
8 / 0.0
9 / 0.5
10 / 0.583
12, 13 and 15 / 0.0

Breslow-Day test of equal odds ratios: χ² = 6.0, df = 9, p = 0.74

Mantel-Haenszel test of conditional independence: χ² = 5.1, df = 1, p = 0.025

MH estimate = 0.233

Evidence of differential item functioning
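
The Mantel-Haenszel steps can be sketched in a few lines. The code below is a minimal illustration, assuming each score stratum is a 2 x 2 table ((a, b), (c, d)) with rows for the covariate and columns for the item response. It uses the textbook estimator and the test statistic without continuity correction, which may differ from what the software behind the slides used. The function names and the example strata are hypothetical.

```python
import math


def mantel_haenszel(strata):
    """Common odds ratio and chi-square (1 df) over 2 x 2 strata.

    strata: iterable of ((a, b), (c, d)) tables of counts.
    """
    num = den = 0.0
    observed = expected = variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two persons carries no information
        num += a * d / n
        den += b * c / n
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    odds_ratio = num / den  # undefined if every stratum has b * c == 0
    chi2 = (observed - expected) ** 2 / variance
    return odds_ratio, chi2


def chi2_1df_pvalue(chi2):
    """Upper tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical strata (one 2 x 2 table per score level).
strata = [((2, 3), (4, 1)), ((1, 6), (3, 5)), ((0, 21), (3, 25))]
odds_ratio, chi2 = mantel_haenszel(strata)
p_value = chi2_1df_pvalue(chi2)
```

As a check on the p-value helper, chi2_1df_pvalue(5.1) is close to the p = 0.025 reported for the "Wash hair" item.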

Logistic regression will do the same trick

Dependent variable: WASHHAIR

Independent variables: Sex + PADL score

The PADL score has to be defined as a categorical variable

Variable / B / S.E. / Wald / df / Sig. / Exp(B)
padl / / / 46,121 / 16 / ,000 /
padl(1) / -42,609 / 11533,591 / ,000 / 1 / ,997 / ,000
padl(2) / -42,721 / 27839,275 / ,000 / 1 / ,999 / ,000
padl(3) / -42,350 / 19723,588 / ,000 / 1 / ,998 / ,000
padl(4) / -23,003 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(5) / -22,719 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(6) / -21,630 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(7) / -21,352 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(8) / -21,336 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(9) / -22,347 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(10) / -20,464 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(11) / -19,617 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(12) / -,101 / 7618,819 / ,000 / 1 / 1,000 / ,904
padl(13) / -18,296 / 2281,692 / ,000 / 1 / ,994 / ,000
padl(14) / -17,949 / 2281,692 / ,000 / 1 / ,994 / ,000
padl(15) / -,034 / 5509,874 / ,000 / 1 / 1,000 / ,967
padl(16) / -16,723 / 2281,692 / ,000 / 1 / ,994 / ,000
Sex(1) / -1,435 / ,580 / 6,124 / 1 / ,013 / ,238

Logistic regression permits

1) Analysis of several different potential sources of DIF at the same time

2) Covariates with more than two categories

Variables in the Equation

Variable / B / S.E. / Wald / df / Sig. / Exp(B)
padl(1) / -43,018 / 10945,971 / ,000 / 1 / ,997 / ,000
padl(2) / -43,145 / 28314,566 / ,000 / 1 / ,999 / ,000
. / . / . / . / . / . / .
padl(15) / -,083 / 5537,202 / ,000 / 1 / 1,000 / ,920
padl(16) / -16,708 / 2373,083 / ,000 / 1 / ,994 / ,000
Sex(1) / -,773 / ,660 / 1,369 / 1 / ,242 / ,462
socgrp / / / 1,908 / 6 / ,928 /
socgrp(1) / -1,378 / 1,626 / ,718 / 1 / ,397 / ,252
socgrp(2) / -1,472 / 1,441 / 1,043 / 1 / ,307 / ,229
socgrp(3) / -1,337 / 1,443 / ,858 / 1 / ,354 / ,263
socgrp(4) / -1,728 / 1,352 / 1,635 / 1 / ,201 / ,178
socgrp(5) / -1,608 / 1,946 / ,683 / 1 / ,409 / ,200
socgrp(6) / -1,027 / 1,928 / ,283 / 1 / ,594 / ,358
pensage / ,130 / ,043 / 8,936 / 1 / ,003 / 1,139

Pension age appears to be the real source of DIF:

The later you retired, the better your ability to wash your hair, after the total PADL score has been taken into account.

Pension age confounds the measurement of physical functioning.

Social class is not a source of DIF.

Sex does not appear to be a source of DIF when pension age and social class have been taken into account.

To check the result we remove social class from the analysis.

Variable / B / S.E. / Wald / df / Sig. / Exp(B)
padl(1) / -42,479 / 11693,480 / ,000 / 1 / ,997 / ,000
padl(16) / -16,778 / 2396,101 / ,000 / 1 / ,994 / ,000
Sex(1) / -,830 / ,642 / 1,671 / 1 / ,196 / ,436
pensage / ,122 / ,040 / 9,089 / 1 / ,003 / 1,130

Sex is still not significant

Ordinal variables

The problems:

1) Ordinal variables are treated as nominal by logistic regression

2) Items with more than two categories cannot be handled by conventional logistic regression models.

The solutions:

1) Dichotomize items before the DIF analysis, but after calculation of the score.

2) Use partial gamma coefficients instead of Mantel-Haenszel statistics and logistic regression

Why does dichotomization work?

Assume that

1) X and Y are independent

2) X has three categories

Dichotomize X:

Z = 1 if X = 1

Z = 2 if X = 2 or 3

Independence of X and Y implies that Z and Y are also independent:

P(Z=1, Y = y) = P(X=1, Y = y) = P(X=1)P(Y=y) = P(Z=1)P(Y=y)

P(Z=2, Y=y) = P(X=2 or 3, Y = y) = P(X=2, Y = y) + P(X=3, Y = y)

= P(X=2)P(Y=y) + P(X=3)P(Y=y)

= (P(X=2)+P(X=3))P(Y=y)

= P(Z=2)P(Y=y)

In other words

P(Z=z,Y=y) = P(Z=z)P(Y=y)

Z and Y are independent

Evidence against independence of Z and Y is also evidence against independence of X and Y.
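
The argument can be checked numerically. The sketch below builds a joint distribution of X and Y that is independent by construction, collapses categories 2 and 3 of X into one, and verifies that the collapsed table still factorizes. The marginal probabilities and the function names are hypothetical choices for illustration.

```python
def collapse_rows(joint, groups):
    """Merge rows of a joint probability table P(X = x, Y = y).

    groups lists, for each new category, the row indices of the old
    categories that are merged into it.
    """
    n_cols = len(joint[0])
    return [[sum(joint[x][y] for x in group) for y in range(n_cols)]
            for group in groups]


def is_independent(joint, tol=1e-12):
    """True if every cell equals the product of its row and column margins."""
    row_margin = [sum(row) for row in joint]
    col_margin = [sum(row[y] for row in joint) for y in range(len(joint[0]))]
    return all(abs(joint[x][y] - row_margin[x] * col_margin[y]) <= tol
               for x in range(len(joint)) for y in range(len(joint[0])))


# Hypothetical margins; X has three categories, Y has two.
p_x = [0.2, 0.5, 0.3]
p_y = [0.6, 0.4]
joint_xy = [[px * py for py in p_y] for px in p_x]   # independent by construction

# Z = 1 if X = 1, Z = 2 if X = 2 or 3 (row indices 0 and [1, 2]).
joint_zy = collapse_rows(joint_xy, groups=[[0], [1, 2]])
```

Here is_independent(joint_zy) holds whenever is_independent(joint_xy) does. The converse is not guaranteed, so a DIF test on dichotomized items can miss dependence that only shows up between the merged categories.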

The partial gamma coefficient

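Stratum (A,B,C) / Y / X = 1 / X = 2 / X = 3 / Concordance / Discordance / Gamma
111 / 1 / a / b / c / C111 / D111 / γ111 = (C111 - D111)/(C111 + D111)
111 / 2 / d / e / f / / /
112 / 1 / g / h / i / C112 / D112 / γ112 = (C112 - D112)/(C112 + D112)
112 / 2 / j / k / l / / /
… / … / … / … / … / … / … / …
222 / 1 / m / n / o / C222 / D222 / γ222 = (C222 - D222)/(C222 + D222)
222 / 2 / p / q / r / / /
Total / / / / / Ctot = Σabc Cabc / Dtot = Σabc Dabc / γ = (Ctot - Dtot)/(Ctot + Dtot)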

Calculate gamma coefficients in each stratum

The partial gamma coefficient is a weighted mean of the gamma coefficients in the different strata of the table

The partial gamma for social class & “wash hair” calculated over strata defined by PADL and pension age is γ = -0.10 (p = 0.742)
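
A minimal sketch of the partial gamma calculation, assuming the usual definition in which concordant and discordant pairs are summed over strata before the ratio is formed. With that definition the partial gamma is the mean of the stratum gammas weighted by the number of untied pairs, C + D, in each stratum. The function names and the two example strata are hypothetical.

```python
def concordance_counts(table):
    """Return (concordant, discordant) pair counts for an ordered two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return concordant, discordant


def partial_gamma(strata):
    """Partial gamma: pool concordance and discordance over all strata."""
    c_tot = d_tot = 0
    for table in strata:
        c, d = concordance_counts(table)
        c_tot += c
        d_tot += d
    return (c_tot - d_tot) / (c_tot + d_tot)


# Two hypothetical strata, each a 2 x 3 table (Y by X).
strata = [
    [[10, 5, 2], [3, 6, 9]],
    [[4, 4, 4], [4, 4, 4]],
]
pooled = partial_gamma(strata)

# The same number obtained as a weighted mean of the stratum gammas.
pairs = [concordance_counts(t) for t in strata]
weighted_mean = (sum((c + d) * ((c - d) / (c + d)) for c, d in pairs)
                 / sum(c + d for c, d in pairs))
```

Strata with few untied pairs contribute little to the pooled coefficient, which is useful when many score strata are sparse, as in the PADL table.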

Dichotomized items

SF-36

10 items, Y1,…,Y10, with responses coded 0, 1 and 2

Let Zi be a dichotomized version of Yi.

If Yi ⊥ X | S = Σi Yi, then Zi ⊥ X | S = Σi Yi

DIF analysis of dichotomized items

Calculate the PF score from responses to the items with three response categories.

Dependent item: Zi

Independent variables: S and covariates

DIF analysis of PF3, “Carrying groceries”

Exogenous covariates:

- Gender

- Smoking

- Age

- BMI

Variable / B / S.E. / Wald / df / Sig. / Exp(B)
Smoking / / / 1,153 / 4 / ,886 /
Smoking(0) / 0 / - / - / - / -
Smoking(1) / -,313 / ,421 / ,553 / 1 / ,457 / ,731
Smoking(2) / -,105 / ,698 / ,023 / 1 / ,880 / ,900
Smoking(3) / ,132 / ,272 / ,236 / 1 / ,627 / 1,141
Smoking(4) / -,069 / ,232 / ,089 / 1 / ,766 / ,933
Gender(1) / -,834 / ,208 / 16,045 / 1 / ,000 / ,434
Age / -,012 / ,005 / 5,032 / 1 / ,025 / ,988
bmi / ,061 / ,025 / 5,843 / 1 / ,016 / 1,063
pfscore / / / 225,266 / 20 / ,000 /
pfscore(1) / -42,490 / 10906,376 / ,000 / 1 / ,997 / ,000
pfscore(2) / -42,222 / 13228,396 / ,000 / 1 / ,997 / ,000
. / . / . / . / . / . / .
pfscore(19) / -18,925 / 1169,243 / ,000 / 1 / ,987 / ,000
pfscore(20) / -16,424 / 1169,243 / ,000 / 1 / ,989 / ,000

Evidence of DIF relative to Gender, Age and BMI

Psychometric models

We need regression models to describe the relationship between the latent variable and the item responses.

Psychometric models

Type of items / Regression model / Psychometric model
Dichotomous (0/1) / / Rasch model
Dichotomous (0/1) / / Two-parameter IRT model
Ordinal / / Rasch model for ordinal items
Ordinal / Assume that Yi is a categorized version of a continuous response in a linear regression model / Factor analysis
Interval scale / Linear regression / Factor analysis

The choice of psychometric model depends on the type of item.
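
For dichotomous items, the regression of an item on the latent variable is usually written as a logistic function. The sketch below uses the standard textbook parameterization, which is an assumption here because the formulas did not survive in the table above: the Rasch model has one difficulty parameter per item, and the two-parameter model adds an item-specific discrimination. Function and parameter names are hypothetical.

```python
import math


def two_parameter_irf(theta, difficulty, discrimination=1.0):
    """P(Y = 1 | theta) under the two-parameter logistic IRT model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def rasch_irf(theta, difficulty):
    """P(Y = 1 | theta) under the Rasch model (all discriminations equal to 1)."""
    return two_parameter_irf(theta, difficulty, discrimination=1.0)


# Response probabilities for a hypothetical item of difficulty 0.5
# at a few values of the latent variable.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
rasch_curve = [rasch_irf(t, 0.5) for t in thetas]
steep_curve = [two_parameter_irf(t, 0.5, discrimination=2.0) for t in thetas]
```

Both curves are increasing in theta, which is the monotonicity requirement from the face validity discussion. Only the Rasch version keeps the discrimination the same for all items.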

Scale validity requires fit to an appropriate measurement model

The analysis should address all consequences of assumptions of the model:

1) Unidimensionality

2) Monotonicity and consistency

3) Criterion validity

4) Local independence

5) No DIF

The technical requirements

The following requirements are equivalent

Sufficiency

Objectivity

No DIF in the sense that

Items ⊥ exogenous variables | Score

They all imply that the psychometric model has to be a Rasch model

The Rasch model is characterized by another graph in addition to the one associated with construct validity

The Rasch graph

Reliability

The definition of reliability generalizes without changes from classical psychometrics to item response theory and factor analysis.

The conditional distribution of the score given the latent variable, P(S | Θ = θ) with S = Σi Yi, can be determined from the regression models, P(Yi | Θ = θ).

The conditional expected value,

T(θ) = E(S | Θ = θ) = E(Σi Yi | Θ = θ),

is a monotone function of θ.

T(θ) is the true score

The true score of psychometric models

T(Θ) is a function of Θ. The distribution of T is therefore determined by the distribution of Θ.

It has a variance, Var(T), that can be compared with the variance of S to give us a measure of reliability,

Reliability = Var(T)/Var(S)
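
As a rough illustration of Var(T)/Var(S), the sketch below computes the ratio for dichotomous Rasch items. It rests on several assumptions that are not in the slides: the latent variable is standard normal, its distribution is approximated on an equally spaced grid, and the item difficulties are made up. Var(S) is obtained from the law of total variance, Var(S) = Var(T) + E[Var(S | Θ)], where local independence gives Var(S | θ) = Σi pi(θ)(1 - pi(θ)).

```python
import math


def rasch_probability(theta, difficulty):
    """P(Y = 1 | theta) for a dichotomous Rasch item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def reliability(difficulties, lo=-6.0, hi=6.0, n_grid=601):
    """Var(T) / Var(S) for a sum score, assuming a standard normal latent variable."""
    step = (hi - lo) / (n_grid - 1)
    thetas = [lo + i * step for i in range(n_grid)]
    weights = [math.exp(-0.5 * t * t) for t in thetas]
    total = sum(weights)
    weights = [w / total for w in weights]   # normalized grid weights

    true_scores = []   # T(theta) = sum of item probabilities
    error_vars = []    # Var(S | theta) under local independence
    for t in thetas:
        probs = [rasch_probability(t, b) for b in difficulties]
        true_scores.append(sum(probs))
        error_vars.append(sum(p * (1.0 - p) for p in probs))

    mean_t = sum(w * ts for w, ts in zip(weights, true_scores))
    var_t = sum(w * (ts - mean_t) ** 2 for w, ts in zip(weights, true_scores))
    var_s = var_t + sum(w * ev for w, ev in zip(weights, error_vars))
    return var_t / var_s


# Hypothetical item difficulties: a 5-item scale and a 20-item scale.
short_scale = [-2.0, -1.0, 0.0, 1.0, 2.0]
long_scale = short_scale * 4
r_short = reliability(short_scale)
r_long = reliability(long_scale)
```

Repeating the same items four times multiplies Var(T) by 16 but the error variance only by 4, so the longer scale comes out as more reliable.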

Some complications

The measurement error, S - T(θ), will not in general be independent of T because the variance of the error will be small when T is close to the boundaries of the score range.

The test-retest correlation will therefore not be exactly equal to the reliability.

Some possibilities

The true score is a function of θ: θ → T(θ)

The value of θ may therefore be estimated by the inverse function: S → T⁻¹(S).

Better measures of reliability may be obtained by

1) comparison of the variance of the estimate of θ with Var(Θ),

2) calculation of the correlation between the true value of θ and the estimate of θ.

These possibilities will be pursued later.
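
Because T(θ) is increasing, the inverse can be found numerically. The sketch below does this by bisection for dichotomous Rasch items with made-up difficulties. It is only an illustration of the idea of inverting the true score function, not the estimation method that the later lectures may use, and all names are hypothetical.

```python
import math


def true_score(theta, difficulties):
    """T(theta): expected sum score under the Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


def estimate_theta(score, difficulties, lo=-10.0, hi=10.0, tol=1e-10):
    """Solve T(theta) = score by bisection on the interval [lo, hi]."""
    if not true_score(lo, difficulties) < score < true_score(hi, difficulties):
        # Scores of 0 or k (all items) have no finite solution.
        raise ValueError("score is outside the invertible range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical three-item scale.
difficulties = [-1.0, 0.0, 1.0]
estimates = {s: estimate_theta(s, difficulties) for s in (1, 2)}
```

The extreme scores are the practical complication: a person who answers all items in the lowest or the highest category gets no finite estimate from this inversion.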
