Scale validation:
Analysis of the quality of
summated index scales
Svend Kreiner
Dept. of biostatistics, Univ. of Copenhagen
Index scales
Provides indirect measurement of unobservable phenomena
Defined by functions summarizing responses to a number of items
S = f(Y1,…,Yk)
Examples:
Educational tests
Psychological or psychiatric tests
Measurement of socioeconomic status
Attitude measurements
BMI
Health-related quality of life measurement scales
Most measurement instruments are summated scales, $S = \sum_i Y_i$.
BMI and Social class are two examples of scales with other types of scale functions.
Example: CHIPS items – a cognitive test
Figure 1. Four CHIPS items.
The physical functioning (PF)
subscale of SF-36
Does your health now limit you in
these activities? If so, how much?
PF1) Vigorous activities
PF2) Moderate activities
PF3) Lifting or carrying groceries
PF4) Climbing several flights of stairs
PF5) Climbing one flight of stairs
PF6) Bending, kneeling, or stooping
PF7) Walking more than a mile
PF8) Walking several blocks
PF9) Walking one block
PF10) Bathing or dressing yourself
Three response categories
0: Not limited
1: Limited a little
2: Limited a lot
The PADL (Physical Activities of Daily Living) measure of functional ability of healthy elderly.
Mobility function / Lower limb function / Upper limb function
A: Are you able to walk indoors? / G: Are you able to wash the lower part of the body? / L: Are you able to wash the upper part of the body?
B: Are you able to walk out of doors in nice weather? / H: Are you able to cut your toenails? / M: Are you able to cut your fingernails?
C: Are you able to walk out of doors in bad weather? / I: Are you able to go to the toilet yourself? / N: Are you able to comb your hair?
D: Are you able to manage stairs? / J: Are you able to dress the lower part of the body? / O: Are you able to wash your hair?
E: Are you able to get outdoors? / K: Are you able to take shoes/stockings on/off? / P: Are you able to dress the upper part of the body?
F: Are you able to get up from a chair or bed? / /
0 = “Cannot do it at all, or cannot do it without getting tired”
1 = “can do it without getting tired”
Do CHIPS, SF-36 and PADL provide good measurements?
What can we require of high-quality scales?
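Type of requirement / Type of considerations / Requirement
Validity / Substantive / Face validity
Validity / Substantive / Content validity
Validity / Statistical / Criterion validity
Validity / Statistical / Construct validity
Technical / Statistical / Sufficiency
Technical / Statistical / Objectivity
Technical / Statistical / No DIF
Technical / Statistical / Reliability
Technical / Statistical / Sensitivity/specificity
Technical / Statistical / Ability to discriminate
Technical / Practical / Simplicity
Technical / Practical / Feasibility
How many of these requirements do CHIPS, SF-36 and PADL meet?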
Face validity
Requires substance matter arguments
Some general requirements related to face validity:
Existence of a latent quantitative or ordinal variable (the construct)
A causal relation with the latent variable as the cause and item responses as effects
Monotone relationships: $E(Y_i \mid \Theta = \theta)$ is an increasing function of $\theta$
Some thinking about response behaviour
What construct does the PF subscale measure?
Is it different from the construct measured by the PADL items?
Content validity
A question of item coverage:
A summated scale should include items relating to all relevant aspects of the construct.
The PADL “item bank”
How should items for a short version of the PADL scale be selected?
No content validity
Content validity
Psychometrics
Three different (but related) traditions:
Classical psychometrics (not to be discussed here)
Item response theory for categorical items
Factor analysis (not to be discussed here)
Criterion validity
The score must correlate with all variables known in advance to be correlated with the latent variable
The SF-36 PF score must be correlated with self-reported health:
PF score / Self-reported health: very good / good / fair / bad / very bad
0 / 44,8% / 48,4% / 6,6% / ,2%
1 / 21,4% / 61,6% / 16,3% / ,7%
2 / 19,8% / 57,1% / 20,7% / 1,8% / ,5%
3 / 10,5% / 55,3% / 31,6% / 2,6%
4 / 12,5% / 46,2% / 35,6% / 5,8%
5 / 9,5% / 40,5% / 44,6% / 4,1% / 1,4%
6 / 10,5% / 50,9% / 35,1% / 1,8% / 1,8%
7 / 25,7% / 20,0% / 40,0% / 14,3%
8 / 7,1% / 21,4% / 57,1% / 14,3%
9 / 9,4% / 18,8% / 62,5% / 9,4%
10 / 8,6% / 14,3% / 68,6% / 5,7% / 2,9%
11 / 29,6% / 48,1% / 18,5% / 3,7%
12 / 5,0% / 15,0% / 65,0% / 15,0%
13 / 3,7% / 3,7% / 70,4% / 18,5% / 3,7%
14 / 21,4% / 50,0% / 28,6%
15 / 18,2% / 54,5% / 9,1% / 18,2%
16 / 9,1% / 45,5% / 27,3% / 18,2%
17 / 9,1% / 36,4% / 45,5% / 9,1%
18 / 14,3% / 71,4% / 14,3%
19 / 9,1% / 27,3% / 36,4% / 27,3%
20 / 12,5% / 50,0% / 37,5%
Total / 29,4% / 47,8% / 19,0% / 3,0% / ,8%
Goodman & Kruskal's $\gamma$ = 0.607 (p < 0.001)
Criterion validity
The PF score must be correlated with self-reported health (SRH)
PF score: Goodman & Kruskal's $\gamma$ = 0.607 (p < 0.001)
PADL score: Goodman & Kruskal's $\gamma$ = -0.544 (p < 0.001)
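The gamma coefficients quoted above are computed from counts of concordant and discordant pairs in a two-way table. The sketch below shows that calculation for a table whose rows and columns are both ordered. The counts are hypothetical and are not taken from the SF-36 or PADL data.

```python
def goodman_kruskal_gamma(table):
    """Return (gamma, concordant, discordant) for a two-way table of counts.

    Rows and columns must both be listed in their natural order.
    """
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    gamma = (concordant - discordant) / (concordant + discordant)
    return gamma, concordant, discordant


# Hypothetical counts: rows = low/medium/high score, columns = three ordered health categories.
counts = [
    [40, 15, 5],
    [20, 25, 15],
    [5, 15, 40],
]
gamma, concordant, discordant = goodman_kruskal_gamma(counts)
print(round(gamma, 3), concordant, discordant)
```

Gamma lies between -1 and 1, and its sign depends on how the categories are coded, which is why the PF and PADL coefficients above have opposite signs.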
Construct validity
Items have to be face valid and item responses must not depend directly on anything but the latent variable
Illustrated in the IRT graph
Psychometric models are graphical models defined by assumptions concerning conditional independence
Notation
Four types of variables:
Items: $Y = (Y_1, \ldots, Y_k)$
The total score: $S = \sum_i Y_i$
The latent trait variable: $\Theta$
Exogenous variables: $X = (X_1, \ldots, X_m)$
Conditional independence:
$X \perp Y \mid Z \iff P(X \mid Y, Z) = P(X \mid Z)$
Construct validity
Construct validity requires
1) Unidimensionality
2) Monotonicity and causality
3) Local independence ($Y_i \perp Y_j \mid \Theta$)
4) No differential item functioning (DIF): ($Y_i \perp X_j \mid \Theta$)
Consequences of construct validity
1) Items must be positively correlated
2) Items must be positively correlated with rest scores
3) If the score correlates with an exogenous variable, X, then all items must correlate with X in the same way.
4) If the latent variable is monotonically related to an exogenous variable, X, then the same is required for the score (criterion validity)
1) – 3) are requirements of consistency
4) is a requirement of criterion validity
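Consequence 2) can be checked by correlating each item with its rest score, that is, the total score with the item itself left out. The sketch below does this with gamma coefficients computed from paired observations. The response matrix is hypothetical.

```python
def gamma_from_pairs(x, y):
    """Goodman-Kruskal gamma computed from two paired lists of ordinal values."""
    concordant = discordant = 0
    for a in range(len(x)):
        for b in range(a + 1, len(x)):
            product = (x[a] - x[b]) * (y[a] - y[b])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical responses: one row per person, one column per item (coded 0/1/2).
responses = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [2, 1, 1],
    [2, 2, 1],
    [2, 2, 2],
]

item_rest_gammas = []
for i in range(len(responses[0])):
    item = [row[i] for row in responses]
    rest = [sum(row) - row[i] for row in responses]  # score without item i
    item_rest_gammas.append(gamma_from_pairs(item, rest))

# Consistency requires every item-rest coefficient to be positive.
consistent = all(g > 0 for g in item_rest_gammas)
print(item_rest_gammas, consistent)
```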
Association among PF items (gamma coefficients)
Item          A      B      C      D      E      F      G      H      I      J      Rest score
A Vig.act.    -      0.915  0.889  0.861  0.873  0.835  0.858  0.839  0.821  0.845  0.860
B Mod.act     0.915  -      0.960  0.886  0.935  0.875  0.910  0.929  0.925  0.918  0.919
C Liftgroc    0.889  0.960  -      0.886  0.943  0.868  0.900  0.928  0.934  0.922  0.909
D Stair2+     0.861  0.886  0.886  -      0.982  0.896  0.937  0.948  0.949  0.902  0.892
E Stair1      0.873  0.935  0.943  0.982  -      0.945  0.948  0.962  0.966  0.942  0.958
F Bending     0.835  0.875  0.868  0.896  0.945  -      0.898  0.925  0.930  0.922  0.870
G Walk 1m     0.858  0.910  0.900  0.937  0.948  0.898  -      0.980  0.968  0.914  0.918
H Walk 2+b    0.839  0.929  0.928  0.948  0.962  0.925  0.980  -      0.997  0.941  0.959
I Walk 1bl    0.821  0.925  0.934  0.949  0.966  0.930  0.968  0.997  -      0.955  0.959
J Bathing     0.845  0.918  0.922  0.902  0.942  0.922  0.914  0.941  0.955  -      0.914
p = 0.000 for every coefficient in the table.
Items and exogenous variables
A B C D E F G H I J score
------
K srh Gamma -0.627 -0.729 -0.672 -0.711 -0.748 -0.683 -0.752 -0.831 -0.852 -0.814 -0.607
p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
L BMI Gamma -0.221 -0.164 -0.186 -0.299 -0.294 -0.328 -0.250 -0.294 -0.241 -0.282 -0.224
p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
M Smoking Gamma 0.095 0.013 -0.011 0.051 -0.018 -0.046 0.004 -0.055 -0.073 -0.020 0.071
p 0.001 0.388 0.403 0.114 0.385 0.145 0.464 0.198 0.162 0.395 0.006
N Sex Gamma -0.201 -0.365 -0.379 -0.264 -0.212 -0.173 -0.197 -0.155 -0.183 -0.008 -0.203
p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 0.009 0.457 0.000
O Age Gamma -0.542 -0.628 -0.690 -0.617 -0.765 -0.590 -0.676 -0.760 -0.741 -0.679 -0.516
p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Item responses are consistent:
No evidence against construct validity
Differential item functioning (DIF)
The unobserved latent variable is the only variable that influences item responses under the assumption of construct validity
The observed item responses can therefore provide indirect information on (measurement of) the value of the latent variable.
DIF means that item responses are influenced by other variables. Measurement will therefore be confounded.
Confounding of measurement due to DIF
To avoid confounding of measurement we have to identify and remove differentially functioning items
Analysis of differential item functioning
The requirement that item responses are conditionally independent of exogenous variables given the total score is almost, but not quite, the same requirement of no DIF found in connection with the definition of construct validity.
To distinguish between the two types of DIF we refer to structure-related and score-related DIF.
Mantel-Haenszel procedures and partial gamma coefficients test the hypothesis of no score-related DIF.
Are there any psychometric models satisfying both requirements of no DIF?
Answer: The so-called Rasch models are the only known models with this property.
Analysis of DIF
Three different types of analysis:
Mantel-Haenszel analysis in three-way tables
Analysis by partial Gamma coefficients
Logistic regression analysis
All three types of analysis test the same hypothesis of conditional independence:
Item $\perp$ Covariate $\mid$ Score
DIF analyses of PADL
Is “Wash hair” biased relative to sex?
padl / washhair: tired or worse / not tired / Total
,00 / Sex / Male / 5 / 5
Female / 7 / 7
Total / 12 / 12
1,00 / Sex / Male / 1 / 1
Female / 1 / 1
Total / 2 / 2
2,00 / Sex / Male / 1 / 1
Female / 3 / 3
Total / 4 / 4
3,00 / Sex / Male / 2 / 1 / 3
Female / 2 / 0 / 2
Total / 4 / 1 / 5
4,00 / Sex / Male / 1 / 1 / 2
Female / 3 / 0 / 3
Total / 4 / 1 / 5
. / . / . / . / . / .
12,00 / Sex / Male / 0 / 21 / 21
Female / 3 / 25 / 28
Total / 3 / 46 / 49
13,00 / Sex / Male / 0 / 45 / 45
Female / 4 / 46 / 50
Total / 4 / 91 / 95
14,00 / Sex / Male / 35 / 35
Female / 26 / 26
Total / 61 / 61
15,00 / Sex / Male / 0 / 24 / 24
Female / 1 / 43 / 44
Total / 1 / 67 / 68
16,00 / Sex / Male / 162 / 162
Female / 133 / 133
Total / 295 / 295
The Mantel-Haenszel analysis:
1) calculates odds ratios in each stratum
2) tests that all odds ratios are the same
3) tests that the odds ratios are equal to one (conditional independence)
Strata / Odds ratio
3 and 4 / 0.0
5 / 0.05
7 / 2.0
8 / 0.0
9 / 0.5
10 / 0.583
12, 13 and 15 / 0.0
Breslow-Day test of equal odds ratios: $\chi^2$ = 6.0, df = 9, p = 0.74
Mantel-Haenszel test of conditional independence: $\chi^2$ = 5.1, df = 1, p = 0.025
MH estimate = 0.233
Evidence of differential item functioning
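A minimal sketch of the Mantel-Haenszel calculation, assuming the usual common odds-ratio estimate, sum(a*d/n) / sum(b*c/n), and a chi-square statistic without continuity correction. Only strata 3, 4, 12, 13 and 15 of the sex by "wash hair" table can be read as full 2 x 2 tables in the listing above, so the sketch uses those alone and will not reproduce the reported MH estimate of 0.233 or the reported test statistic.

```python
def mantel_haenszel(strata):
    """Return (common odds-ratio estimate, chi-square statistic with 1 df).

    Each stratum is ((a, b), (c, d)): rows are the two covariate groups,
    columns are the two item responses. No continuity correction is applied.
    """
    numerator = denominator = 0.0
    observed = expected = variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two persons carries no information
        numerator += a * d / n
        denominator += b * c / n
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    odds_ratio = numerator / denominator if denominator else float("inf")
    chi_square = (observed - expected) ** 2 / variance if variance else 0.0
    return odds_ratio, chi_square


# Strata 3, 4, 12, 13 and 15: rows = (male, female),
# columns = (tired or worse, not tired).
visible_strata = [
    ((2, 1), (2, 0)),
    ((1, 1), (3, 0)),
    ((0, 21), (3, 25)),
    ((0, 45), (4, 46)),
    ((0, 24), (1, 43)),
]
odds_ratio, chi_square = mantel_haenszel(visible_strata)
print(odds_ratio, round(chi_square, 2))
```

In these five strata no man who reports being "not tired" is paired with a woman in the opposite cell pattern, so the estimate from this subset is 0, in line with the stratum odds ratios of 0.0 listed for them.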
Logistic regression will do the same trick
Dependent variable : WASHHAIR
Independent : Sex + PADL score
The PADL score has to be defined as a categorical variable
Variable / B / S.E. / Wald / df / Sig. / Exp(B)
padl / 46,121 / 16 / ,000
padl(1) / -42,609 / 11533,591 / ,000 / 1 / ,997 / ,000
padl(2) / -42,721 / 27839,275 / ,000 / 1 / ,999 / ,000
padl(3) / -42,350 / 19723,588 / ,000 / 1 / ,998 / ,000
padl(4) / -23,003 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(5) / -22,719 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(6) / -21,630 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(7) / -21,352 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(8) / -21,336 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(9) / -22,347 / 2281,692 / ,000 / 1 / ,992 / ,000
padl(10) / -20,464 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(11) / -19,617 / 2281,692 / ,000 / 1 / ,993 / ,000
padl(12) / -,101 / 7618,819 / ,000 / 1 / 1,000 / ,904
padl(13) / -18,296 / 2281,692 / ,000 / 1 / ,994 / ,000
padl(14) / -17,949 / 2281,692 / ,000 / 1 / ,994 / ,000
padl(15) / -,034 / 5509,874 / ,000 / 1 / 1,000 / ,967
padl(16) / -16,723 / 2281,692 / ,000 / 1 / ,994 / ,000
Sex(1) / -1,435 / ,580 / 6,124 / 1 / ,013 / ,238
Logistic regression permits
1) Analysis of several different potential sources of DIF at the same time
2) Covariates with more than two categories
Variables in the Equation
Variable / B / S.E. / Wald / df / Sig. / Exp(B)
padl(1) / -43,018 / 10945,971 / ,000 / 1 / ,997 / ,000
padl(2) / -43,145 / 28314,566 / ,000 / 1 / ,999 / ,000
. / . / . / . / . / . / .
padl(15) / -,083 / 5537,202 / ,000 / 1 / 1,000 / ,920
padl(16) / -16,708 / 2373,083 / ,000 / 1 / ,994 / ,000
Sex(1) / -,773 / ,660 / 1,369 / 1 / ,242 / ,462
socgrp / 1,908 / 6 / ,928
socgrp(1) / -1,378 / 1,626 / ,718 / 1 / ,397 / ,252
socgrp(2) / -1,472 / 1,441 / 1,043 / 1 / ,307 / ,229
socgrp(3) / -1,337 / 1,443 / ,858 / 1 / ,354 / ,263
socgrp(4) / -1,728 / 1,352 / 1,635 / 1 / ,201 / ,178
socgrp(5) / -1,608 / 1,946 / ,683 / 1 / ,409 / ,200
socgrp(6) / -1,027 / 1,928 / ,283 / 1 / ,594 / ,358
pensage / ,130 / ,043 / 8,936 / 1 / ,003 / 1,139
Pension age appears to be the real DIF source:
The later you were pensioned, the better your ability to wash your hair once the total PADL score has been taken into account.
Pension Age confounds the measurement of physical functioning.
Social class is not a source of DIF
Sex does not appear to be a source of DIF when Pension age and Social Class have been taken into account.
To check the result we remove Social Class from the analysis.
Variable / B / S.E. / Wald / df / Sig. / Exp(B)
padl(1) / -42,479 / 11693,480 / ,000 / 1 / ,997 / ,000
padl(16) / -16,778 / 2396,101 / ,000 / 1 / ,994 / ,000
Sex(1) / -,830 / ,642 / 1,671 / 1 / ,196 / ,436
pensage / ,122 / ,040 / 9,089 / 1 / ,003 / 1,130
Sex is still not significant
Ordinal variables
The problems:
1) Ordinal variables are treated as nominal by logistic regression.
2) Items with more than two categories cannot be handled by conventional logistic regression models.
The solutions:
1) Dichotomize items before the DIF analysis, but after calculation of the score.
2) Use partial gamma coefficients instead of Mantel-Haenszel statistics and logistic regression.
Why does dichotomization work?
Assume that
1) X and Y are independent
2) X has three categories
Dichotomize X: Z = 1 if X = 1
Z = 2 if X = 2 or 3
Independence of X and Y implies that Z and Y are also independent:
P(Z=1, Y = y) = P(X=1, Y = y) = P(X=1)P(Y=y) = P(Z=1)P(Y=y)
P(Z=2, Y=y) = P(X=2 or 3, Y = y) = P(X=2, Y = y) + P(X=3, Y = y)
= P(X=2)P(Y=y) + P(X=3)P(Y=y)
= (P(X=2)+P(X=3))P(Y=y)
= P(Z=2)P(Y=y)
In other words
P(Z=z,Y=y) = P(Z=z)P(Y=y)
Z and Y are independent
Evidence against independence of Z and Y is also evidence against independence of X and Y
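The argument can also be checked numerically. The sketch below builds a joint distribution for independent X and Y, collapses X into Z exactly as above, and measures how far the collapsed table is from the product of its margins. The marginal probabilities are hypothetical.

```python
# Hypothetical marginal distributions for X (three categories) and Y (two categories).
p_x = {1: 0.2, 2: 0.5, 3: 0.3}
p_y = {0: 0.4, 1: 0.6}

# Joint distribution under independence: P(X = x, Y = y) = P(X = x) P(Y = y).
joint_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}


def dichotomize(x):
    """Z = 1 if X = 1, Z = 2 if X = 2 or 3."""
    return 1 if x == 1 else 2


# Collapse the X categories to obtain the joint distribution of Z and Y.
joint_zy = {}
for (x, y), p in joint_xy.items():
    key = (dichotomize(x), y)
    joint_zy[key] = joint_zy.get(key, 0.0) + p

p_z = {z: sum(p for (zz, _), p in joint_zy.items() if zz == z) for z in (1, 2)}

# Largest deviation from the product form P(Z = z) P(Y = y); zero up to rounding.
max_deviation = max(
    abs(joint_zy[(z, y)] - p_z[z] * p_y[y]) for z in p_z for y in p_y
)
print(max_deviation)
```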
The partial gamma coefficient
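Stratum ABC / Y / 1 / 2 / 3 / Concordance / Discordance / Gamma
111 / 1 / a / b / c / $C_{111}$ / $D_{111}$ / $\gamma_{111} = (C_{111} - D_{111})/(C_{111} + D_{111})$
111 / 2 / d / e / f
112 / 1 / g / h / i / $C_{112}$ / $D_{112}$ / $\gamma_{112} = (C_{112} - D_{112})/(C_{112} + D_{112})$
112 / 2 / j / k / l
… / … / … / … / …
222 / 1 / m / n / o / $C_{222}$ / $D_{222}$ / $\gamma_{222} = (C_{222} - D_{222})/(C_{222} + D_{222})$
222 / 2 / p / q / r
Total / $C_{tot} = \sum_{abc} C_{abc}$ / $D_{tot} = \sum_{abc} D_{abc}$ / $\gamma_{partial} = (C_{tot} - D_{tot})/(C_{tot} + D_{tot})$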
Calculate gamma coefficients in each stratum
The partial gamma coefficient is a weighted mean of the gamma coefficients in the different strata of the table
The partial gamma for social class & “wash hair” calculated over strata defined by PADL and pension age is $\gamma$ = -0.10 (p = 0.742)
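A minimal sketch of the partial gamma calculation, assuming the coefficient is formed by pooling the concordant and discordant pair counts over strata. This is equivalent to weighting each stratum's gamma by its share of concordant plus discordant pairs. The two strata below are hypothetical.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in one two-way table of counts."""
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:
                concordant += count * sum(lower_row[j + 1:])
                discordant += count * sum(lower_row[:j])
    return concordant, discordant


def partial_gamma(strata):
    """Pool concordant and discordant pairs over strata, then form gamma."""
    c_tot = d_tot = 0
    for table in strata:
        c, d = concordant_discordant(table)
        c_tot += c
        d_tot += d
    return (c_tot - d_tot) / (c_tot + d_tot)


# Hypothetical 2 x 3 tables (rows = Y, columns = the ordinal variable), one per stratum.
strata = [
    [[10, 5, 2], [3, 6, 9]],
    [[4, 4, 4], [4, 4, 4]],
]
print(partial_gamma(strata))
```

The first stratum has gamma = 162/228 and the second has gamma = 0; weighting them by 228/324 and 96/324 gives the same value as the pooled calculation.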
Dichotomized items
SF-36
10 items, Y1,…,Y10, with responses coded 0, 1 and 2
Let Zi be a dichotomized version of Yi.
If $Y_i \perp X \mid S = \sum_i Y_i$ then $Z_i \perp X \mid S = \sum_i Y_i$
DIF analysis of dichotomized items
Calculate the PF score from responses to the items with three response categories.
Dependent item : Zi
Independent variables: S and covariates
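The order of operations matters: the score is computed from the original three-category responses, and the item is dichotomized only afterwards. The sketch below builds one 2 x 2 table per score value for a single item and a binary covariate, ready for a Mantel-Haenszel or logistic regression analysis. The data, the variable names and the cut-point (Z = 1 when the response is 1 or 2) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical data: item responses coded 0/1/2 and a binary covariate per person.
item_responses = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 0],
    [0, 2, 0],
    [2, 2, 2],
    [1, 0, 1],
]
covariate = [0, 1, 0, 1, 0, 1]
item_index = 0  # the item under study

# One 2 x 2 table per score value: rows = covariate group, columns = Z.
tables = defaultdict(lambda: [[0, 0], [0, 0]])
for responses, group in zip(item_responses, covariate):
    score = sum(responses)                      # S uses the original 0/1/2 coding
    z = 1 if responses[item_index] >= 1 else 0  # assumed cut-point, applied afterwards
    tables[score][group][z] += 1

tables = dict(tables)
for score in sorted(tables):
    print(score, tables[score])
```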
DIF analysis of PF3, “Carrying groceries”
Exogenous covariates:
-Gender
-Smoking
-Age
-BMI
Variable / B / S.E. / Wald / df / Sig. / Exp(B)
Smoking / 1,153 / 4 / ,886
Smoking(0) / 0 / - / - / - / -
Smoking(1) / -,313 / ,421 / ,553 / 1 / ,457 / ,731
Smoking(2) / -,105 / ,698 / ,023 / 1 / ,880 / ,900
Smoking(3) / ,132 / ,272 / ,236 / 1 / ,627 / 1,141
Smoking(4) / -,069 / ,232 / ,089 / 1 / ,766 / ,933
Gender(1) / -,834 / ,208 / 16,045 / 1 / ,000 / ,434
Age / -,012 / ,005 / 5,032 / 1 / ,025 / ,988
bmi / ,061 / ,025 / 5,843 / 1 / ,016 / 1,063
pfscore / 225,266 / 20 / ,000
pfscore(1) / -42,490 / 10906,376 / ,000 / 1 / ,997 / ,000
pfscore(2) / -42,222 / 13228,396 / ,000 / 1 / ,997 / ,000
. / . / . / . / . / . / .
pfscore(19) / -18,925 / 1169,243 / ,000 / 1 / ,987 / ,000
pfscore(20) / -16,424 / 1169,243 / ,000 / 1 / ,989 / ,000
Evidence of DIF relative to Gender, Age and BMI
Psychometric models
We need regression models to describe the relationship between the latent variable and the item responses.
Psychometric models
Type of items / Regression model / Psychometric model
Dichotomous (0/1) / / Rasch model
Dichotomous (0/1) / / Two-parameter IRT model
Ordinal / / Rasch model for ordinal items
Ordinal / Assume that Yi is a categorized version of a continuous response in a linear regression model / Factor analysis
Interval scale / Linear regression / Factor analysis
The choice of psychometric model depends on the type of item.
Scale validity requires fit to an appropriate measurement model
The analysis should address all consequences of assumptions of the model:
1) Unidimensionality
2) Monotonicity and consistency
3) Criterion validity
4) Local independence
5) No DIF
The technical requirements
The following requirements are equivalent
Sufficiency
Objectivity
No DIF in the sense that
Items $\perp$ exogenous variables $\mid$ Score
They all imply that the psychometric model has to be a Rasch model
The Rasch model is characterized by another graph in addition to the one associated with construct validity
The Rasch graph
Reliability
The definition of reliability generalizes without change from classical psychometrics to item response theory and factor analysis.
The conditional distribution of the score given the latent variable, $P(S = \sum_i Y_i \mid \Theta = \theta)$, can be determined from the regression models, $P(Y_i \mid \Theta = \theta)$.
The conditional expected value,
$T(\theta) = E(S \mid \Theta = \theta)$, with $S = \sum_i Y_i$,
is a monotone function of $\theta$.
$T(\theta)$ is the true score
The true score of psychometric models
$T(\theta)$ is a function of $\theta$. The distribution of T is therefore determined by the distribution of $\Theta$.
It has a variance, Var(T), that can be compared with the variance of S to give us a measure of reliability,
Reliability = Var(T)/Var(S)
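As a worked illustration, the sketch below computes T(θ), Var(T), Var(S) and the reliability for dichotomous Rasch items, assuming $P(Y_i = 1 \mid \theta) = \exp(\theta - \beta_i)/(1 + \exp(\theta - \beta_i))$ and local independence. The item difficulties `betas` and the discrete distribution of θ are hypothetical.

```python
import math


def rasch_probability(theta, beta):
    """P(Y = 1 | theta) for a dichotomous Rasch item with difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def true_score(theta, betas):
    """T(theta) = E(S | theta): the sum of the item probabilities."""
    return sum(rasch_probability(theta, b) for b in betas)


def conditional_variance(theta, betas):
    """Var(S | theta) under local independence."""
    probabilities = [rasch_probability(theta, b) for b in betas]
    return sum(p * (1.0 - p) for p in probabilities)


# Hypothetical item difficulties and a hypothetical discrete distribution of theta.
betas = [-1.5, -0.5, 0.0, 0.5, 1.5]
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
weights = [0.1, 0.2, 0.4, 0.2, 0.1]

mean_t = sum(w * true_score(t, betas) for t, w in zip(thetas, weights))
var_t = sum(w * (true_score(t, betas) - mean_t) ** 2 for t, w in zip(thetas, weights))
error_var = sum(w * conditional_variance(t, betas) for t, w in zip(thetas, weights))
var_s = var_t + error_var  # Var(S) = Var(T) + E[Var(S | theta)]
reliability = var_t / var_s
print(round(reliability, 3))
```

The decomposition of Var(S) used here is the law of total variance, so the ratio stays between 0 and 1 whatever distribution of θ is plugged in.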
Some complications
The measurement error, $S - T(\theta)$, will not in general be independent of T because the variance of the error will be small when T is close to the boundaries of the score range.
The test-retest correlation will therefore not be exactly equal to the reliability.
Some possibilities
The true score is a function of $\theta$: $T(\theta)$
The value of $\theta$ may therefore be estimated by the inverse function: $\hat{\theta} = T^{-1}(S)$.
Better measures of reliability may be obtained by
1) comparison of the variance of the estimate of $\theta$ to $\mathrm{Var}(\Theta)$,
2) calculation of the correlation between the true value of $\theta$ and the estimate of $\theta$.
These possibilities will be pursued later.
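Because T is increasing in θ, the inverse can be found numerically. The sketch below solves T(θ) = S by bisection for dichotomous Rasch items, assuming the item difficulties are known. The difficulties, the search interval and the tolerance are hypothetical choices, and the extreme scores 0 and k are excluded because they have no finite solution.

```python
import math


def true_score(theta, betas):
    """T(theta) for dichotomous Rasch items with difficulties betas."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in betas)


def estimate_theta(score, betas, lower=-10.0, upper=10.0, tolerance=1e-8):
    """Solve T(theta) = score by bisection; T is increasing in theta."""
    if not 0 < score < len(betas):
        raise ValueError("extreme scores have no finite solution")
    while upper - lower > tolerance:
        middle = (lower + upper) / 2.0
        if true_score(middle, betas) < score:
            lower = middle
        else:
            upper = middle
    return (lower + upper) / 2.0


betas = [-1.5, -0.5, 0.0, 0.5, 1.5]  # hypothetical item difficulties
estimates = {s: estimate_theta(s, betas) for s in range(1, len(betas))}
for s, theta_hat in estimates.items():
    print(s, round(theta_hat, 3))
```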