Introduction to Rasch Modeling – BF Ch 2
Two general types of scales
I. Scales for which there are one or more (but usually 1) correct answers to each item – Correct Answer Scales
Items on such scales may offer multiple response options.
Typically, scoring is dichotomous:
1 = any correct response; 0 = any incorrect response.
Usually there is only one correct answer and several incorrect answers.
In the vast majority of informal situations involving such tests (e.g., by people like me)
1) items are not selected to represent different difficulty levels
2) a respondent’s score is the number of correct responses, i.e., the sum of the 1s.
3) only respondent scores are obtained (along with reliability estimates).
4) there is no formal statement of the probability of responding to each item.
II. Scales for which there is no correct answer – Dimension scales
Such scales typically allow multiple responses.
Each response is given a different value.
Response values are typically ordered with respect to the dimension represented by the items – larger response values represent “more” of whatever dimension is represented.
e.g., the IPIP conscientiousness items (the trailing 1 or 0 is the scoring key: 1 = positively keyed, 0 = reverse keyed) . . .
3 Am always prepared. 1
8 Leave my belongings around. 0
13 Pay attention to details. 1
18 Make a mess of things. 0
23 Get chores done right away. 1
28 Often forget to put things back in their proper place. 0
33 Like order. 1
38 Shirk my duties. 0
43 Follow a schedule. 1
48 Am exacting in my work. 1
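To make the scoring concrete, here is a minimal Python sketch (mine, not from the text): responses on a 1–5 scale are reverse-scored for the 0-keyed items before summing. The response values are made up.

keys = {3: 1, 8: 0, 13: 1, 18: 0, 23: 1, 28: 0, 33: 1, 38: 0, 43: 1, 48: 1}   # item keying
responses = {3: 4, 8: 2, 13: 5, 18: 1, 23: 3, 28: 2, 33: 4, 38: 2, 43: 5, 48: 4}  # hypothetical
scored = [r if keys[i] == 1 else 6 - r for i, r in responses.items()]  # flip reverse-keyed items
total = sum(scored)            # summated score
mean = total / len(scored)     # or, preferably, the mean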
Typically, items are chosen so that they are “more or less equivalent ‘detectors’ of the phenomenon of interest – that is they are more or less parallel . . .” and “They are imperfect indicators of a common phenomenon that can be combined by a simple summation into an acceptably reliable scale.” (DeVellis, R. F. (2012). Scale Development: Theory and Applications. Thousand Oaks, CA: Sage.)
In the vast majority of situations using such scales
1) items are not selected to represent different “positions” on the dimension of interest
2) respondent scores are the sum (or, preferably, the mean) of the values assigned to the different responses
3) only respondent scores are obtained (along with reliability estimates).
4) there is no formal statement of the probability of responding to each item.
Some things that could be different
1) Items could be scored along with respondents
2) Items could be selected based on difficulty or position on the dimension.
3) The validity of use of the sum of response values as a measure of the respondent could be examined.
4) The “rationality” of a respondent’s responses could be examined.
We typically don’t do ANY of these things. We’re so bad at measurement.
All of these are part of testing based on item response theory.
IRT does the following
1) It scores items as well as scoring respondents.
Item scores are on the same metric as respondent scores.
They can be put on the same graph and compared, if that’s what you want to do.
2) Because items can be scored, they can be ordered with respect to difficulty.
This allows the tester to ensure that there are items covering all levels of respondent ability for Correct Answer Scales or all positions on a dimension for Dimension Scales.
Thus we avoid groups of respondents who cannot be discriminated because they missed every item (all items too hard for them) or answered every item correctly (all items too easy). In principle, all individuals can be ordered.
It also allows for adaptive testing, avoiding presenting multiple items to an individual that are far too easy or too difficult. Once the person’s ability is estimated, only those items around that ability level need be presented (see the sketch after this list).
3) Scoring to eliminate floor or ceiling effects associated with the use of summated scores can be considered. Bond and Fox use the phrase “interval scaled” in this regard.
4) If the items are ordered in terms of difficulty, then a person should get all the easy items correct and miss all the items that are too difficult for him/her. So, a person who misses easy items while getting more difficult items correct can be identified and dealt with appropriately.
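To illustrate the adaptive idea in point 2, here is a minimal Python sketch (the item pool, difficulties, and ability value are all made up for illustration): present next the unused item whose difficulty is closest to the current ability estimate.

pool = {"a": -1.5, "b": 0.9, "c": -2.0, "d": 0.4, "e": 1.8, "f": 0.0}  # item -> difficulty (logits)
administered = {"c", "a"}          # items already presented
ability = 0.3                      # current ability estimate (logits)
next_item = min((i for i in pool if i not in administered),
                key=lambda i: abs(pool[i] - ability))   # nearest remaining item: "d"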
Estimating Person Ability and Item difficulty – for Correct Answer Scales
Rasch modeling begins with the following table of responses to items of a test.
From it, estimates of person ability and item difficulty are obtained.
The example data are those in Table 2.1 in BF Ch 2.
In the example data, rows are people and columns are items. A check mark indicates that an item was answered correctly; an X indicates that it was answered incorrectly.
The Raw Score column at the left contains the number of check marks for each person – an initial estimate of person ability.
Note that person M missed all the items, and person N got them all correct.
Person Ability ≈ natural logarithm of (number correct / number incorrect)
= ln(# correct / # incorrect)
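For example, person C, with 9 of the 12 items correct, gets ln(9/3) = ln 3 ≈ 1.10.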
These are the same data as in Table 2.1 shown in the SPSS data editor window.
The check marks have been changed to 1s and the Xs changed to 0s.
The column ptot contains total scores.
The syntax to create prasch was a log odds transformation . . .
compute prasch = ln(ptot / (12-ptot)).
execute.
Note that Cases M and N have no values of prasch. That’s because the log odds is undefined when ptot is 0 (the logarithm of 0 is undefined) or 12 (division by 0). A minor technical glitch, for now.
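The same transformation in Python (a sketch; the variable names are mine, and the raw scores are read from the data listing at the end of this handout):

import math
ptot = [6, 4, 9, 5, 8, 7, 6, 3, 7, 10, 6, 8, 0, 12]       # raw scores for persons A-N
prasch = [math.log(p / (12 - p)) if 0 < p < 12 else None   # undefined at 0 and 12
          for p in ptot]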
The relationship of ptot to prasch is . . .
The “new, improved” measure is almost perfectly linearly related to the old measure!! Hmm.
Item Difficulty -
The data from Bond and Fox, Ch. 2 again . . .
The itot column is one that you don’t see in most informal testing situations.
The itot column is the number of persons who got the item correct – item easiness.
The Rasch estimate reverses the direction of itot; it is called item difficulty.
The syntax to compute irasch is a log odds transformation . . .
compute irasch = ln((14-itot)/(itot)).
execute.
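The same in Python (a sketch; the itot values are the column totals I count from the data listing at the end of this handout):

import math
itot = [11, 8, 13, 6, 3, 6, 2, 8, 12, 4, 7, 11]   # persons correct on items a-l
irasch = [math.log((14 - t) / t) for t in itot]   # item difficulty in logits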
The relationship of itot to irasch is as follows . . .
As was the case with person ability scores, the relationship for these data is quite nearly linear.
Recall that itot represents item easiness while irasch represents item difficulty.
Comments on BF Chapter 2.
Scalogram view is a table of responses to items which has been
sorted by rows so that the highest person totals are at the top and the lowest at the bottom and
sorted by columns so that the highest item totals are at the left and the lowest are at the right.
Original View of the data
Scalogram view of the same data
The most able persons are at the top of this view and the easiest items are at the left.
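Producing a scalogram view is just a double sort. A numpy sketch (with a tiny made-up matrix, only to show the sorting):

import numpy as np
X = np.array([[1, 0, 1, 0],              # made-up persons x items 0/1 responses
              [1, 1, 1, 1],
              [0, 0, 1, 0]])
row_order = np.argsort(-X.sum(axis=1))   # highest person totals at the top
col_order = np.argsort(-X.sum(axis=0))   # easiest items at the left
scalogram = X[row_order][:, col_order]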
Deletion of persons, items with all 0s or 1s
While the summated score of a person who got all items correct might make perfect sense, in Rasch analysis 100% correct (or 0% correct) corresponds to a Rasch value of + infinity (or – infinity). The same problem applies to items that every person got correct or every person missed. For this reason, “perfect” data are typically excluded from Rasch modeling: we may delete them from the data set by hand, or the estimation programs may delete them automatically.
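A numpy sketch of that screening step (mine, not any program’s actual code; note that deleting rows can create new all-0 or all-1 columns, so in practice the screening is repeated until nothing more drops out):

import numpy as np
X = np.array([[0, 0, 0, 0],              # made-up data with one all-0 and one all-1 row
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 1, 1, 1]])
keep_rows = (X.sum(axis=1) > 0) & (X.sum(axis=1) < X.shape[1])  # drop perfect persons
X = X[keep_rows]
keep_cols = (X.sum(axis=0) > 0) & (X.sum(axis=0) < X.shape[0])  # drop perfect items
X = X[:, keep_cols]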
Inadequacy of percentages, according to Bond and Fox:
“ . . . these n/N fractions should be regarded as merely orderings of the nominal categories, and as insufficient for the inference of interval relationship between the frequencies of observations.”
Inconsistency of persons and items.
A striking characteristic of Rasch modeling is the ability to automatically identify persons and items whose response patterns are not what they should be.
Such persons or items can sometimes be identified from examination of the Scalogram view.
Consider person K, for example. Person K got 4 easy items (c, i, a, and l) correct, then missed more difficult items b and h, then got item k correct, then missed item d, then got item f correct. Under a perfectly consistent pattern, person K should have had a series of check marks followed by a series of Xs.
Consider item d, for example. This item was answered correctly by the two most able persons (N and J), then missed by C, E, and L, then answered correctly by I and F, then missed by K, A, and G, then answered correctly by D and B. Item d should have had, reading from top to bottom, a string of check marks followed by a string of Xs.
We can get indications of bad items in the RELIABILITY procedure in SPSS using the “Alpha if item deleted” column. The contribution to reliability is related to the item inconsistency pointed out above, although they’re not identical.
But there is not a commonly used measure of person inconsistency in casual ability testing.
The issue of scaling.
B and F argue, correctly, that the ability difference reflected by the difference between two percentages near the middle of a distribution may not be equivalent to the ability difference reflected by an equal difference between two percentages near the tail of a distribution.
“Earning a few extra marks near the midpoint of the test results, say from 48 to 55, does not reflect the same ability leap required for a move from 88 to 95 at the top of the test or from 8 to 15 at the bottom.” (p. 24)
They then claim that a log odds transformation has the following characteristics . . .
“In addition to transforming the score from a merely ordinal scale to a mathematically more useful interval scale, a log odds scale avoids the problem of compression at the ends of the raw score scale due to its restricted range . . .”
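To see the compression numerically (my illustration, treating the scores as marks out of 100): in log odds, moving from 48 to 55 is ln(55/45) - ln(48/52) ≈ 0.20 - (-0.08) = 0.28 logits, while moving from 88 to 95 is ln(95/5) - ln(88/12) ≈ 2.94 - 1.99 = 0.95 logits. The same seven raw points correspond to more than three times the logit gain near the top of the test.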
Getting computer estimates
The log odds transformation shown above gives a rough estimate of the person ability and item difficulty values.
In practice, however, computer programs such as WINSTEPS (a stripped-down version of which is provided with the text) are used to estimate
an ability value, Bn, for each person, and
a difficulty value, Di, for each item.
Values of Bn and Di are found for each person and item such that
P(x=1|Bn,Di) = e^(Bn-Di) / (1 + e^(Bn-Di))
is as close as possible to the observed value of 0 or 1.
Note that the above expression cannot be smaller than 0 or larger than 1.
Some example computations from the text.
Suppose a person’s ability was 3 (very high) and that person was responding to an item whose difficulty was 1 (kind of high).
Pni(x=1|B(3),D(1)) = e^(3-1) / (1 + e^(3-1)) = 2.7183^2 / (1 + 2.7183^2) = 7.389 / 8.389 = 0.88
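The same computation in Python, as a check on the arithmetic:

import math
p = math.exp(3 - 1) / (1 + math.exp(3 - 1))
print(round(p, 2))    # 0.88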
The estimation is done, typically, by a “hill-climbing” algorithm that iteratively adjusts the B and D values until it finds the set that collectively minimizes the differences between the formula and the observed 0s and 1s.
This is a VERY simplified description of the process.
Some scientists have defined their careers on discovery of elegant ways to “climb the hill”.
Isaac Newton in the 1600s provided key mathematical insights into ways to improve the climb.
Amazingly, the estimates that are obtained using the sophisticated algorithms are roughly similar to the estimates we just obtained using the simple ln(p/(1-p)) formula. We’ll monitor those differences as we proceed.
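For the curious, here is a heavily simplified Python sketch of one such scheme: joint maximum likelihood with Newton–Raphson updates. This is my illustration of the general idea, not the algorithm WINSTEPS actually uses; X is assumed to be a persons x items array of 0s and 1s with the perfect rows and columns already screened out.

import numpy as np

def rasch_jmle(X, n_iter=100, tol=1e-6):
    n_persons, n_items = X.shape
    r = X.sum(axis=1)                    # person raw scores
    s = X.sum(axis=0)                    # item raw scores
    B = np.log(r / (n_items - r))        # start from the simple log odds estimates
    D = np.log((n_persons - s) / s)
    for _ in range(n_iter):
        P = 1 / (1 + np.exp(-(B[:, None] - D[None, :])))   # model probabilities
        W = P * (1 - P)                                    # information weights
        B_new = B + (r - P.sum(axis=1)) / W.sum(axis=1)    # Newton step for abilities
        D_new = D - (s - P.sum(axis=0)) / W.sum(axis=0)    # Newton step for difficulties
        D_new = D_new - D_new.mean()     # anchor the scale: mean item difficulty = 0
        converged = max(abs(B_new - B).max(), abs(D_new - D).max()) < tol
        B, D = B_new, D_new
        if converged:
            break
    return B, D

If run on the Chapter 2 data after the perfect rows and columns are screened out (note that removing persons M and N makes item c a perfect column too), the estimates should come out roughly similar to the simple log odds values, as noted above.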
Running the Bond and Fox Steps program.
Opening Screen Shot
Data are accessed by pulling down the “Data Files” menu. I did this and selected Bond&FoxChapter2.txt. The result is . . .
The data file that was just opened is simply a text file located in the Bond&FoxSteps folder.
Its full name is “C:\Bond&FoxSteps\Bond-data\Bond&FoxChapter2.txt”.
The menu sequence on my computer to open it so that its contents can be examined is
Start menu -> Computer -> Local Disk (C:) -> Bond&FoxSteps -> Bond-data -> Bond&FoxChapter2.txt
Since the file is a text file (not a Word or Excel or other MS doc), it opens in Notepad.
The program, Bond&FoxSteps is designed to read the file and display its contents in the fashion shown above.
&INST ; initial line (can be omitted)
Title = "Bond & Fox Table 2.1 Math Curriculum"
Item1 = 3 ; Response to first item is in column 3
NI = 12 ; Number of items is 12, items a-l
Name1 = 1 ; Person identifier starts in column 1
Xwide = 2 ; Each observation is 2 columns wide
Codes = "r X " ; valid codes are "r " and "X "
Newscore= "1 0 " ; r scored 1, and X scored 0
Total = Yes ; show total raw scores
&End
a 37+2 ; 12 item labels, one per line
b 56+4 ; Please use item text, where possible!
c 1+4 ; so I've added some dummy item text
d 27.3+34.09
e 4 1/4 + 2 1/8
f 2/4 + 1/4
g 4 1/2 + 2 5/8
h 86+28
i 6+3
j $509.74+93.25
k 2391+547+1210
l 7+8
; a b c d e f g h i j k l Score ; a comment to help read the data section.
END LABELS
A r r r X X X X r r X X r 6
B r X r r X X X X r X X X 4
C r r r X r X X r r r r r 9
D r X r r X X X X r X X r 5
E X r r X X r X r r r r r 8
F r r r r X X X r r X X r 7
G r X r X X r X X r X r r 6
H r X r X X X X X X X X r 3
I r r r r X X X r r X X r 7
J r r r r r r r r r X r X 10
K r X r X X r X X r X r r 6
L X r r X X r X r r r r r 8
M X X X X X X X X X X X X 0
N r r r r r r r r r r r r 12
We’ll have to learn some of the “syntax” shown above to manipulate our own data files. Later.
The Analysis of the Chapter 2 data.