EDEP 651: Final Project 3
Running head: MODERN MEASUREMENT
Final Project: Scale Equating, Item Response Theory, and Differential Item Functioning
Paper submitted in partial fulfillment of requirements for
EDEP 651: Modern Measurement Applications for Education and Behavioral Sciences
Author: Faye Huie
December 12, 2010
GRADUATE SCHOOL OF EDUCATION
George Mason University
Fairfax VA
Dimiter Dimitrov, Ph.D., Instructor
Fall 2010
Introduction
The purpose of this final project is to synthesize and integrate the knowledge accumulated in EDEP 651: Modern Measurement.
Equating (pg 3) is a statistical process for generating comparable scores on different tests. Specifically, equating addresses the question: can scores from different tests measuring the same construct be made comparable to each other? For example, if two forms of a test of math knowledge are administered to one class of students, we must be sure that differences in achievement on the two forms are in fact due to ability differences as opposed to difficulty differences between the items on the two forms. The method used to equate the two tests (Test Y and Test X) here is called the anchor/common item approach. This approach uses the items common to both tests to place the difficulty and discrimination indices from Test X on the scale of Test Y. This can be done because we assume that the item parameters from the two forms are linearly related. Therefore, we can equate the two scales by using the following formula:
1) $b_Y = \alpha b_X + \beta$
This is the mathematical expression of the linear relationship between the Y and X common items. From this formula, we can calculate the constants alpha and beta from the common items on Tests Y and X. Next, we use the following two formulas to transform the difficulty and discrimination values on Test X to the scale of Test Y:
Difficulty: $b_X^{*} = \alpha b_X + \beta$
Discrimination: $a_X^{*} = \alpha a_X$
To determine the new difficulty and discrimination values for the anchor items, we take the newly equated Test X value and average it with the corresponding value on Test Y. See page 3 for the results of equating the scores on Test Y and Test X.
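As an illustration of how the two constants might be obtained, the sketch below assumes the mean-sigma approach: alpha is the ratio of the standard deviations of the anchor-item difficulties, and beta aligns their means. That choice of method is an assumption, although it does reproduce the alpha and beta values reported with the equating results. The function and variable names are hypothetical.

```python
from statistics import mean, pstdev

# Difficulty (b) values of the five anchor items, taken from the equating results.
anchor_b_y = [-1.17, -2.50, 0.14, -0.17, 2.59]  # Test Y
anchor_b_x = [-0.88, -2.25, 0.17, -0.11, 2.46]  # Test X


def mean_sigma_constants(b_y, b_x):
    """Return (alpha, beta) such that b_y is approximately alpha * b_x + beta.

    Assumes the mean-sigma method: alpha is the ratio of the standard
    deviations and beta aligns the two means.
    """
    alpha = pstdev(b_y) / pstdev(b_x)
    beta = mean(b_y) - alpha * mean(b_x)
    return alpha, beta


alpha, beta = mean_sigma_constants(anchor_b_y, anchor_b_x)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Because alpha is a ratio of standard deviations, using the sample rather than the population standard deviation for both forms would give the same constants.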
Item Response Theory (pg 5) is a framework through which we can examine the item parameters and person parameters on the same scale. In its simplest form, the probability of answering an item correctly depends only on the difference between the person’s ability (θ) and the item’s difficulty (δ).
Log-odds of a correct response by person n on item i: $\ln\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \theta_n - \delta_i$
Therefore, IRT addresses a major pitfall of classical test theory (CTT): where CTT is test dependent, IRT is not. See page 5 for further discussion of IRT as well as my interpretation of the output from a two-parameter model.
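To make the log-odds relation concrete, the sketch below solves it for the probability of a correct response. The function names and the example ability and difficulty values are hypothetical.

```python
import math


def rasch_probability(theta, delta):
    """Probability of a correct response when log(P / (1 - P)) = theta - delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def log_odds(p):
    """Log-odds (logit) of a probability p."""
    return math.log(p / (1.0 - p))


# A person whose ability equals the item difficulty has a 50% chance of success.
print(rasch_probability(theta=0.5, delta=0.5))

# Only the difference theta - delta matters, not the two values separately.
p1 = rasch_probability(theta=1.0, delta=0.0)
p2 = rasch_probability(theta=2.5, delta=1.5)
print(math.isclose(p1, p2))
```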
Differential Item Functioning (pg 10) is a process through which we can examine issues of fairness. Specifically, an item functions differentially if people with the same ability score have different probabilities of answering it correctly depending on their group. There are several methods for assessing differential item functioning, but a robust one is the simulated data method. Because the simulated data for the focal and reference groups are both generated from item parameters estimated on the entire group, the simulated data contain no DIF, and any area between the two simulated item characteristic curves (ICCs) represents random fluctuation. Therefore, the greatest area between the two simulated ICCs can serve as the cut-off value for the actual data. Any item whose area exceeds this maximum cut-off value is then flagged as functioning differentially. See page 10 for a fuller explanation as well as a step-by-step approach to the simulated data method of detecting DIF.
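The area comparison described above can be sketched numerically. The example assumes two-parameter logistic ICCs and a trapezoid-rule integral over a finite ability range. Every parameter value, the `simulated_areas` list, and the function names are hypothetical and serve only to show the cut-off logic.

```python
import math


def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def area_between_iccs(params_ref, params_focal, lo=-10.0, hi=10.0, steps=4000):
    """Unsigned area between two ICCs on [lo, hi] using the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        theta = lo + k * width
        gap = abs(icc(theta, *params_ref) - icc(theta, *params_focal))
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * gap
    return total * width


def flag_dif(real_areas, simulated_areas):
    """Flag items whose real-data area exceeds the largest simulated (no-DIF) area."""
    cutoff = max(simulated_areas)
    return [item for item, area in real_areas.items() if area > cutoff]


# Hypothetical areas from simulated data, where any gap is random fluctuation.
simulated_areas = [0.05, 0.11, 0.08, 0.14]

# Hypothetical real-data item parameters (a, b) for the reference and focal groups.
real_areas = {
    "item_1": area_between_iccs((1.0, 0.0), (1.0, 0.05)),
    "item_2": area_between_iccs((1.0, 0.0), (1.0, 0.50)),
}
print(flag_dif(real_areas, simulated_areas))
```

When two items share the same discrimination, the area between their curves approaches the absolute difference in difficulty, which gives a quick sanity check on the integral.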
1) Equating
Equating is a statistical process that places scores on one test on the same scale as the scores on another similar test. This allows researchers to combine two tests into one, even when they are on two separate scales. The only requirement in this case is for both tests to have “common/anchor items”. Common/anchor items are items that appear on both tests. The anchor items provide a framework for scaling the different tests into one. For instance, students are typically separated into different classes within a school. If students from three fourth grade classes took three separate math tests that share anchor items, equating would put the scores from the three tests on the same scale.
Please see the Excel file for how I equated Tests X and Y with these formulas.
Alpha value = 1.09170349
Beta value = -0.088812
Average difficulty on test Y common items = -0.222
Average difficulty on test X common items = -0.122

Test X items
Item Number / Test_Xa / Equated Values_Discrimination / Test_Xb / Equated Values_Difficulty
6 / 0.43 / 0.4687 / -0.77 / -0.9283
7 / 0.63 / 0.6867 / 0.34 / 0.2816
8 / 0.58 / 0.6322 / -0.14 / -0.2416
9 / 0.39 / 0.4251 / 0.87 / 0.8593
10 / 0.81 / 0.8829 / -0.96 / -1.1354
11 / 0.46 / 0.5014 / 0.55 / 0.5105
12 / 0.42 / 0.4578 / 0.14 / 0.0636
13 / 0.74 / 0.8066 / -0.84 / -1.0046
14 / 0.46 / 0.5014 / -0.85 / -1.0155
15 / 0.4 / 0.436 / 1.1 / 1.11
16 / 0.47 / 0.5123 / -0.09 / -0.1871
17 / 0.54 / 0.5886 / -0.63 / -0.7757
18 / 0.5 / 0.545 / -1.42 / -1.6368
19 / 0.43 / 0.4687 / -0.11 / -0.2089

Common/anchor items
Item Number / Test_Ya / Test_Xa / Equated Values_Discrimination / Equated Scores for Anchor items (discrimination) / Test_Yb / Test_Xb / Equated Values_Difficulty / Equated Scores for Anchor items (difficulty)
1c / 0.45 / 0.7 / 0.763 / 0.6065 / -1.17 / -0.88 / -1.0482 / -1.1091
2c / 0.44 / 0.51 / 0.5559 / 0.49795 / -2.5 / -2.25 / -2.5415 / -2.52075
3c / 0.33 / 0.42 / 0.4578 / 0.3939 / 0.14 / 0.17 / 0.0963 / 0.11815
4c / 0.33 / 0.4 / 0.436 / 0.383 / -0.17 / -0.11 / -0.2089 / -0.18945
5c / 0.36 / 0.38 / 0.4142 / 0.3871 / 2.59 / 2.46 / 2.5924 / 2.5912

Test Y items
Test_Ya / Equated Values_Discrimination / Test_Yb / Equated Values_Difficulty
0.5 / 0.5 / -0.62 / 0.5
0.53 / 0.53 / -1.01 / 0.53
0.4 / 0.4 / 0.23 / 0.4
0.36 / 0.36 / 1.96 / 0.36
0.41 / 0.41 / 0.39 / 0.41
0.41 / 0.41 / 1.95 / 0.41
0.7 / 0.7 / -0.31 / 0.7
0.44 / 0.44 / 0.87 / 0.44
0.76 / 0.76 / -0.25 / 0.76
0.49 / 0.49 / 1.04 / 0.49
0.58 / 0.58 / 1.07 / 0.58
0.53 / 0.53 / -0.9 / 0.53
0.42 / 0.42 / 1.65 / 0.42
0.48 / 0.48 / 0.45 / 0.48
0.45 / 0.45 / 0.11 / 0.45
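The arithmetic behind the table can be sketched as follows. This is a reconstruction from the tabled values, not the Excel formulas themselves. Two assumptions are built in: the constants appear to have been rounded to 1.09 and -0.089 before use, and the equated discrimination values in the table equal the Test X values multiplied by alpha, so the sketch does the same. Presentations of the mean-sigma method usually divide discrimination by alpha instead, so treat that step as a reproduction of the table rather than a reference formula. Function names are hypothetical.

```python
ALPHA = 1.09    # alpha value from the table, rounded to two decimals (assumption)
BETA = -0.089   # beta value from the table, rounded to three decimals (assumption)


def equate_difficulty(b_x):
    """Place a Test X difficulty on the Test Y scale."""
    return ALPHA * b_x + BETA


def equate_discrimination(a_x):
    """Rescale a Test X discrimination the way the tabled values imply."""
    return ALPHA * a_x


def anchor_value(y_value, equated_x_value):
    """Average the Test Y value with the equated Test X value for an anchor item."""
    return (y_value + equated_x_value) / 2


# Item 6 (Test X only): a = 0.43, b = -0.77
print(round(equate_discrimination(0.43), 4), round(equate_difficulty(-0.77), 4))

# Anchor item 1c: Test Y (a = 0.45, b = -1.17), Test X (a = 0.70, b = -0.88)
a_1c = anchor_value(0.45, equate_discrimination(0.70))
b_1c = anchor_value(-1.17, equate_difficulty(-0.88))
print(round(a_1c, 4), round(b_1c, 4))
```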
2) IRT Analysis: Analyzing output
Rasch analysis and IRT analysis are similar, but conceptually different. Specifically, the goal of Rasch modeling is to select items that align with ability levels. The assumption here is that we get the most accurate ability measures when a person's actual ability aligns with the item difficulty. Therefore, we would get no information in Rasch analysis if all students answered a question correctly or all answered it incorrectly; we need roughly equal numbers of students who get the answer right and wrong so we can attain accurate person parameters. However, in IRT analysis, we are testing whether the model fits the data. This is why only one item parameter is estimated in Rasch analysis (item difficulty), whereas up to three item parameters can be modeled in IRT analysis (difficulty, discrimination, and pseudo-guessing). IRT can then compare several models to see how many parameters should be included; the model that fits the data best would then be used to analyze the items.
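The relation among the one-, two-, and three-parameter models can be sketched with a single response function. The logistic form below is the commonly used one, but whether a scaling constant such as D = 1.7 is applied depends on the software, so `scale` is left as an explicit, assumed parameter. The function name and example values are hypothetical.

```python
import math


def prob_correct(theta, b, a=1.0, c=0.0, scale=1.0):
    """Three-parameter logistic model.

    b: difficulty, a: discrimination, c: pseudo-guessing (lower asymptote).
    With c = 0 this is a two-parameter model; with c = 0 and a common a for
    every item it has the one-parameter (Rasch-type) form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


theta = 0.0
print(prob_correct(theta, b=0.0))                  # one parameter: 0.5 when theta == b
print(prob_correct(theta, b=0.0, a=0.57))          # two parameters: still 0.5 at theta == b
print(prob_correct(theta, b=0.0, a=0.57, c=0.2))   # guessing lifts this to about 0.6
```

Fitting the simpler model amounts to fixing `c` (and possibly `a`) rather than estimating it, which is why the models can be compared for fit on the same data.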
Please see below for the output from XCalibre and my interpretation of the results. All of my interpretations are in red.
First, we need to check whether the theta scores are normally distributed, using an SPSS Q-Q plot:
From this plot, we can see that the theta scores are indeed normally distributed.
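For readers without SPSS, the idea behind the Q-Q plot can be sketched with the Python standard library: sort the scores, pair them with the matching standard normal quantiles, and check how close the pairs fall to a straight line. The theta values below are simulated stand-ins, not the project data, and the plotting positions (i - 0.5) / n are one common convention rather than necessarily the one SPSS uses.

```python
import random
from statistics import NormalDist, mean


def qq_pairs(scores):
    """Return (theoretical normal quantile, observed score) pairs for a Q-Q plot."""
    ordered = sorted(scores)
    n = len(ordered)
    standard_normal = NormalDist()
    quantiles = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def pearson_r(pairs):
    """Pearson correlation of the paired values; near 1 suggests normality."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(651)  # arbitrary seed for repeatability
simulated_theta = [random.gauss(0.0, 1.0) for _ in range(703)]
r = pearson_r(qq_pairs(simulated_theta))
print(f"Q-Q correlation: {r:.3f}")
```

Points that bend away from the line at the ends would indicate heavy or light tails, which the single correlation value can hide, so the plot itself is still worth inspecting.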
XCALIBRE (tm) for Windows -- Version 1.10 Page 2
Copyright (c) 1995 by Assessment Systems Corporation, All Rights Reserved
Marginal Maximum-Likelihood IRT Parameter Estimation Program
Date: Dec 02, 2010 Time: 3:37 PM
The input was from file: C:\0\Z.DAT
The number of items was: 34
There was no item linkage
The key was:
1111111111111111111111111111111111
The numbers of alternatives were:
2222222222222222222222222222222222
The inclusion specifications were:
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
The maximum parameter change on loop 1 was 0.464
The maximum parameter change on loop 2 was 0.328
The maximum parameter change on loop 3 was 0.123
The maximum parameter change on loop 4 was 0.060
The maximum parameter change on loop 5 was 0.035
Mean Number-Correct Score = 16.979
Number-Correct Standard Deviation = 5.535
K-R 21 Reliability = 0.770
The number of examinees was 703
Final Parameter Summary Information:
Mean SD
Theta 0.00 1.00
a 0.46 0.10
b 0.09 1.05
c 0.00 0.00
FINAL ITEM PARAMETER ESTIMATES
Item Lnk Flg a b c Resid PC PBs PBt N Item name
------
1 0.57 -0.99 0.00 0.55 0.67 0.43 0.45 703
2 0.48 -2.30 0.00 0.53 0.83 0.30 0.30 703
3 0.36 0.16 0.00 1.21 0.48 0.25 0.25 703
4 0.37 -0.14 0.00 1.14 0.52 0.25 0.23 703
5 0.36 2.55 0.00 1.65 0.20 0.16 0.16 703
6 0.40 -0.80 0.00 0.97 0.61 0.31 0.29 703
7 0.54 0.35 0.00 0.53 0.44 0.43 0.44 703
8 0.53 -0.18 0.00 0.67 0.53 0.42 0.42 703
9 0.37 0.88 0.00 0.98 0.39 0.27 0.26 703
10 0.70 -1.03 0.00 0.51 0.70 0.49 0.52 703
11 0.40 0.58 0.00 0.67 0.42 0.31 0.30 703
12 0.38 0.13 0.00 1.31 0.49 0.27 0.26 703
13 0.63 -0.91 0.00 0.44 0.67 0.47 0.49 703
14 0.44 -0.86 0.00 0.51 0.63 0.36 0.35 703
15 0.37 1.17 0.00 1.24 0.35 0.25 0.25 703
16 0.43 -0.12 0.00 0.56 0.52 0.35 0.33 703
17 0.54 -0.66 0.00 0.82 0.61 0.42 0.43 703
18 0.50 -1.38 0.00 0.56 0.72 0.38 0.38 703
19 0.43 -0.13 0.00 0.61 0.52 0.34 0.33 703
20 0.48 -0.64 0.00 0.82 0.60 0.38 0.38 703
21 0.49 -1.03 0.00 0.60 0.66 0.39 0.39 703
22 0.36 0.27 0.00 1.06 0.47 0.26 0.24 703
23 0.34 1.99 0.00 1.25 0.27 0.19 0.17 703
24 0.36 0.42 0.00 1.08 0.45 0.26 0.24 703
25 0.42 1.89 0.00 0.86 0.24 0.28 0.28 703
26 0.67 -0.29 0.00 0.28 0.56 0.50 0.52 703
27 0.39 0.90 0.00 0.71 0.38 0.30 0.29 703
28 0.66 -0.25 0.00 0.62 0.55 0.50 0.52 703
29 0.47 1.04 0.00 0.55 0.34 0.37 0.37 703
30 0.55 1.12 0.00 0.78 0.31 0.42 0.43 703
31 0.44 -1.00 0.00 0.79 0.65 0.35 0.34 703
32 0.39 1.71 0.00 0.71 0.27 0.28 0.27 703
33 0.42 0.49 0.00 0.59 0.43 0.33 0.33 703
34 0.40 0.15 0.00 0.74 0.48 0.30 0.29 703
Test characteristics: K-R 21 Reliability = 0.770; Expected Information = 4.242; Average Information = 3.431
[Figure: Test Information Curve; vertical axis: information (2.0 to 8.0), horizontal axis: ability (-3.0 to 3.0)]
This test information curve is the sum of all item information curves. The highest point of this curve indicates the smallest standard error for the estimation of theta. For example, in this test, the highest point of the curve falls at an ability of roughly -1 to 0. This indicates that the test provides the most accurate information for students with ability scores between -1 and 0. Therefore, this region would be a good place to set a cutting score for a criterion-referenced test: people above the cutting score would pass and people below it would fail, and the test's accuracy would be highest at that cutting score.
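The link between information and standard error can be sketched as follows, assuming the two-parameter logistic form with an explicit scaling constant. Whether XCalibre applies D = 1.7 is an assumption here, and only a subset of items is used, so the numbers are illustrative rather than a reproduction of the curve. The (a, b) pairs are items 1 to 5 from the final item parameter estimates; the function names are hypothetical.

```python
import math

D = 1.7  # logistic scaling constant (assumed; set to 1.0 for an unscaled metric)

# (a, b) for items 1 to 5 from the final item parameter estimates.
items = [(0.57, -0.99), (0.48, -2.30), (0.36, 0.16), (0.37, -0.14), (0.36, 2.55)]


def item_information(theta, a, b):
    """Information of a two-parameter logistic item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)


def total_information(theta, item_params):
    """Test information is the sum of the item information values."""
    return sum(item_information(theta, a, b) for a, b in item_params)


def standard_error(theta, item_params):
    """Standard error of the theta estimate: 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(total_information(theta, item_params))


grid = [x / 10 for x in range(-30, 31)]
peak = max(grid, key=lambda t: total_information(t, items))
print(f"information peaks near theta = {peak}")
print(f"standard error there = {standard_error(peak, items):.2f}")
```

Each item contributes most where theta equals its difficulty, so items with high discrimination pull the peak of the curve toward their own difficulty values.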
[Figure: Test Characteristic Curve; vertical axis: estimated proportion correct (0.25 to 1.00), horizontal axis: ability (-3.0 to 3.0)]
This test characteristic curve is based on the sum of all item characteristic curves (ICCs) at each ability level. It is plotted as the estimated proportion correct, that is, the sum of the probabilities on all the items divided by the number of items.
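The sum described above, and its rescaling to a proportion, can be sketched as follows. The example assumes two-parameter logistic ICCs without a scaling constant (an assumption; the software's metric may differ). The three items are hypothetical and chosen symmetric around zero so the result is easy to check by hand.

```python
import math


def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def expected_number_correct(theta, item_params):
    """Test characteristic curve: the sum of the ICCs at ability theta."""
    return sum(icc(theta, a, b) for a, b in item_params)


def expected_proportion_correct(theta, item_params):
    """The same curve rescaled to 0..1 by dividing by the number of items."""
    return expected_number_correct(theta, item_params) / len(item_params)


# Hypothetical items with equal discrimination and difficulties -1, 0, and 1.
items = [(0.5, -1.0), (0.5, 0.0), (0.5, 1.0)]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(expected_proportion_correct(theta, items), 3))
```

At theta = 0 the easy and hard items offset each other, so the expected proportion correct is one half, and the curve rises steadily with ability on either side.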