
Running head: MODERN MEASUREMENT

Final Project: Scale Equating, Item Response Theory, and Differential Item Functioning

Paper submitted in partial fulfillment of requirements for

EDEP 651: Modern Measurement Applications for Education and Behavioral Sciences

Author: Faye Huie

December 12, 2010

GRADUATE SCHOOL OF EDUCATION

George Mason University

Fairfax VA

Dimiter Dimitrov, Ph.D., Instructor

Fall 2010


Introduction

The purpose of this final project is to synthesize and integrate the knowledge accumulated in EDEP 651: Modern Measurement.

Equating (Section 1) is a statistical process for generating comparable scores on different tests. Specifically, equating addresses the question: can scores from different tests measuring the same construct be made comparable? For example, if two forms of a math knowledge test are administered to one class of students, we must be sure that differences in achievement on the two forms are in fact due to ability differences rather than to differences in the difficulty of the items on the two forms. The method used here to equate the two tests (Test Y and Test X) is called the anchor/common item approach. This approach uses items that appear on both tests to place the difficulty and discrimination indices from Test X on the scale of Test Y. This can be done because we assume that the parameters from the two forms are linearly related. Therefore, we can equate the two scales using the following formula:

1) b_Y = α · b_X + β

This is the mathematical expression of the linear relationship between the difficulty values of the common items on Test Y and Test X. From this relationship we can calculate the constants α (alpha) and β (beta) from the common items on both tests: α is the ratio of the standard deviations of the common-item difficulties (Test Y over Test X), and β is the mean common-item difficulty on Test Y minus α times the mean common-item difficulty on Test X. Next, we use the following two formulas to transform the difficulty and discrimination values on Test X so that they are on the scale of Test Y:

Difficulty: b(X→Y) = α · b_X + β

Discrimination: a(X→Y) = a_X / α

To determine the final difficulty and discrimination values for the anchor items, we take the newly equated Test X value and average it with the corresponding value on Test Y. See Section 1 for the results of equating the scores on Test Y and Test X.

Item Response Theory (Section 2) is an approach that places item parameters and person parameters on the same scale. Under the one-parameter (Rasch) model, the probability of answering a question correctly, given a person's ability, depends only on the item's difficulty:

ln(P_ni / (1 − P_ni)) = θ_n − δ_i, where P_ni is the probability that person n answers item i correctly

Therefore, IRT addresses a major pitfall of classical test theory (CTT): CTT statistics are test dependent, whereas IRT parameters are not. See Section 2 for further discussion of IRT as well as my interpretation of the output from a two-parameter model.
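To illustrate this relationship, here is a minimal Python sketch (not part of the original analysis; the function name and example values are mine, for illustration only):

import math

def rasch_probability(theta, delta):
    # Rasch model: ln(P / (1 - P)) = theta - delta,
    # so P = 1 / (1 + exp(-(theta - delta))).
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# A person whose ability equals the item's difficulty has a 0.5 chance of success.
print(rasch_probability(theta=0.0, delta=0.0))   # 0.5
# A person one logit above the item's difficulty succeeds about 73% of the time.
print(rasch_probability(theta=1.0, delta=0.0))   # ~0.731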

Differential Item Functioning (Section 3) is a process through which we can examine issues of fairness. Specifically, an item is functioning differentially if people with the same ability score have different probabilities of answering it correctly. There are several different methods for assessing differential item functioning, but the most robust is the simulated data method. Because the simulated data are generated from the item parameters of the entire group, separately for the focal and reference groups, the simulated data contain no DIF, and any area between the two simulated curves represents random fluctuation. Therefore, we can be confident that the greatest area between the two simulated ICCs is the maximum value to use as the cut-off for the actual data. Any item whose area exceeds this cut-off value is flagged as functioning differentially. See Section 3 for a fuller explanation as well as a step-by-step application of the simulated data method of detecting DIF.
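As a rough illustration of the area comparison just described (this is my own sketch, not the exact procedure from class; the two-parameter curve form, item parameters, and integration grid are assumed purely for illustration):

import numpy as np

def icc(theta, a, b):
    # Two-parameter logistic item characteristic curve.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def area_between_iccs(a_ref, b_ref, a_foc, b_foc, lo=-4.0, hi=4.0, n=801):
    # Approximate unsigned area between the reference- and focal-group ICCs.
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc(theta, a_ref, b_ref) - icc(theta, a_foc, b_foc))
    return np.trapz(gap, theta)

# Hypothetical values: the largest area between the two simulated (no-DIF) ICCs
# sets the cut-off, and an observed item is flagged if its area exceeds it.
cutoff = area_between_iccs(0.50, 0.10, 0.52, 0.05)    # random fluctuation only
observed = area_between_iccs(0.50, 0.10, 0.48, 0.60)  # real difficulty difference
print(cutoff, observed, observed > cutoff)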


1)  Equating

Equating is a statistical process that places the scores on one test on the same scale as the scores on another, similar test. This allows researchers to combine two tests into one, even when they are reported on two separate scales. The only requirement is that both tests contain "common/anchor items," that is, items that appear on both tests. The anchor items provide the framework for placing the different tests on one scale. For instance, students are typically separated into different classes within a school. If students from three fourth-grade classes took three separate math tests, equating would allow the results of the three tests to be placed on the same scale.

Please see the Excel file for the specific formulas used to equate Tests X and Y.

Equating constants (calculated from the five common/anchor items):
Alpha = 1.09170349
Beta = -0.088812
Average difficulty of the common items on Test Y = -0.222
Average difficulty of the common items on Test X = -0.122

(In the column labels below, the suffix "a" denotes the discrimination parameter and "b" denotes the difficulty parameter.)

Items appearing only on Test X (values equated onto the Test Y scale):
Item Number / Test_Xa / Equated Discrimination / Test_Xb / Equated Difficulty
6 / 0.43 / 0.4687 / -0.77 / -0.9283
7 / 0.63 / 0.6867 / 0.34 / 0.2816
8 / 0.58 / 0.6322 / -0.14 / -0.2416
9 / 0.39 / 0.4251 / 0.87 / 0.8593
10 / 0.81 / 0.8829 / -0.96 / -1.1354
11 / 0.46 / 0.5014 / 0.55 / 0.5105
12 / 0.42 / 0.4578 / 0.14 / 0.0636
13 / 0.74 / 0.8066 / -0.84 / -1.0046
14 / 0.46 / 0.5014 / -0.85 / -1.0155
15 / 0.4 / 0.436 / 1.1 / 1.11
16 / 0.47 / 0.5123 / -0.09 / -0.1871
17 / 0.54 / 0.5886 / -0.63 / -0.7757
18 / 0.5 / 0.545 / -1.42 / -1.6368
19 / 0.43 / 0.4687 / -0.11 / -0.2089

Common/anchor items (equated Test X values averaged with the Test Y values to give the final anchor scores):
Item Number / Test_Ya / Test_Xa / Equated Discrimination / Anchor Discrimination / Test_Yb / Test_Xb / Equated Difficulty / Anchor Difficulty
1c / 0.45 / 0.7 / 0.763 / 0.6065 / -1.17 / -0.88 / -1.0482 / -1.1091
2c / 0.44 / 0.51 / 0.5559 / 0.49795 / -2.5 / -2.25 / -2.5415 / -2.52075
3c / 0.33 / 0.42 / 0.4578 / 0.3939 / 0.14 / 0.17 / 0.0963 / 0.11815
4c / 0.33 / 0.4 / 0.436 / 0.383 / -0.17 / -0.11 / -0.2089 / -0.18945
5c / 0.36 / 0.38 / 0.4142 / 0.3871 / 2.59 / 2.46 / 2.5924 / 2.5912
Items appearing only on Test Y (values remain on the Test Y scale):
Test_Ya / Equated Discrimination / Test_Yb / Equated Difficulty
0.5 / 0.5 / -0.62 / 0.5
0.53 / 0.53 / -1.01 / 0.53
0.4 / 0.4 / 0.23 / 0.4
0.36 / 0.36 / 1.96 / 0.36
0.41 / 0.41 / 0.39 / 0.41
0.41 / 0.41 / 1.95 / 0.41
0.7 / 0.7 / -0.31 / 0.7
0.44 / 0.44 / 0.87 / 0.44
0.76 / 0.76 / -0.25 / 0.76
0.49 / 0.49 / 1.04 / 0.49
0.58 / 0.58 / 1.07 / 0.58
0.53 / 0.53 / -0.9 / 0.53
0.42 / 0.42 / 1.65 / 0.42
0.48 / 0.48 / 0.45 / 0.48
0.45 / 0.45 / 0.11 / 0.45
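As a check on the spreadsheet, here is a minimal Python sketch of the alpha/beta calculation and the difficulty transformation, using the five anchor-item difficulties from the table above (mean/sigma approach; small differences from the spreadsheet values can arise from rounding of the constants):

import statistics

# Anchor-item difficulties from the table above (items 1c-5c).
b_y = [-1.17, -2.50, 0.14, -0.17, 2.59]   # Test Y
b_x = [-0.88, -2.25, 0.17, -0.11, 2.46]   # Test X

# Constants for the linear relationship b_Y = alpha * b_X + beta.
alpha = statistics.pstdev(b_y) / statistics.pstdev(b_x)
beta = statistics.mean(b_y) - alpha * statistics.mean(b_x)
print(alpha, beta)                             # approximately 1.0917 and -0.0888

# Transform a Test X difficulty onto the Test Y scale (item 6, b = -0.77).
print(alpha * -0.77 + beta)                    # approximately -0.93

# For an anchor item, average the equated Test X value with the Test Y value.
print(((alpha * -0.88 + beta) + -1.17) / 2)    # approximately -1.11 for item 1c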


2)  IRT Analysis: Analyzing output

Rasch analysis and IRT analysis are similar but conceptually different. Specifically, the goal of Rasch modeling is to select items that fit the model and align with the examinees' ability levels. The assumption is that we obtain the most accurate ability measures when item difficulty aligns with the examinees' actual ability. Therefore, we would get no information in a Rasch analysis if all students answered a question either correctly or incorrectly; we need a mix of students who answer correctly and incorrectly in order to obtain accurate person parameters. In IRT analysis, by contrast, we test whether the model fits the data. This is why only one item parameter is estimated in Rasch analysis (item difficulty), whereas up to three item parameters can be modeled in IRT analysis (difficulty, discrimination, and pseudo-guessing). IRT can compare several different models to see how many parameters should be included; the model that fits the data best is then used to analyze the items.
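Because the XCALIBRE run below estimates a and b while holding c at zero, it is effectively a two-parameter model. As a minimal sketch (my own illustration; XCALIBRE's internal scaling may differ), the two-parameter logistic response function looks like this in Python:

import math

def p_correct_2pl(theta, a, b, D=1.702):
    # Two-parameter logistic model: P = 1 / (1 + exp(-D * a * (theta - b))).
    # D of about 1.7 is the usual constant that approximates the normal ogive.
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Item 1 in the output below has a = 0.57 and b = -0.99, so an examinee of
# average ability (theta = 0) is expected to answer it correctly most of the time.
print(round(p_correct_2pl(0.0, 0.57, -0.99), 2))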

Please see below for the output from XCALIBRE and my interpretation of the results. My interpretations follow the relevant portions of the output.

First, we need to check whether the theta scores are normally distributed, using an SPSS Q-Q plot:

[SPSS normal Q-Q plot of the estimated theta scores]

From the plot, we can see that the theta scores are indeed normally distributed.
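The plot above was produced in SPSS; an equivalent check could be run in Python (a sketch only; the file name is hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical text file of estimated theta scores, one value per line.
thetas = np.loadtxt("theta_scores.txt")

# Normal Q-Q plot: points falling close to the reference line suggest normality.
stats.probplot(thetas, dist="norm", plot=plt)
plt.title("Q-Q plot of estimated theta scores")
plt.show()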

XCALIBRE (tm) for Windows -- Version 1.10 Page 2

Copyright (c) 1995 by Assessment Systems Corporation, All Rights Reserved

Marginal Maximum-Likelihood IRT Parameter Estimation Program

Date: Dec 02, 2010 Time: 3:37 PM

The input was from file: C:\0\Z.DAT

The number of items was: 34

There was no item linkage

The key was:

1111111111111111111111111111111111

The numbers of alternatives were:

2222222222222222222222222222222222

The inclusion specifications were:

YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

The maximum parameter change on loop 1 was 0.464

The maximum parameter change on loop 2 was 0.328

The maximum parameter change on loop 3 was 0.123

The maximum parameter change on loop 4 was 0.060

The maximum parameter change on loop 5 was 0.035

Mean Number-Correct Score = 16.979

Number-Correct Standard Deviation = 5.535

K-R 21 Reliability = 0.770

The number of examinees was 703

Final Parameter Summary Information:

          Mean     SD
Theta     0.00    1.00
a         0.46    0.10
b         0.09    1.05
c         0.00    0.00


FINAL ITEM PARAMETER ESTIMATES

Item Lnk Flg a b c Resid PC PBs PBt N Item name

------

1 0.57 -0.99 0.00 0.55 0.67 0.43 0.45 703

2 0.48 -2.30 0.00 0.53 0.83 0.30 0.30 703

3 0.36 0.16 0.00 1.21 0.48 0.25 0.25 703

4 0.37 -0.14 0.00 1.14 0.52 0.25 0.23 703

5 0.36 2.55 0.00 1.65 0.20 0.16 0.16 703

6 0.40 -0.80 0.00 0.97 0.61 0.31 0.29 703

7 0.54 0.35 0.00 0.53 0.44 0.43 0.44 703

8 0.53 -0.18 0.00 0.67 0.53 0.42 0.42 703

9 0.37 0.88 0.00 0.98 0.39 0.27 0.26 703

10 0.70 -1.03 0.00 0.51 0.70 0.49 0.52 703

11 0.40 0.58 0.00 0.67 0.42 0.31 0.30 703

12 0.38 0.13 0.00 1.31 0.49 0.27 0.26 703

13 0.63 -0.91 0.00 0.44 0.67 0.47 0.49 703

14 0.44 -0.86 0.00 0.51 0.63 0.36 0.35 703

15 0.37 1.17 0.00 1.24 0.35 0.25 0.25 703

16 0.43 -0.12 0.00 0.56 0.52 0.35 0.33 703

17 0.54 -0.66 0.00 0.82 0.61 0.42 0.43 703

18 0.50 -1.38 0.00 0.56 0.72 0.38 0.38 703

19 0.43 -0.13 0.00 0.61 0.52 0.34 0.33 703

20 0.48 -0.64 0.00 0.82 0.60 0.38 0.38 703

21 0.49 -1.03 0.00 0.60 0.66 0.39 0.39 703

22 0.36 0.27 0.00 1.06 0.47 0.26 0.24 703

23 0.34 1.99 0.00 1.25 0.27 0.19 0.17 703

24 0.36 0.42 0.00 1.08 0.45 0.26 0.24 703

25 0.42 1.89 0.00 0.86 0.24 0.28 0.28 703

26 0.67 -0.29 0.00 0.28 0.56 0.50 0.52 703

27 0.39 0.90 0.00 0.71 0.38 0.30 0.29 703

28 0.66 -0.25 0.00 0.62 0.55 0.50 0.52 703

29 0.47 1.04 0.00 0.55 0.34 0.37 0.37 703

30 0.55 1.12 0.00 0.78 0.31 0.42 0.43 703

31 0.44 -1.00 0.00 0.79 0.65 0.35 0.34 703

32 0.39 1.71 0.00 0.71 0.27 0.28 0.27 703

33 0.42 0.49 0.00 0.59 0.43 0.33 0.33 703

34 0.40 0.15 0.00 0.74 0.48 0.30 0.29 703

Test characteristics:
K-R 21 Reliability = 0.770
Expected Information = 4.242
Average Information = 3.431


Test Information Curve

[XCALIBRE test information curve: information (vertical axis, scaled from 0 to 8.0) plotted against ability (horizontal axis, -3.0 to 3.0); the curve reaches its maximum, a little above 4.0, for ability values between about -1 and 0.]

This test information curve is the sum of all of the item information curves. The highest point of the curve indicates where the standard error of the theta estimates is smallest. For this test, the highest point of the curve falls where ability is between about -1 and 0, which indicates that the test provides its most accurate information for students with ability scores between -1 and 0. A cutting score in this range would therefore work well for a criterion-referenced use of the test: people above the cut score would pass while people below it would fail, and the ability estimates are most accurate right around that cutting score.
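To show where a curve like this comes from, here is a minimal Python sketch that sums item information over items, assuming the logistic two-parameter form with the usual scaling constant D = 1.702 (item information D² · a² · P · (1 − P)); only the first three items from the output are used, and XCALIBRE's exact computation may differ:

import numpy as np

D = 1.702  # assumed logistic scaling constant

def item_information(theta, a, b):
    # Two-parameter item information: I(theta) = D^2 * a^2 * P * (1 - P).
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def test_information(theta, params):
    # The test information curve is simply the sum of the item information curves.
    return sum(item_information(theta, a, b) for a, b in params)

# (a, b) for the first three items in the output above; the real curve uses all 34.
params = [(0.57, -0.99), (0.48, -2.30), (0.36, 0.16)]
theta = np.linspace(-3.0, 3.0, 7)
print(np.round(test_information(theta, params), 3))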

Test Characteristic Curve

[XCALIBRE test characteristic curve: estimated proportion correct (vertical axis, 0 to 1.00) plotted against ability (horizontal axis, -3.0 to 3.0); the curve rises monotonically with ability in the usual S shape.]

This test characteristic curve is the sum of all of the item characteristic curves (ICCs). For a person at a given ability level, it gives the sum of that person's correct-response probabilities across all of the items, expressed here as an expected proportion correct.
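A minimal Python sketch of this idea, computing the expected proportion correct as the average of the item probabilities at a given ability (only the first four items from the output are used, and the two-parameter logistic form with D = 1.702 is assumed):

import numpy as np

def p_correct(theta, a, b, D=1.702):
    # Two-parameter logistic probability of a correct response.
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def expected_proportion_correct(theta, params):
    # Test characteristic curve: the mean of the item probabilities at each theta
    # (equivalently, the sum of the probabilities divided by the number of items).
    return np.mean([p_correct(theta, a, b) for a, b in params], axis=0)

# (a, b) for the first four items in the output above; the real curve uses all 34.
params = [(0.57, -0.99), (0.48, -2.30), (0.36, 0.16), (0.37, -0.14)]
for t in (-2.0, 0.0, 2.0):
    print(t, round(float(expected_proportion_correct(t, params)), 2))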