Lecture 7: Item Analysis

I. Methods of Objective (Selected Response) Item Analysis

A. Empirical (quantitative) analysis of selected response items

  1. Difficulty (p): The proportion of students who get an item correct.
  2. Discrimination (d): The difference between the proportion correct in the top half of the class and the proportion correct in the bottom half of the class.

B. Judgmental (qualitative) analysis of items

  1. Yourself
     a. Unclear/Incomplete directions
     b. Reading vocabulary too difficult
     c. Complicated syntax
     d. Ambiguous items
     e. Inadequate time limits
     f. Inappropriate level of difficulty of the test items
     g. Poorly constructed test items
     h. Improper arrangement of items
     i. Unintended clues
  2. Your colleagues: Find a colleague to share this task. You check his/her exams and he/she checks yours. Use the above criteria.
  3. Your students: Ask students who do well in the subject to give you feedback about the clarity of directions and items, usefulness of distractors, timing, difficulty, and any other issues you may question about the performance of your test.

II. The 3 Quality Characteristics of Items

A. Difficulty for the group:
B. Ability of the item to discriminate:
C. Plausibility of the distractors:

III. Item Difficulty Analysis

Item Difficulty Index, p: the proportion (or percentage, if you multiply p by 100) of students who answer the item correctly out of all those who responded to the item.

  1. p ranges in value between 0 and +1.
  2. The closer p is to +1, the easier the item is. So you want to think of the difficulty index as an easiness index.
  3. Your teacher-made tests should have items at different levels of difficulty: some should be easy, some moderately easy, some fairly difficult, and, for more advanced students, some very difficult.
  4. Standardized norm-referenced tests are written to be more difficult than teacher-made tests. Items for these tests are selected on the basis that 50-55% of the students can answer them correctly, no more and no less.

Interpretation* / Very Difficult / Fairly Difficult / Moderately Easy / Very Easy
Percentage who got the item correct / 0% - 49% / 50% - 69% / 70% - 89% / 90% - 100%

*Appropriate interpretations for elementary and secondary students taking teacher-made criterion referenced tests.
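The interpretation bands above can be written as a small lookup. This is a minimal sketch, not part of the original handout: the function name `interpret_difficulty` is hypothetical, and it assumes that a fractional percentage falling between two listed bands (for example 49.5%) belongs to the lower band, since the table only lists whole percentages.

```python
def interpret_difficulty(percent_correct: float) -> str:
    """Map the percentage of students who got an item correct to a label.

    Bands follow the handout's table for teacher-made criterion-referenced
    tests. Assumption: values between listed bands fall into the lower band.
    """
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    if percent_correct < 50:
        return "Very Difficult"
    if percent_correct < 70:
        return "Fairly Difficult"
    if percent_correct < 90:
        return "Moderately Easy"
    return "Very Easy"


# Example: label a few percentages taken from different bands.
for pct in (20, 60, 80, 96):
    print(pct, interpret_difficulty(pct))
```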

Calculating Item Difficulty Indices

R = # of students who answered the item RIGHT

N = total NUMBER of students who attempted the item

Item / R / N / p = R/N (proportion) / p x 100 (percent)
1 / 30 / 50 / 30/50 = .60 / 60%
2 / 40 / 50 / 40/50 = .80 / 80%
3 / 10 / 50 / 10/50 = .20 / 20%
4 / 48 / 50 / 48/50 = .96 / 96%
5 / 25 / 50
6 / 45 / 50
7 / 0 / 50
8 / 50 / 50
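The calculation p = R/N can be sketched in a few lines. The function name `difficulty_index` is hypothetical, and the example uses only the four items already worked in the table, so the remaining rows stay open as an exercise.

```python
def difficulty_index(right: int, attempted: int) -> float:
    """Return p = R / N for one item.

    right     -- R, number of students who answered the item correctly
    attempted -- N, number of students who attempted the item
    """
    if attempted <= 0:
        raise ValueError("N must be positive")
    if not 0 <= right <= attempted:
        raise ValueError("R must be between 0 and N")
    return right / attempted


# (R, N) pairs for items 1-4 from the table above.
worked_items = {1: (30, 50), 2: (40, 50), 3: (10, 50), 4: (48, 50)}

for item, (r, n) in worked_items.items():
    p = difficulty_index(r, n)
    print(f"Item {item}: p = {p:.2f} ({p:.0%})")
```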

Now you calculate the last four difficulty indices and answer the following questions.

_____Which item is the easiest?

_____Which item is the most difficult?

_____Which item would be most likely to be included on a norm-referenced test?
B. Evaluating Items That Seem to Have an Unreasonable (too easy or too difficult) Difficulty Index

  1. Skill Complexity
  2. Group’s Achievement Characteristics
  3. Multiple Items for the Same Objective
  4. Hierarchically Related Items

IV. Item Discrimination Analysis

Item Discrimination Index, d: This index tells how well the item functions to separate students into two groups, those who have achieved the objectives and those who have not. In the context of psychometrics, discrimination is a desirable quality of test items.

  1. d ranges in value between -1 and +1.
  2. There are three kinds of discriminators:
     a. Negative discriminators (This is never what we want)
     b. Non-discriminators (This may or may not be what we want)
     c. Positive discriminators (This is usually what we want)
  3. If d = -1.00, the item has maximum negative discrimination between groups, which means that people with low scores got it right and people with high scores missed the item (a bad thing).

If d = 0.00, the item is not helping us sort people into those two performance groups at all. This happens whenever the two groups answer the item correctly at the same rate, for example when everyone gets the item right or everyone misses it. These are non-discriminators.

If d = +1.00, the item has maximum positive discrimination between groups. This means that the item is helping you identify those students who have achieved your objectives (and those who have not!).

  4. Your teacher-made tests should have items with different levels of discrimination. However, none of your items should be negative discriminators. If you teach for mastery, many of your items may be non-discriminators.
  5. Standardized norm-referenced and criterion-referenced tests are written to be more discriminating than teacher-made tests. Items for these tests are designed to have discrimination indices of +.50 and above, making them very discriminating. (That is why students dislike them so much.)

Calculating Item Discrimination (d)

  1. Identify the groups: Put the tests in order from the lowest score to the highest score. Then divide the tests in half. If the two halves are not equal in number, put the extra test in the high group.

a. Upper group - Students scoring in the top half of the class. The upper group consists of those tests in the top half of your class (for this test). This doesn’t mean that these students are better or smarter; they simply achieved better than the other half on this particular test.

b. Lower group - Students scoring in the lower half of the class. The lower group consists of those tests in the bottom half of your class (for this test). Again, we aren’t judging these students as lesser; they just didn’t achieve as well as the other half on this particular test.

  2. Calculate a difficulty index (p) for each group:

a. p(upper) = “Proportion in upper group who got it right”

“# in upper group who got it right / # of students in upper group who answered the item”

b. p(lower) = “Proportion in lower group who got it right”

“# in lower group who got it right / # of students in lower group who answered the item”

  3. Find the difference between the two p values:

a. d = p(upper) - p(lower)

The difference between the proportions indicates whether the item was easier for the upper subgroup than for the lower subgroup, as we would expect. This process is based on the assumption that those who do well on the test should also do well on individual items.

This process is not used to label or track students. It is merely a way to determine how well the items serve their purpose.
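The three steps above (order the tests, split them in half with any extra test going to the high group, then subtract the two proportions) can be sketched as follows. The function name, the input layout, and the five-student sample data are all hypothetical. The sketch assumes every student attempted the item and that tests with tied total scores at the boundary are split in the order they were entered, a detail the procedure does not specify.

```python
def discrimination_from_scores(results):
    """Compute p(upper), p(lower) and d for one item.

    results -- list of (total_test_score, answered_item_correctly) pairs,
               one per student. Assumes every student attempted the item.
    """
    if len(results) < 2:
        raise ValueError("need at least two tests to form two groups")

    # Step 1: order tests from lowest to highest total score, then halve.
    ordered = sorted(results, key=lambda pair: pair[0])
    lower_size = len(ordered) // 2  # any extra test lands in the upper group
    lower, upper = ordered[:lower_size], ordered[lower_size:]

    # Step 2: difficulty index (proportion correct) within each group.
    p_upper = sum(1 for _, correct in upper if correct) / len(upper)
    p_lower = sum(1 for _, correct in lower if correct) / len(lower)

    # Step 3: d is the difference between the two proportions.
    return p_upper, p_lower, p_upper - p_lower


# Hypothetical class of five students: (total score, got this item right?)
sample = [(90, True), (55, False), (70, False), (85, True), (60, True)]
p_upper, p_lower, d = discrimination_from_scores(sample)
print(f"p(upper) = {p_upper:.2f}, p(lower) = {p_lower:.2f}, d = {d:.2f}")
```

With an odd class size, the upper group here holds three tests and the lower group two, matching the rule that the extra test goes to the high group.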

B. Practice calculating discrimination

N = 20, n = 10

Group / A / B / C / *D
Upper group / 0 / 0 / 2 / 8
Lower group / 2 / 4 / 2 / 2

*Denotes correct response

To find p, we take the total who got it correct divided by N (we will assume that everyone attempted the item).

10/20 = .50 = p

To find d, we first find p for each group and then subtract the lower group's p from the upper group's p. The p for each group is the number in that group who got the item correct divided by n.

p(upper) = 8/10 = .80

p(lower) = 2/10 = .20

d = .80 - .20 = .60

Try another one:

N = 20, n = 10

Group / A / *B / C / D
Upper group / 1 / 7 / 1 / 1
Lower group / 4 / 0 / 1 / 5

p(upper) = 7/10 = .70

p(lower) = 0/10 = .00

d = .70 - .00 = .70

N = 30, n = 15

Group / A / B / *C / D
Upper group / 4 / 0 / 10 / 1
Lower group / 4 / 0 / 6 / 5

p(upper) = 10/15 = .67

p(lower) = 6/15 = .40

d = .67 - .40 = .27

p = 16/30 = .53
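The worked examples above all start from a table of option counts per group. As a minimal sketch (the function name `item_stats` and the dictionary layout are hypothetical), the same arithmetic can be done directly from those counts, assuming as the examples do that everyone attempted the item:

```python
def item_stats(upper_counts, lower_counts, correct):
    """Return (p, d) for one item from per-option counts in each group.

    upper_counts, lower_counts -- dicts mapping option label to the number
                                  of students in that group who chose it
    correct                    -- label of the keyed (correct) option
    """
    n_upper = sum(upper_counts.values())
    n_lower = sum(lower_counts.values())
    right_upper = upper_counts[correct]
    right_lower = lower_counts[correct]

    p = (right_upper + right_lower) / (n_upper + n_lower)   # overall difficulty
    d = right_upper / n_upper - right_lower / n_lower       # p(upper) - p(lower)
    return p, d


# First worked example: N = 20, n = 10, correct response D.
p1, d1 = item_stats({"A": 0, "B": 0, "C": 2, "D": 8},
                    {"A": 2, "B": 4, "C": 2, "D": 2}, correct="D")

# Third worked example: N = 30, n = 15, correct response C.
p3, d3 = item_stats({"A": 4, "B": 0, "C": 10, "D": 1},
                    {"A": 4, "B": 0, "C": 6, "D": 5}, correct="C")

print(f"Example 1: p = {p1:.2f}, d = {d1:.2f}")
print(f"Example 3: p = {p3:.2f}, d = {d3:.2f}")
```

Rounded to two decimals, these agree with the hand calculations for the first and third examples.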

N = 30, n = 15

*ABCD

Upper group3822

Lower group11013

p(upper) =

p(lower) =

d =

p =

N = 22, n = 11

A*BCD

Upper group11000

Lower group3431

p(upper) =

p(lower) =

d =

p =

N = 26, n = 13

Group / A / B / C / *D
Upper group / 2 / 5 / 6 / 0
Lower group / 4 / 3 / 6 / 0

p(upper) =

p(lower) =

d =

p =

C. Evaluating Test Items Using Difficulty and Discrimination Indices

You need to establish minimally acceptable criteria for the indices, p and d.

  1. Typically, a minimum difficulty level of .50 is satisfactory for elementary and secondary students if:

  2. To establish standards for item discrimination, you need to consider the difficulty of the test items. The difficulty will limit the possible values for d.

p-values / Expectations for d-values
< .30 / Any positive #
.30-.70 / d ≥ .4
.71-.90 / d ≥ .2
.91-1.00 / Any positive #
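The expectations table can be turned into a quick check. This is a sketch under stated assumptions, not the handout's own rule: the function name is hypothetical, the thresholds for the two middle bands are read as "d of at least .4" and "d of at least .2" (consistent with the later worked item where p = .30 and d = .4 is described as acceptable), the last band is taken to run from .91 to 1.00, and p and d are rounded to two decimals before comparison, as in the hand calculations.

```python
def meets_discrimination_expectation(p: float, d: float) -> bool:
    """Check d against the expectation for an item of difficulty p.

    Assumptions (not stated explicitly in the table): middle-band thresholds
    are inclusive, and both indices are rounded to two decimals first.
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if not -1 <= d <= 1:
        raise ValueError("d must be between -1 and 1")

    p, d = round(p, 2), round(d, 2)
    if 0.30 <= p <= 0.70:
        return d >= 0.4
    if 0.70 < p <= 0.90:
        return d >= 0.2
    # Very hard (p < .30) or very easy (p >= .91) items: any positive d.
    return d > 0


# A moderately difficult item needs d of at least .4 to pass this check.
print(meets_discrimination_expectation(0.50, 0.60))
print(meets_discrimination_expectation(0.50, 0.27))
```

Rounding first avoids a floating-point surprise: a difference such as .7 - .3 is stored as slightly less than .4 and would otherwise fail the inclusive threshold.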

V. Item Distractor Analysis

A. Potential Problems with Distractors: (or, how students identify your distractors!)

1) Silliness:

2) Unfamiliar information or terminology:

3) Grammatical mismatches between stem & distractors:

4) Distractor length:

Rules of Thumb for analyzing your distractors:

  1. When no one in the upper group chooses a distractor we don’t worry about it. They know more and it wouldn’t be unusual if everyone in the upper group got the item correct.
  2. When no one in the lower group chooses a distractor, it isn’t doing its job. If this happens you should work on a new distractor.
  3. If no one in the lower group chooses a distractor, but people in the upper group do, you should ask the upper group why they chose it and use that information to improve the item.

B. Analyzing distractors

N = 20

Group / a. / *b. / c. / d. / n
upper / 2 / 5 / 2 / 1 / 10
lower / 3 / 1 / 5 / 1 / 10

p = 6/20 = .30

d = (5/10 - 1/10) = .5 - .1 = .4

All distractors are being chosen. The difficulty index is somewhat low (the item is hard), but discrimination is within the acceptable range. We probably want to make this item easier.

N = 20

Group / a. / b. / *c. / d. / n
upper / 0 / 0 / 10 / 0 / 10
lower / 3 / 1 / 5 / 1 / 10

p = 15/20 = .75

d = (10/10 - 5/10) = 1 - .5 = .5

The lower group is choosing all distractors. No one in the upper group has chosen the distractors, but we expect them to know more, so that is not problematic. Difficulty is within the acceptable range and discrimination is above what we would expect for an item this easy. The item is fine.

N = 20

Group / *a. / b. / c. / d. / n
upper / 2 / 0 / 8 / 0 / 10
lower / 7 / 1 / 1 / 1 / 10

p = 9/20 = .45

d = (2/10 - 7/10) = .2 - .7 = -.5

The lower group has chosen all distractors. However, there is an obvious problem in the upper group. Eight of them chose distractor C and only two chose the correct answer. More people in the lower group got the item correct, making discrimination negative, which is unacceptable. Distractor C, being chosen overwhelmingly by the upper group, may well have an aspect of correctness that the lower group does not recognize. It should be eliminated and the item discarded or rewritten.

N = 20

Group / a. / b. / c. / *d. / n
upper / 0 / 0 / 0 / 10 / 10
lower / 0 / 1 / 9 / 0 / 10

p = 10/20 = .5

d = (10/10 - 0/10) = 1 - 0 = 1

This item is a perfect positive discriminator. Everyone in the upper group got it right, everyone in the lower group got it wrong. Difficulty is acceptable. However, no one in the lower group chose distractor A, which means that it serves no purpose. We should replace it with a more plausible distractor.
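The rules of thumb from part A can be applied mechanically to the four items analyzed in this section. The sketch below is illustrative only: the names `review_item` and `unused_by_lower` and the wording of the notes are hypothetical, and the counts are read from the four tables above with n = 10 per group.

```python
def unused_by_lower(lower, correct):
    """Distractors (wrong options) that nobody in the lower group chose."""
    return [opt for opt, count in lower.items() if opt != correct and count == 0]


def review_item(upper, lower, correct):
    """Return (p, d, notes) for one item from per-option counts."""
    n_upper, n_lower = sum(upper.values()), sum(lower.values())
    p = (upper[correct] + lower[correct]) / (n_upper + n_lower)
    d = upper[correct] / n_upper - lower[correct] / n_lower

    notes = []
    if round(d, 2) < 0:
        notes.append("negative discriminator: discard or rewrite the item")
    for opt in unused_by_lower(lower, correct):
        if upper[opt] > 0:
            # Rule 3: only the upper group chose it, so ask them why.
            notes.append(f"distractor {opt}: ask the upper group why they chose it")
        else:
            # Rule 2: the lower group ignores it, so it is not doing its job.
            notes.append(f"distractor {opt}: unused by lower group, write a new distractor")
    # Rule 1: a distractor ignored by the upper group alone is not flagged.
    return round(p, 2), round(d, 2), notes


items = [
    ({"a": 2, "b": 5, "c": 2, "d": 1}, {"a": 3, "b": 1, "c": 5, "d": 1}, "b"),
    ({"a": 0, "b": 0, "c": 10, "d": 0}, {"a": 3, "b": 1, "c": 5, "d": 1}, "c"),
    ({"a": 2, "b": 0, "c": 8, "d": 0}, {"a": 7, "b": 1, "c": 1, "d": 1}, "a"),
    ({"a": 0, "b": 0, "c": 0, "d": 10}, {"a": 0, "b": 1, "c": 9, "d": 0}, "d"),
]

reviews = [review_item(upper, lower, key) for upper, lower, key in items]
for number, (p, d, notes) in enumerate(reviews, start=1):
    print(f"Item {number}: p = {p}, d = {d}, notes = {notes}")
```

A check like this only flags counts. Deciding why the upper group favored a distractor, as in the third item, still takes the judgmental review described in section I.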
