Michael J. Walk

Modern Measurement Theories

Homework #8

13 April 2008

Question 1

As requested, two simulated data sets were created, one with 500 examinees and one with 5000 examinees; both sets of examinees were random samples from a normal θ distribution with μ = .75 and σ = 1.00. For all items, difficulty parameters (β) were sampled randomly from a uniform distribution ranging from –4 to +4. Twenty of the items (items 1 – 20) were generated using a Rasch model, with all discriminations constrained to α = 1.00 and all gamma (i.e., guessing) parameters constrained to γ = 0.00. Twenty of the items (items 21 – 40) were generated using a 2PL model, with discriminations sampled from a normal distribution with μ = 1.25 and σ = 1.00 and gamma parameters constrained to γ = 0.00. Twenty of the items (items 41 – 60) were generated using a 3PL model with the same distribution of discrimination parameters; however, gamma parameters were randomly sampled from a normal distribution with μ = .20 and σ = .05. Item parameters for both data sets are included in Appendix A.
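For reference, the generation scheme just described can be sketched in a few lines. The actual simulation was run in R; this Python sketch is mine, with hypothetical function names, and it omits refinements such as truncating negative guessing draws.

```python
import math
import random

def p_correct(theta, beta, alpha=1.0, gamma=0.0):
    # 3PL response probability; alpha = 1, gamma = 0 reduces it to the Rasch model
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-alpha * (theta - beta)))

def simulate(n_examinees, seed=1):
    rng = random.Random(seed)
    thetas = [rng.gauss(0.75, 1.0) for _ in range(n_examinees)]  # theta ~ N(.75, 1)
    items = []
    for j in range(60):
        beta = rng.uniform(-4.0, 4.0)           # difficulty ~ U(-4, 4)
        if j < 20:                              # items 1-20: Rasch
            alpha, gamma = 1.0, 0.0
        elif j < 40:                            # items 21-40: 2PL
            alpha, gamma = rng.gauss(1.25, 1.0), 0.0
        else:                                   # items 41-60: 3PL
            alpha, gamma = rng.gauss(1.25, 1.0), rng.gauss(0.20, 0.05)
        items.append((beta, alpha, gamma))
    data = [[1 if rng.random() < p_correct(t, b, a, g) else 0
             for (b, a, g) in items] for t in thetas]
    return data, items
```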

(a)

A Rasch, a 2PL, and a 3PL model were then fit to both data sets using the ltm package for R. For the 500-case data set, results of likelihood ratio tests (LRT) are presented in Table 1 and Table 2 and indicate that the 2PL provides better data fit than the Rasch model. This is to be expected since two-thirds of the items contained freely varying discrimination and difficulty parameters. However, the 3PL does not significantly fit the data better than the 2PL. In fact, the fit of the two models is almost exactly the same. Since the 2PL is more parsimonious than the 3PL, the 2PL was retained as the best-fitting model.

Table 1

Likelihood Ratio Test Results Comparing Data Fit for the Rasch and the 2PL – 500 examinees

Model / AIC / BIC / log.Lik / LRT / df / p.value
Rasch / 21332.10 / 21589.19 / -10605.05 / --- / --- / ---
2PL / 19974.73 / 20480.49 / -9867.37 / 1475.36 / 59 / <0.001

Table 2

Likelihood Ratio Test Results Comparing Data Fit for the 2PL and the 3PL – 500 examinees

Model / AIC / BIC / log.Lik / LRT / df / p.value
2PL / 19974.73 / 20480.49 / -9867.37 / --- / --- / ---
3PL / 20072.28 / 20830.91 / -9856.14 / 22.45 / 60 / 1
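As a check on the tables above, the information criteria and LRT statistics can be reproduced from the reported log-likelihoods alone. The parameter counts of 61 for the Rasch model (60 difficulties plus one common discrimination) and 120 for the 2PL are implied by the reported AIC values; a Python sketch, with function names of my own:

```python
import math

def aic(loglik, k):          # k = number of estimated parameters
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):       # n = number of examinees
    return -2.0 * loglik + k * math.log(n)

def lrt(loglik_restricted, loglik_full):
    # likelihood ratio statistic, on (k_full - k_restricted) df
    return 2.0 * (loglik_full - loglik_restricted)

# Values from Table 1 (n = 500): the LRT of 1475.36 on 120 - 61 = 59 df
# is recovered directly from the two log-likelihoods.
ll_rasch, ll_2pl = -10605.05, -9867.37
chi_sq = lrt(ll_rasch, ll_2pl)
```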

For the 5000-case data set, results are presented in Table 3 and Table 4. Examination of the likelihood ratio tests (LRT) reveals that, as with the 500-case data set, the 2PL proves to be the best-fitting and most parsimonious model. It fits the data significantly better than the Rasch model, and the 3PL does not significantly improve data fit over the 2PL.

Table 3

Likelihood Ratio Test Results Comparing Data Fit for the Rasch and the 2PL – 5000 examinees

Model / AIC / BIC / log.Lik / LRT / df / p.value
Rasch / 241123.6 / 241521.1 / -120500.8 / --- / --- / ---
2PL / 230967.8 / 231749.9 / -115363.9 / 10273.8 / 59 / <0.001

Table 4

Likelihood Ratio Test Results Comparing Data Fit for the 2PL and the 3PL – 5000 examinees

Model / AIC / BIC / log.Lik / LRT / df / p.value
2PL / 230967.8 / 231749.9 / -115363.9 / --- / --- / ---
3PL / 231077.2 / 232250.3 / -115358.6 / 10.62 / 60 / 1

(b)

500 cases – 2PL

For the 500-case data set, the instrument clearly appears to function poorly. Examination of the ICCs reveals that most of the items discriminate very poorly. The plot of the IICs tells the same story: the test is dominated by a few items that are highly discriminating and provide a great deal of information, while the rest of the test does not come close to the precision of those few items. Plotting the IICs individually reveals that items 23, 25, 27, 29, 35, 36, and 40 are the most highly discriminating items, with α parameters of 3.40, 2.04, 1.87, 3.04, 1.35, 3.08, and 1.68, respectively. The TIC reflects the dominance of these items as well: the test provides very precise information at only a few trait levels and fails to provide much meaningful information over the rest.

It is also important to note that the items that appear to perform best (according to the ICCs) are those that were generated using the 2PL model. Items generated using the Rasch and, especially, the 3PL model do not perform well when the 2PL model is fit to the data (i.e., an item’s parameters work best when the model that generated the data matches the model fit to the data).

5000 cases – 2PL

The plots for the 5000-case data set reveal a much different picture than those for the 500-case data set. Specifically, the items as a whole provide a much broader range of information about trait level, as can be seen from examining both the IICs and the TIC. The ICCs do show, however, that several items that do not discriminate well are intermingled with a set of items that do. The items whose response functions are most informative about trait level are those generated under the same model as was fit to the data (i.e., the 2PL).

(c-d)

For sections (c) and (d), item- and person-fit indices were calculated in R. An alpha level of α = .05 was used as the cutoff for indicating misfit.

500 cases

Item Fit

Examining the item-fit statistics for the 500-case data set when fit to the Rasch model indicated that five items (i.e., 8, 25, 27, 36, and 40) fit the model poorly. When the 2PL was fit to the data, the item-fit statistics indicated that only one item (i.e., item 40) fit poorly. When the 3PL was fit to the data, the item-fit statistics indicated that three items (i.e., items 23, 27, and 40) fit poorly.

Person Fit

When the Rasch model was fit to the data, 54 individuals were found to not fit the model well. When the 2PL model was fit to the data, 37 individuals were found to not fit the model well. When the 3PL model was fit to the data, 41 individuals were found to not fit the model well.

Summary

Because the 2PL was the model found to best fit the data, it is not surprising that the 2PL model also produces the fewest poorly fitting items. It is interesting to note that, across the three models, none of items 41 – 60 (items generated using the 3PL model) suggested misfit, and only one of items 1 – 20 (items generated using the Rasch model) suggested misfit. The absence of any misfitting items in the 41 – 60 set suggests that more complex item characteristic functions provide more mathematical flexibility and therefore less chance of detecting problems with the items. That is, no matter what model is fit to the data, items generated using a 3PL model are designed to be more flexible and able to accommodate more complex response patterns.

The same pattern was found for person fit. The numbers of persons found to be misfitting when the Rasch, 2PL, and 3PL models were fit to the data were 54, 37, and 41, respectively. The 2PL and 3PL models were relatively equivalent regarding the degree of person misfit; however, they were much better than the Rasch model (which was the most poorly fitting model).

In summary, the item- and person-fit statistics support the picture provided by the relative fit indices (-2LogLikelihoods) and LRTs. In addition, the item- and person-fit statistics give more detailed information as to where problems in model fit may be found.

5000 cases

Item Fit

When the Rasch model was fit to the data, three items (i.e., 27, 43, and 54) failed to fit the model well. The item-fit statistics indicated that, when the 2PL model was fit to the data, no items fit poorly. The same results were obtained from fitting the 3PL model to the data—that is, no items demonstrated poor fit.

Person Fit

Due to R’s max.print limitation, I could not report the person-fit statistics for the full data set of 5000 examinees. Fewer than half of the cases were displayed on the console, which does not give adequate information for statements about person fit. However, for completeness, I have provided the information I was able to obtain.

When the Rasch model was fit to the data, of the 1285 cases available, 62 (4.82%) of them were found to demonstrate poor person-fit. When the 2PL was fit to the data, of the 1587 cases available, 86 (5.41%) demonstrated poor person-fit. When the 3PL was fit to the data, 82 (5.12%) of the 1587 available cases demonstrated poor person-fit. (Note: The actual cases available in R were not the same when the Rasch model was examined for person-fit and when the 2PL and 3PL models were examined.)

Summary

The results from both the item- and person-fit analyses did not confirm my expectations. That is, I expected that the 2PL and 3PL models would provide better item- and person-fit than the Rasch, given that a majority of the items were generated using at least two parameters (i.e., difficulty and discrimination). While, across models, there was very little difference in the percentage of poorly fitting examinees, it is difficult to tell how accurate these frequency estimates are, given that they are based on only samples of all examinees in the data set.

Question 2

(a)

The following two tables present the item parameter point estimates generated when a Rasch, 2PL, 3PL, 2-dimensional non-compensatory IRT model (MIRT), and Mokken Monotone Homogeneity Model (MHM) were fit to two different data sets (one with mild multidimensionality—2 dimensions—and one with strong multidimensionality—5 dimensions).

Parameter estimates from different models have been placed side-by-side in order to make comparisons easier. Standard errors for the parameters from the different models have been placed side-by-side below the parameter estimates for the same reason. (Cells containing “---” indicate that the value could not be calculated.)
Item parameter estimates and standard errors for the data set generated by two latent dimensions.

Beta (Difficulty) / Alpha (Discrimination)
ITEM / Rasch / 2PL / 3PL / MIRT (intercept) / Rasch / 2PL / 3PL / MIRT (a1) / MIRT (a2)
1 / 1.50 / 0.48 / 0.49 / -1.32 / 0.46 / 2.79 / 2.73 / 2.7782 / -0.1518
2 / -0.52 / -9.22 / -7.17 / 0.23 / 0.46 / 0.02 / 0.02 / 0.0353 / 0.1833
3 / -1.35 / -5.70 / -5.49 / 0.59 / 0.46 / 0.10 / 0.11 / 0.1061 / 0.0381
4 / 0.35 / 0.15 / 0.16 / -0.21 / 0.46 / 1.36 / 1.34 / 1.3785 / 0.1224
5 / -0.86 / -0.47 / -0.32 / 0.64 / 0.46 / 0.95 / 1.01 / 1.3144 / -1.5801
6 / -0.12 / -0.04 / -0.03 / 0.10 / 0.46 / 2.16 / 2.14 / 2.1719 / -0.0039
7 / 2.61 / 0.79 / 0.74 / -2.64 / 0.46 / 3.40 / 10.94 / 3.3336 / 0.1042
8 / -2.04 / -0.91 / -0.91 / 1.20 / 0.46 / 1.29 / 1.29 / 1.2924 / -0.389
9 / -1.95 / 1.85 / 1.90 / 0.91 / 0.46 / -0.49 / -0.47 / -0.5021 / -0.1769
10 / 3.99 / -1.99 / -2.00 / -2.17 / 0.46 / -1.06 / -1.05 / -1.1053 / -0.251
11 / 1.31 / 0.53 / 0.54 / -0.82 / 0.46 / 1.52 / 1.50 / 1.5642 / 0.224
12 / -1.14 / 0.59 / 0.08 / 0.61 / 0.46 / -1.04 / -1.40 / -1.0365 / 0.2648
13 / -3.28 / -4.24 / -0.68 / 1.53 / 0.46 / 0.35 / 0.50 / 0.3784 / 0.3371
14 / 1.69 / -2.02 / -2.03 / -0.77 / 0.46 / -0.38 / -0.38 / -0.3728 / 0.1674
15 / 1.95 / 0.93 / 0.94 / -1.08 / 0.46 / 1.16 / 1.17 / 1.1717 / 0.0811
16 / 2.83 / -1.42 / -1.41 / -1.52 / 0.46 / -1.07 / -1.08 / -1.0659 / 0.0739
17 / -4.58 / -1.66 / -1.63 / 3.07 / 0.46 / 1.82 / 1.87 / 1.8667 / 0.1404
18 / -1.28 / -0.61 / -0.60 / 0.71 / 0.46 / 1.15 / 1.16 / 1.1472 / -0.0896
19 / -1.76 / -2.50 / 0.99 / 0.82 / 0.46 / 0.32 / 0.64 / 0.3075 / -0.3807
20 / 0.12 / 0.17 / 0.36 / -0.05 / 0.46 / 0.33 / 0.33 / 0.3177 / -0.0892
Standard Error / Standard Error
Rasch / 2PL / 3PL / MIRT (intercept) / Rasch / 2PL / 3PL / MIRT (a1) / MIRT (a2)
1 / 0.17 / 0.05 / 0.05 / 0.1508 / 0.02 / 0.27 / 0.26 / 0.26 / ---
2 / 0.15 / 26.48 / 235.58 / 0.0643 / 0.02 / 0.07 / 0.09 / --- / 0.14
3 / 0.16 / 4.05 / 4.74 / 0.0663 / 0.02 / 0.07 / 0.07 / 0.07 / 0.12
4 / 0.15 / 0.06 / 0.04 / 0.0825 / 0.02 / 0.12 / 0.10 / 0.10 / ---
5 / 0.15 / 0.09 / 0.40 / 0.2138 / 0.02 / 0.10 / 0.21 / --- / 0.83
6 / 0.14 / 0.05 / 0.05 / 0.1008 / 0.02 / 0.19 / 0.18 / 0.19 / ---
7 / 0.21 / 0.05 / --- / 0.1801 / 0.02 / 0.38 / --- / 0.32 / ---
8 / 0.19 / 0.09 / 0.09 / 0.1027 / 0.02 / 0.12 / 0.12 / --- / ---
9 / 0.18 / 0.32 / 0.35 / 0.0749 / 0.02 / 0.08 / 0.08 / --- / ---
10 / 0.28 / 0.20 / 0.20 / 0.135 / 0.02 / 0.13 / 0.13 / --- / ---
11 / 0.16 / 0.07 / 0.07 / 0.0984 / 0.02 / 0.13 / 0.13 / --- / ---
12 / 0.16 / 0.09 / 0.26 / 0.0811 / 0.02 / 0.10 / 0.31 / --- / ---
13 / 0.24 / 1.09 / 6.73 / 0.0929 / 0.02 / 0.09 / 0.67 / --- / ---
14 / 0.17 / 0.44 / 0.45 / 0.0708 / 0.02 / 0.08 / 0.08 / --- / ---
15 / 0.18 / 0.10 / 0.09 / 0.0924 / 0.02 / 0.11 / 0.11 / 0.10 / ---
16 / 0.22 / 0.14 / 0.14 / 0.1 / 0.02 / 0.12 / 0.12 / 0.12 / ---
17 / 0.31 / 0.12 / 0.12 / 0.2324 / 0.02 / 0.21 / 0.22 / 0.20 / ---
18 / 0.16 / 0.08 / 0.08 / 0.0825 / 0.02 / 0.11 / 0.11 / 0.11 / ---
19 / 0.18 / 0.63 / 1.21 / 0.0743 / 0.02 / 0.08 / 0.45 / --- / ---
20 / 0.14 / 0.20 / 4.52 / 0.0648 / 0.02 / 0.07 / 0.30 / 0.06 / ---

Item parameter estimates and standard errors for the data set generated by five latent dimensions

Beta (Difficulty) / Alpha (Discrimination)
ITEM / Rasch / 2PL / 3PL / MIRT (intercept) / Rasch / 2PL / 3PL / MIRT (a1) / MIRT (a2)
1 / 0.00 / 0.02 / 0.03 / -0.01 / 0.48 / 1.93 / 1.93 / 1.84 / -0.58
2 / 1.74 / 0.70 / 0.71 / -1.23 / 0.48 / 1.79 / 1.80 / 1.71 / -0.59
3 / -1.32 / 0.48 / 0.49 / 1.13 / 0.48 / -2.44 / -2.44 / -2.36 / 0.63
4 / -0.92 / 0.53 / 0.35 / 0.51 / 0.48 / -0.99 / -1.07 / -0.97 / 0.21
5 / -2.47 / -0.79 / -0.71 / 8.90 / 0.48 / 2.91 / 3.16 / 9.97 / -9.61
6 / 0.21 / 0.15 / 0.15 / -0.11 / 0.48 / 0.87 / 0.87 / 0.88 / -0.10
7 / 1.38 / 0.61 / 0.67 / -0.91 / 0.48 / 1.51 / 1.72 / 1.58 / -0.13
8 / -0.23 / 0.25 / 0.15 / 0.11 / 0.48 / -0.48 / -0.49 / -0.50 / 0.06
9 / 1.58 / 0.60 / 0.60 / -1.26 / 0.48 / 2.13 / 2.14 / 2.17 / -0.35
10 / -0.03 / 0.01 / 0.02 / 0.03 / 0.48 / 2.98 / 2.99 / 2.83 / -1.13
11 / 3.08 / 1.58 / 1.54 / -1.75 / 0.48 / 1.11 / 1.47 / 1.13 / -0.14
12 / 2.33 / -17.85 / -17.46 / -1.07 / 0.48 / -0.06 / -0.09 / -0.05 / 0.03
13 / -0.07 / 0.06 / 0.06 / 0.05 / 0.48 / -1.11 / -1.11 / -1.13 / 0.16
14 / -0.09 / -0.04 / -0.04 / 0.05 / 0.48 / 0.78 / 0.78 / 0.85 / 0.05
15 / -1.59 / -0.50 / -0.49 / 1.66 / 0.48 / 3.08 / 3.05 / 3.21 / -0.60
16 / 0.94 / 0.33 / 0.34 / -1.02 / 0.48 / 3.16 / 3.19 / 3.05 / -1.38
17 / -0.34 / 0.16 / 0.16 / 0.24 / 0.48 / -1.73 / -1.73 / -1.69 / 0.41
18 / 1.26 / 0.72 / 0.72 / -0.69 / 0.48 / 0.99 / 1.00 / 0.93 / -0.35
19 / -0.57 / -0.20 / -0.08 / 0.41 / 0.48 / 1.87 / 2.13 / 1.90 / -0.33
20 / -2.65 / -2.93 / -2.87 / 1.27 / 0.48 / 0.43 / 0.44 / 0.46 / 0.04
Standard Error / Standard Error
Rasch / 2PL / 3PL / MIRT (intercept) / Rasch / 2PL / 3PL / MIRT (a1) / MIRT (a2)
1 / 0.14 / 0.05 / 0.05 / 0.10 / 0.02 / 0.15 / 0.15 / 0.16 / 0.27
2 / 0.17 / 0.06 / 0.06 / 0.12 / 0.02 / 0.15 / 0.15 / 0.16 / 0.26
3 / 0.16 / 0.05 / 0.05 / 0.13 / 0.02 / 0.19 / 0.19 / 0.22 / 0.34
4 / 0.15 / 0.09 / 0.39 / 0.08 / 0.02 / 0.09 / 0.22 / 0.11 / 0.18
5 / 0.19 / 0.06 / 0.11 / 13.36 / 0.02 / 0.28 / 0.47 / 16.68 / 16.55
6 / 0.14 / 0.09 / 0.09 / 0.07 / 0.02 / 0.09 / 0.09 / 0.10 / 0.18
7 / 0.16 / 0.07 / 0.08 / 0.10 / 0.02 / 0.12 / 0.24 / 0.16 / 0.28
8 / 0.14 / 0.14 / 1.46 / 0.07 / 0.02 / 0.07 / 0.19 / 0.08 / 0.14
9 / 0.16 / 0.06 / 0.06 / 0.13 / 0.02 / 0.17 / 0.17 / 0.22 / 0.32
10 / 0.14 / 0.05 / 0.05 / 0.14 / 0.02 / 0.24 / 0.24 / 0.26 / 0.39
11 / 0.22 / 0.14 / 0.12 / 0.11 / 0.02 / 0.12 / 0.27 / 0.14 / 0.22
12 / 0.19 / 23.14 / --- / 0.07 / 0.02 / 0.08 / --- / 0.09 / 0.14
13 / 0.14 / 0.07 / 0.07 / 0.08 / 0.02 / 0.10 / 0.10 / 0.11 / 0.19
14 / 0.14 / 0.09 / 0.09 / 0.07 / 0.02 / 0.08 / 0.08 / 0.10 / 0.17
15 / 0.16 / 0.05 / 0.05 / 0.19 / 0.02 / 0.26 / 0.26 / 0.32 / 0.40
16 / 0.15 / 0.05 / 0.04 / 0.17 / 0.02 / 0.27 / 0.27 / 0.32 / 0.50
17 / 0.14 / 0.06 / 0.05 / 0.10 / 0.02 / 0.13 / 0.13 / 0.15 / 0.23
18 / 0.15 / 0.09 / 0.09 / 0.08 / 0.02 / 0.10 / 0.10 / 0.11 / 0.19
19 / 0.14 / 0.05 / 0.11 / 0.10 / 0.02 / 0.15 / 0.30 / 0.17 / 0.27
20 / 0.20 / 0.57 / 0.60 / 0.08 / 0.02 / 0.09 / 0.09 / 0.10 / 0.16

Relative fit indices for the data set with two latent dimensions.

Fit Index
Model / log.Lik / AIC / BIC
Rasch / -11997.45 / 24036.91 / 24139.95
2PL / -11039.32 / 22158.65 / 22354.92
3PL / -11035.53 / 22191.06 / 22485.47
MIRT / -11020.94 / 22161.89 / 22456.29
H-coefficient
Mokken / 0.065

Relative fit indices for the data set with five latent dimensions.

Fit Index
Model / log.Lik / AIC / BIC
Rasch / -12713.57 / 25469.15 / 25572.19
2PL / -10689.97 / 21459.95 / 21656.22
3PL / -10686.83 / 21493.67 / 21788.07
MIRT / -10673.39 / 21466.77 / 21761.18
H-coefficient
Mokken / 0.07

(b)

It is difficult to see any particular pattern in the item parameter estimates that will uncover the sensitivity of the unidimensional models to violations of unidimensionality. I first suspected that perhaps, when strong multidimensionality was present, item parameter standard errors for unidimensional models would be larger than when multidimensionality was either weak or absent. However, examination of the standard errors does not support this conclusion.

I examined the item parameters in several other ways and noticed only this: when multidimensionality is weak, the discrimination parameters for the 2PL and 3PL are quite close to the dimension-1 discrimination parameter for the MIRT, whereas when multidimensionality is strong, there is greater variation between the unidimensional discrimination parameters and the MIRT’s dimension-1 discrimination parameter. No matter how many dimensions were present, the item parameters themselves did not provide sufficient information to detect multidimensionality. When multidimensionality was weak, the multidimensional model’s parameter estimates were quite similar to the unidimensional models’ estimates across almost all items. When multidimensionality was strong, the unidimensional models could not detect it, but the multidimensional model could; this can be seen in the differences between the parameter estimates for the unidimensional models and those for the MIRT.

I also included the relative fit indices for the respective models. One can see that, while the fit of the unidimensional models levels off at the 3PL (i.e., changes in likelihood get smaller as one fits the Rasch, then the 2PL, and then the 3PL), allowing the model to be two-dimensional produced a larger increase in model fit. I conducted a likelihood ratio test between the 2PL and the MIRT model (as I understand it, these are nested models, while the 3PL and the MIRT are not). Results for the mildly multidimensional data set indicated that the MIRT model significantly improved data fit, χ2(20) = 36.76, p < .05. Results for the strongly multidimensional data set were the same: the MIRT model significantly improved data fit over the 2PL model, χ2(20) = 33.17, p < .05. While the item parameters themselves do not seem to uncover information about the dimensionality of the data, relative fit tests can indicate whether a unidimensional or a multidimensional model better fits the data.
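These χ² tests can be checked by hand: for even degrees of freedom the chi-square survival function has a closed form (a Poisson tail sum), so no statistical library is needed. A sketch of my own:

```python
import math

def chi2_sf_even_df(x, df):
    # P(X > x) for a chi-square variable with EVEN df, using the
    # Erlang/Poisson identity: P(chi2_2m > x) = P(Poisson(x/2) <= m - 1)
    assert df % 2 == 0 and df > 0
    lam = x / 2.0
    term, total = math.exp(-lam), 0.0
    for k in range(df // 2):
        total += term
        term *= lam / (k + 1)   # next Poisson term: e^-lam * lam^k / k!
    return total
```

Both reported statistics, 36.76 and 33.17 on 20 df, fall beyond the .05 critical value of about 31.41, consistent with the conclusion above.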

In addition, in both data sets, the Mokken H-coefficient was very low, indicating that the 20 items did not fit the MHM. It seems that the MHM was much more sensitive to the violation of unidimensionality than the parametric models.
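For reference, the overall H coefficient is the ratio of the summed inter-item covariances to their maxima given the item margins; values near zero, as obtained here, mean the items barely scale together, while perfect Guttman data yields H = 1. A sketch of the computation (my own, not the mokken package's code):

```python
def mokken_H(data):
    # data: list of examinee response vectors (0/1)
    n = len(data)
    J = len(data[0])
    p = [sum(row[j] for row in data) / n for j in range(J)]  # item proportions
    num = den = 0.0
    for i in range(J):
        for j in range(i + 1, J):
            pij = sum(row[i] * row[j] for row in data) / n
            cov = pij - p[i] * p[j]
            # maximum covariance attainable given the two margins
            cov_max = min(p[i], p[j]) - p[i] * p[j]
            num += cov
            den += cov_max
    return num / den
```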

Question 3

(a)

Dealing With Unknowns

In Joint Maximum Likelihood (JML), the problem of unknown person parameters is taken care of by beginning with a provisional set of parameters in order to estimate an initial set of item parameters. (These provisional person parameters are created by first setting starting values for the item parameters.) Once the first set of item parameters are estimated, the person parameters are then improved. Then the item parameters are improved, and the process continues over and over again until a convergence criterion is met (i.e., successive iterations produce very little change in the item parameters).

In Conditional Maximum Likelihood (CML), there is no problem of unknown person parameters because the likelihood is conditioned on each examinee’s total score, which is a sufficient statistic for θ; the person parameters therefore drop out of the likelihood entirely. Item parameters are estimated from this conditional likelihood, and once acceptable item parameters are found, trait levels (θ) are estimated (Embretson & Reise, 2000).
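Concretely, conditioning works because the probability of a response pattern given its total score does not involve θ; the denominators are the elementary symmetric functions of εᵢ = exp(−βᵢ). A sketch of this standard Rasch machinery (function names mine):

```python
import math

def elementary_symmetric(eps):
    # gamma_r for r = 0..n, built up one item at a time via the recursion
    # gamma_r(new) = gamma_r(old) + eps * gamma_{r-1}(old)
    g = [1.0]
    for e in eps:
        new = [0.0] * (len(g) + 1)
        new[0] = g[0]
        for r in range(1, len(g)):
            new[r] = g[r] + e * g[r - 1]
        new[len(g)] = e * g[-1]
        g = new
    return g

def conditional_prob(pattern, eps):
    # P(response pattern | total score): theta has cancelled out entirely
    g = elementary_symmetric(eps)
    r = sum(pattern)
    num = math.prod(e for x, e in zip(pattern, eps) if x == 1)
    return num / g[r]
```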

In Marginal Maximum Likelihood (MML), the problem of unknown person parameters is handled by specifying a population distribution of person parameters (e.g., the researcher may believe the trait assessed by the test is normally distributed in the population). This belief about the distribution of θ is incorporated into the estimation procedure by dividing the trait distribution into discrete segments, finding the probability of randomly selecting a score from each segment, and then, for each item, finding the expected number of people answering it correctly (Embretson & Reise, 2000). This process enables the calculation of a marginal data likelihood, and the parameters are then iteratively adjusted to improve that likelihood until changes in consecutively estimated parameters are negligible. This procedure was formalized by Bock and Aitkin (1981) as the EM (expectation-maximization) algorithm.
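The expectation step just described can be sketched as follows for a Rasch model: for each examinee, posterior weights over a grid of θ nodes are computed, and expected counts of examinees and of correct responses are accumulated at each node (the maximization step would then update the β’s from these expected counts). The grid and names are my own:

```python
import math

def e_step(data, betas, nodes, weights):
    # One EM expectation step for a Rasch model under MML.
    # nodes/weights: a discrete approximation to the assumed theta distribution.
    # Returns nk (expected examinees per node) and rjk (expected corrects
    # per item per node).
    def p(theta, beta):
        return 1.0 / (1.0 + math.exp(-(theta - beta)))
    nk = [0.0] * len(nodes)
    rjk = [[0.0] * len(nodes) for _ in betas]
    for resp in data:
        # prior weight times pattern likelihood at each node
        post = []
        for t, w in zip(nodes, weights):
            like = 1.0
            for x, b in zip(resp, betas):
                pj = p(t, b)
                like *= pj if x == 1 else (1.0 - pj)
            post.append(w * like)
        total = sum(post)
        post = [q / total for q in post]   # normalize to a posterior
        for k, q in enumerate(post):
            nk[k] += q
            for j, x in enumerate(resp):
                if x == 1:
                    rjk[j][k] += q
    return nk, rjk
```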

Complexity and Limitations

JML is quite simple in terms of algorithms. In addition, it can be used to estimate many models, including the 2PL and 3PL (Embretson & Reise, 2000). However, JML parameter estimates are biased and inconsistent, leading to questionable interpretations of model parameters and standard errors (Embretson & Reise, 2000). In addition, JML cannot handle perfect scores (i.e., examinees who responded either correctly or incorrectly to ALL items).

MML is rather complex in terms of algorithms; however, advances in computational power have made MML estimation practical. MML can handle all types of IRT models, it can provide estimates for perfect (i.e., all right or all wrong) scores, and its estimates are consistent. One disadvantage of MML is that a distribution of θ is assumed; therefore, the parameter estimates are only as good as the accuracy of the assumed distribution. However, Embretson and Reise (2000) suggest that “item parameter estimates appear to be not strongly influenced by moderate departures from normality, so inappropriately specified distributions may not be very important” (p. 214).

CML has the advantage that no trait-level distribution is assumed. In fact, because a person’s total score is a sufficient statistic for trait level, unknown trait levels pose no mathematical problem. However, this limits CML’s application to Rasch-family models (i.e., models for which the total score is a sufficient statistic). In addition, CML cannot provide estimates for perfect scores, and estimation often becomes difficult, if not impossible, for tests with many items.

(b)

The Bayesian approach to parameter estimation takes advantage of the fact that, for the most part, psychometricians are not completely in the dark about what distribution of values to expect when models are estimated. For example, it is relatively safe to assume that θ will be approximately normally distributed, and it makes sense that difficulty parameters will have a certain mean (high for a hard test, low for an easy test) and standard deviation, and so on. In the Bayesian framework, this expectation, expressed as a prior distribution, can greatly facilitate model estimation. Starting with a prior distribution for each parameter I wish to estimate, I can combine those priors with the data likelihood to obtain a posterior distribution of the model parameters.

This approach is somewhat similar to MML as discussed above. Recall that in MML, one begins with a belief about the distribution of θ. Making this assumption allows the model to be estimated. In a Bayesian approach, one begins with a belief about the distribution of all model parameters (i.e., person and item parameters) in order to estimate the model.
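As a concrete illustration of the prior-to-posterior step, an EAP (expected a posteriori) trait estimate averages θ over its posterior under a standard-normal prior. This sketch assumes 2PL items and a simple evenly spaced grid; the function name is mine:

```python
import math

def eap_theta(responses, alphas, betas, n_nodes=61):
    # EAP estimate: posterior mean of theta over a grid on [-4, 4],
    # with a standard-normal prior (unnormalized density is sufficient,
    # since the normalizing constant cancels in the ratio)
    nodes = [-4.0 + 8.0 * k / (n_nodes - 1) for k in range(n_nodes)]
    prior = [math.exp(-t * t / 2.0) for t in nodes]
    post = []
    for t, pr in zip(nodes, prior):
        like = 1.0
        for x, a, b in zip(responses, alphas, betas):
            p = 1.0 / (1.0 + math.exp(-a * (t - b)))
            like *= p if x == 1 else (1.0 - p)
        post.append(pr * like)       # prior x likelihood at this node
    z = sum(post)
    return sum(t * q for t, q in zip(nodes, post)) / z
```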