Rolf Steyer & Ivailo Partchev: Latent state-trait theory in computerized adaptive testing

Latent State-Trait Theory in Computerized Adaptive Testing

Rolf Steyer[1] and Ivailo Partchev

Institute of Psychology, Friedrich Schiller University Jena, Germany

Computerized Adaptive Testing (CAT) has been implemented for a number of tests in personnel selection and placement of the German armed forces. However, the underlying item response models ignore the fact that psychological testing does not take place in a situational vacuum. We all know that our physical and mental performances are subject to fluctuations within and between days. Hence, tests at a single occasion of measurement are not only affected by measurement error but also by situation effects and/or interaction between persons and situations. Latent State-Trait Theory (LST theory; see, e.g., Steyer, Schmitt & Eid, 1999) provides a theoretical framework within which models have been developed that allow quantifying both (a) measurement errors and (b) situation and/or interaction effects, provided that there are several measures of the same construct on at least two occasions of measurement. We give a short outline of the theory and show how it can be applied to measures resulting from computerized adaptive testing. A simulation study is presented mimicking the CAT procedure of the German armed forces. The goal of the simulation is to study the effects of (a) the precision in the estimates of the person parameters as a stop criterion and (b) the person sample size on the estimates of the trait variance and variance of the situation and/or interaction effects. Surprisingly, and in contrast to results from a previous simulation study with fixed (nonadaptive) item presentation, the LST model presented only works satisfactorily for high precision in the estimates of the person parameters and a large (person) sample size.

Personnel testing aims at assessing traits allowing for the prediction of behavior in concrete situations. However, behavior in concrete situations is not only due to traits but also to the bio-psycho-social situation in which it occurs as well as to the interaction between persons and these situations. Taken together, the situation and the interaction effects are referred to as situational specificity.

Situational specificity not only affects behavior in ordinary professional activities but also determines behavior in psychological tests to some degree. In principle, there are two ways of dealing with this fact. First, one might improve the prediction from testing to real-life behavior by keeping the characteristics of the situation constant. Second, one may consider situational specificity as systematic within but random between occasions, estimate its practical significance via a proportion of variance, and control for it by appropriate statistical models. This can be done in much the same way that ordinary measurement error is estimated and controlled for in psychological assessment. This second route is taken in latent state-trait theory.

Originally, latent state-trait theory (LST theory) (Steyer, Ferring, & Schmitt, 1992; Steyer, Schmitt, & Eid, 1999) dealt with continuous observables and was developed as a generalization of classical test theory (CTT). The basic idea underlying LST modeling is to treat situation and/or interaction effects just as we treat measurement error in CTT models, in which we may estimate the variance of measurement errors via multiple measures (indicators) of an identical latent variable. It is therefore a natural extension of CTT to estimate the amount of situation and/or interaction effects via repeated measures of an identical, time-invariant latent trait, the measures being separated by an appropriate time span.

Although LST models were initially developed for continuous variables, extensions to probit models have been presented by Eid (1995; Eid & Hoffmann, 1998) and to the logistic partial credit model by Steyer and Partchev (2000).

In the present paper, we will focus on extending LST theory to computerized adaptive testing. We give a short outline of the theory and show how it can be applied to estimates of person parameters resulting from computerized adaptive testing. A simulation study is presented mimicking the CAT procedure of the German armed forces. The goal of the simulation is to study the effects of (a) the precision in the estimates of the person parameters as a stop criterion and (b) the person sample size on the estimates of the trait variance and the variance of the situation and/or interaction effects.

  1. The Singletrait-Multistate Model for Person-Parameter Estimates

In this paper, we will treat the person-parameter estimate θ̂_ut (for person u at time t) as a value of an ordinary test score variable θ̂_t. In contrast to ordinary applications of classical test theory, the person-parameter estimate has a person-specific standard error SE(θ̂_ut). Squaring this standard error yields the person-specific measurement error variance SE(θ̂_ut)². Taking the expectation of these person-specific variances over the sample of persons gives an estimate of the error variance Var(ε_t) of the test score variable θ̂_t. In this way we may determine the proportions of measurement error variance and true variance for a person-parameter estimate resulting from any CAT procedure [see Eqs. (1) to (4) in Table 1].
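As a minimal sketch (all names, sample size, and parameter values below are illustrative assumptions, not the procedure actually used in the paper), the decomposition in Eqs. (1) to (4) can be computed directly from the estimates and their standard errors:

```python
import numpy as np

# Illustrative sketch of Eqs. (1)-(4): split the variance of CAT person-parameter
# estimates into error variance and true variance. All names and values are
# assumptions for the example.
rng = np.random.default_rng(0)
n = 10_000
theta = rng.normal(0.0, 1.0, n)                     # true occasion-specific abilities
se = np.full(n, 0.4)                                # person-specific standard errors
theta_hat = theta + se * rng.normal(0.0, 1.0, n)    # Eq. (1): estimate = truth + error

var_error = np.mean(se ** 2)                        # Eq. (3): mean squared standard error
var_true = np.var(theta_hat, ddof=1) - var_error    # Eq. (4): total minus error variance
```

With these illustrative numbers, var_error is .16 and var_true comes out close to the simulated true-score variance of 1.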

Table 1. CTT model for the person-parameter estimator

θ̂_t = θ_t + ε_t (1)

Cov(θ_t, ε_t) = 0 (2)

Var(ε_t) = E[Var(θ̂_t | U)] (3)

Var(θ_t) = Var(θ̂_t) − Var(ε_t) (4)

Table 2. LST model for the person-parameter estimator

θ̂_t = ξ + ζ_t + ε_t (5)

Cov(ζ_t, ζ_t′) = 0, t ≠ t′ (6)

Cov(ξ, ζ_t) = 0 (7)

Cov(ξ, ε_t) = 0 (8)

Cov(ε_t, ε_t′) = 0, t ≠ t′ (9)

Cov(ζ_t, ε_t′) = 0 (10)

Var(ξ) = Cov(θ̂_t, θ̂_t′), t ≠ t′ (11)

Var(ζ_t) = Var(θ̂_t) − Var(ξ) − Var(ε_t) (12)

According to LST theory, the person parameter θ_t has to be decomposed further into two parts: the trait ξ and the effect ζ_t of the situation and/or of the interaction between the person and the situation in which the assessment is made [see Eq. (5) in Table 2]. Assuming a trait that is invariant over at least two occasions of measurement, together with some zero correlations [see Eqs. (6) to (10) in Table 2], allows us to estimate the trait variance and the variance due to situation and/or interaction effects [see Eqs. (11) and (12) in Table 2].
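A small numeric sketch of this logic: with an invariant trait and the zero correlations of Eqs. (6) to (10), the covariance of the estimates from any two occasions recovers the trait variance. All simulated parameter values below are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

# Sketch of Eqs. (5)-(12): the trait xi is shared across occasions, while state
# residuals zeta_t and errors eps_t are independent across occasions, so the
# cross-occasion covariance of the estimates equals Var(xi).
rng = np.random.default_rng(1)
n = 20_000
xi = rng.normal(0.0, 1.0, n)                         # trait, Var(xi) = 1
theta_hat = np.stack([
    xi
    + rng.normal(0.0, np.sqrt(0.30), n)              # state residual zeta_t, Var = .30
    + rng.normal(0.0, 0.4, n)                        # measurement error eps_t
    for _ in range(3)
])

# Eq. (11): each pairwise covariance estimates Var(xi); average the three pairs
var_xi = np.mean([np.cov(theta_hat[t], theta_hat[s])[0, 1]
                  for t, s in combinations(range(3), 2)])
# Eq. (12): Var(zeta) as the remainder after trait and error variance
var_zeta = np.var(theta_hat, axis=1, ddof=1).mean() - var_xi - 0.4 ** 2
```

Here var_xi recovers the simulated trait variance of 1 and var_zeta the state-residual variance of .30, up to sampling error.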

  2. Simulation Studies

In order to investigate the validity of the procedure outlined above, we conducted two simulation studies. As mentioned before, the goal of the simulation was to study the effects of (a) the precision in the estimates of the person parameters as a stop criterion and (b) the person sample size on the estimates of the trait variance and the variance of the situation and/or interaction effects.

The data in Study 1 were generated according to the following equation:

P(Y_it = 1 | U, S_t) = exp(θ_t − β_i) / [1 + exp(θ_t − β_i)], with θ_t = ξ + ζ_t, (13)

where the values of the probability function P(Y_it = 1 | U, S_t) are the conditional probabilities of solving (Y_it = 1) and not solving (Y_it = 0) item i on occasion t given the person U = u and the situation S_t = s_t. As before, ξ denotes the latent trait variable and ζ_t the latent state residual, the values of which are the effects of the situation and/or the interaction at occasion t. The two latent variables ξ and ζ_t are uncorrelated. Finally, each parameter β_i denotes the difficulty of item i. These difficulties are assumed to be invariant over time. We refer to this model as the one-parameter logistic test model (or Rasch model) with invariant item difficulties.

With this model, we generated data by an adaptive test procedure for a sample size of 10000 persons. For the analyses described below, the first 100, 200, 500, 1000, 2000, 5000, and 10000 persons were selected. We allowed for 43 different item difficulties ranging from −3 to +3, with equal distances between the difficulties. Hence, the most difficult item had difficulty +3, the second most difficult item had difficulty +2.85714, the third most difficult item had difficulty +2.71429, etc.

The values of the latent trait variable ξ were generated via standard normal random numbers. Hence, E(ξ) = 0 and Var(ξ) = 1. Similarly, the values of the latent state residual ζ_t were generated via normally distributed numbers with variance .30, i.e., Var(ζ_t) = .30 for all three occasions t of measurement. Adding the values of ξ and ζ_t for each person yields the value of the occasion-specific ability variable θ_t. These parameters allow us to compute the solving probabilities according to Equation (13), which are then used to generate the manifest responses, 0 or 1, via a random number generator. The value 1 is generated with probability P(Y_it = 1 | U, S_t) and the value 0 with probability 1 − P(Y_it = 1 | U, S_t).
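The data generation just described can be sketched as follows; sample size, seed, and variable names are illustrative, not the exact code used in the study.

```python
import numpy as np

# Hedged sketch of the Study 1 data generation (Eq. 13): Rasch responses with
# time-invariant, equally spaced item difficulties.
rng = np.random.default_rng(42)
n = 1_000
betas = np.linspace(-3.0, 3.0, 43)              # 43 equally spaced item difficulties

xi = rng.normal(0.0, 1.0, n)                    # trait: E = 0, Var = 1
responses = []
for t in range(3):                              # three occasions of measurement
    zeta = rng.normal(0.0, np.sqrt(0.30), n)    # state residual, Var = .30
    theta = xi + zeta                           # occasion-specific ability
    # Eq. (13): Rasch solving probability for every person-item pair
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - betas[None, :])))
    responses.append((rng.random((n, 43)) < p).astype(int))
```

The difficulty spacing is 6/42 ≈ 0.142857, matching the values quoted above.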

The adaptive testing procedure starts by assuming that the person has ability 0. The item presented first is selected by maximizing the information function under a one-dimensional model with occasion-specific ability θ_t. If the first item is solved, the provisional ability is raised by .25, and again the item presented next is selected by maximizing the information function. As soon as an item is not solved, the person parameter is estimated via maximum likelihood, and the next item is again selected using the information function. This process is continued until the stop criterion for the standard error of the person-parameter estimate is reached; these stop criteria can be found in the columns of Table 3. If the first item is not solved, the same procedure is applied, except that the provisional ability is decreased by .25. This adaptive testing procedure was repeated three times (i.e., at three 'occasions') independently of each other for all 10000 persons, i.e., no information from the testing on a previous occasion of measurement was used in the adaptive testing procedure of the current occasion.
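The stepping-and-estimation loop above might be sketched as follows. The item-selection shortcut (under the Rasch model the most informative item is the one whose difficulty is closest to the current ability), the grid-search ML estimator, and all names are our assumptions, not the implementation used by the German armed forces.

```python
import numpy as np

def rasch_p(theta, beta):
    """Rasch solving probability, Eq. (13)."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def ml_estimate(responses, betas):
    """Grid-search ML estimate of ability and its SE from the test information."""
    grid = np.linspace(-6.0, 6.0, 1201)
    b = np.asarray(betas, dtype=float)
    y = np.asarray(responses)
    p = rasch_p(grid[:, None], b[None, :])
    loglik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    theta = grid[np.argmax(loglik)]
    info = (rasch_p(theta, b) * (1 - rasch_p(theta, b))).sum()
    return theta, 1.0 / np.sqrt(info)

def cat_run(true_theta, betas, stop_se, step=0.25, rng=None):
    """One adaptive test: fixed +/- steps until the first reversal, then ML."""
    if rng is None:
        rng = np.random.default_rng(0)
    available = list(range(len(betas)))
    used, responses = [], []
    theta, se, reversal = 0.0, np.inf, False
    while available and se > stop_se:
        i = min(available, key=lambda j: abs(betas[j] - theta))  # most informative item
        available.remove(i)
        y = int(rng.random() < rasch_p(true_theta, betas[i]))
        used.append(i)
        responses.append(y)
        if not reversal and y == responses[0]:
            theta += step if y == 1 else -step   # fixed steps before the first reversal
            continue
        reversal = True
        theta, se = ml_estimate(responses, [betas[j] for j in used])
    if not reversal:  # pathological case: all responses identical, bank exhausted
        theta, se = ml_estimate(responses, [betas[j] for j in used])
    return theta, se, len(used)

# illustrative usage: one simulated examinee with true ability 1.2
theta_hat, se, n_items = cat_run(1.2, np.linspace(-3.0, 3.0, 43), stop_se=0.494)
```

Note that before the first reversal the ML estimate does not exist (all responses are identical), which is exactly why the procedure uses fixed steps at the start.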

This procedure yields, for each person, three person-parameter estimates and their standard errors. These estimates and their standard errors were the data basis for the procedure described in the next section. The essential parameters to be reproduced in a statistical analysis are the variance of ξ, the population value of which is 1, and the variance of the ζ_t, the population value of which is 0.30.

In Study 2 we generated data according to the two-parameter logistic test model (or Birnbaum model) with invariant item difficulties:

P(Y_it = 1 | U, S_t) = exp[α_i (θ_t − β_i)] / {1 + exp[α_i (θ_t − β_i)]}, α_i > 0. (14)

In this model, i denotes the discrimination parameter which is a weight for the occasion-specific ability t =  + t.

All other details of the data generation are the same as in Study 1, except for the distribution of the item difficulty and discrimination parameters. In Study 2 these parameters were taken from an item bank of the mathematics test of the German armed forces. It was ensured that each item was used only once for a given 'person'.

2.1 Data Analyses

The simulation studies followed a two-factorial design. The first factor is the precision of the person-parameter estimate as the stop criterion. The second factor is the sample size, with samples of 100, 200, 500, 1000, 2000, 5000, and 10000 persons. The aim of the studies was to find out which sample size and which precision yield reliable estimates of the variances of the latent trait ξ and the latent state residuals ζ_t.

For estimating the variance of the latent trait variable ξ we used the following formula:

V̂ar(ξ) = [Ĉov(θ̂_1, θ̂_2) + Ĉov(θ̂_1, θ̂_3) + Ĉov(θ̂_2, θ̂_3)] / 3, (15)

where Ĉov(θ̂_t, θ̂_t′) denotes the sample covariance of the estimates of the occasion-specific person parameters at occasions t and t′. Note that all population covariances are equal to Var(ξ) [see Eq. (11)]. Analogously, we estimate the variance of the occasion-specific person-parameter estimates by:

V̂ar(θ̂) = [V̂ar(θ̂_1) + V̂ar(θ̂_2) + V̂ar(θ̂_3)] / 3, (16)

because the variances are all equal. The error variance is estimated by

V̂ar(ε) = (1/3) Σ_{t=1}^{3} (1/n) Σ_{u=1}^{n} SE(θ̂_ut)². (17)

Here, too, the population error variances are equal for all three occasions. Finally, the variance of ζ is estimated by

V̂ar(ζ) = V̂ar(θ̂) − V̂ar(ξ) − V̂ar(ε). (18)

While Equations (15) to (18) refer to sample data that would also be available in the normal empirical case, we may also consider estimates that use information only available in simulation studies, such as the true occasion-specific ability scores θ_ut of the persons on each of the three occasions. Using this information, we may also estimate the error variance by:

V̂ar(ε) = (1/3) Σ_{t=1}^{3} (1/n) Σ_{u=1}^{n} (θ̂_ut − θ_ut)². (19)

This formula yields the actual error variance of the estimators θ̂_t, at least for large person samples. In contrast to formula (17), this formula is only applicable in simulation studies. Comparing the results of formula (19) to those of formula (17) allows us to check the validity of the estimates of the standard errors of the person-parameter estimates. Remember that the theorems underlying the estimates of the standard errors make only asymptotic propositions (see, e.g., Steyer & Eid, 1993, p. 231, or Fischer, 1974, p. 297; Hoijtink & Boomsma, 1996), holding only "for the number of items approaching infinity …". The results of our simulation study emphasize this caveat.
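The estimators (15) to (19) translate directly into code; the simulated inputs below are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

# Illustrative implementation of the estimators (15)-(19). theta_hat holds the
# three person-parameter estimates per person, se their reported standard
# errors, and theta_true the true occasion-specific abilities, which are
# available only in a simulation.
rng = np.random.default_rng(7)
n = 5_000
xi = rng.normal(0.0, 1.0, n)
theta_true = xi + rng.normal(0.0, np.sqrt(0.30), (3, n))   # theta_t = xi + zeta_t
se = np.full((3, n), 0.4)
theta_hat = theta_true + se * rng.normal(0.0, 1.0, (3, n))

var_xi = np.mean([np.cov(theta_hat[t], theta_hat[s])[0, 1]       # Eq. (15)
                  for t, s in combinations(range(3), 2)])
var_theta_hat = np.var(theta_hat, axis=1, ddof=1).mean()         # Eq. (16)
var_eps = np.mean(se ** 2)                                       # Eq. (17)
var_zeta = var_theta_hat - var_xi - var_eps                      # Eq. (18)
var_eps_actual = np.mean((theta_hat - theta_true) ** 2)          # Eq. (19)
```

Comparing var_eps with var_eps_actual is exactly the check described above; in this sketch the two agree because the simulated standard errors are correct by construction, whereas in the CAT simulations they can diverge.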

2.2 Results

Studying Table 3, we can make the following observations: The quality of the estimates of the variance of ξ depends both on sample size and on the stop criterion. Only for large samples and high-precision stop criteria do the estimates of the variance of ξ approach the true value 1. These results can only be explained if the measurement error ε_t and the true ability θ_t correlate [see Eq. (11)]. In fact, correlating θ_1 and ε̂_2 := θ̂_2 − θ_2 in the one-parameter model, sample size 10000, precision 1.075, for instance, yields .128, although the true correlation should be zero. This and the corresponding correlations clearly indicate that the quasi-CTT approach investigated in this paper does not hold in the low-sample-size and low-precision conditions.

The validity of the estimates of the variance of the occasion-specific abilities displayed in the second super row of Table 3 is more difficult to evaluate, because these estimates depend not only on known parameters but also on the variance of the measurement error variable ε_t := θ̂_t − θ_t [see Eq. (12)].

The third super row in Table 3 displays the estimates of the measurement error variance according to Equation (17). Of course, these estimates are all smaller than the corresponding stop criterion. More important, however, are the systematic deviations of these estimates from the actual error variances [computed according to Eq. (19)] displayed in the fourth super row. The estimates seem to slightly underestimate the actual error variances in the large-sample and high-precision conditions, whereas they overestimate them considerably in the small-sample and low-precision conditions.

The estimates of the variance of ζ are displayed in the fifth super row of Table 3. They are quite acceptable for the two-parameter model with sample sizes of 500 and above. They are completely invalid for the lowest precision condition in the one-parameter model.

Finally, the last super row in Table 3 gives the average number of items needed to reach the stop criterion. It shows a clear superiority of the two-parameter model. This superiority increases with a decrease of the stop criterion.

  3. Discussion

Surprisingly, and in contrast to results from a previous simulation study with fixed (nonadaptive) item presentation (Steyer, Partchev, Seiß, Menz, & Hübner, 2000), the LST model presented only works satisfactorily for high precision in the estimates of the person parameters and large (person) sample sizes. In these conditions, the only data we need to estimate the occasion-specificity are estimates of the person parameter and its standard error on at least two occasions of measurement. Estimating the occasion specificity is important for personnel selection because IRT-based CAT estimates of person parameters are only estimates of occasion-specific abilities. Only knowledge of occasion specificity allows an informed judgment on the degree to which we may generalize from the occasion-specific ability to the underlying trait score. If there is a nonnegligible occasion-specificity, the standard errors of the person parameters overestimate the dependability of the measurements. Future empirical studies will explore this question for real test procedures used in the German armed forces.

The reasons why the procedure does not work satisfactorily in the small-sample-size and low-precision conditions seem to be twofold. First, there is a correlation between the person-parameter estimates and the errors of these estimates, which jeopardizes the quasi-CTT approach taken. Second, there is a considerable overestimation of the error variances in the low-precision conditions. Further simulations showed that these problems may be diminished by choosing larger steps than .25 at the beginning of the CAT procedure. Using steps of 1.0, or presenting randomly chosen items at the beginning, both yield better estimates of occasion specificity than the .25 steps investigated here, which are used in the German armed forces.

References

Eid, M. (1995). Modelle der Messung von Personen in Situationen [Models of measuring persons in situations]. Weinheim: Psychologie Verlags Union.

Eid, M., & Hoffmann, L. (1998). Measuring variability and change with an item response model for polytomous variables. Journal of Educational and Behavioral Statistics, 23, 193-215.

Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to the theory of psychological tests]. Bern: Huber.

Hoijtink, H., & Boomsma, A. (1996). Statistical inference based on latent ability estimates. Psychometrika, 61, 313-330.

Steyer, R., Ferring, D., & Schmitt, M. J. (1992). States and traits in psychological assessment. European Journal of Psychological Assessment, 8, 79-98.

Steyer, R., & Partchev, I. (2000). Latent state-trait modeling with logistic item response models. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation models: Present and future. Chicago: Scientific Software International.

Steyer, R., Schmitt, M., & Eid, M. (1999). Latent state-trait theory and research in personality and individual differences. European Journal of Personality, 13, 389-408.

Table 3. Results of the simulation studies

Study 1: One-parameter model / Study 2: Two-parameter model
precision (or stop criterion) / precision (or stop criterion)
1.075 (.48) / .494 (.72) / .338 (.80) / .171 (.88) / 1.075 (.80) / .494 (.80) / .338 (.83) / .171 (.89)
Estimates of Var(ξ) according to Eq. (15) [true value: 1.0]
100 / 0.518 / 0.695 / 0.743 / 0.680 / 0.629 / 0.637 / 0.640 / 0.686
200 / 0.546 / 0.775 / 0.758 / 0.725 / 0.599 / 0.620 / 0.637 / 0.691
500 / 0.696 / 0.889 / 0.883 / 0.906 / 0.732 / 0.755 / 0.772 / 0.896
1000 / 0.699 / 0.910 / 0.898 / 0.929 / 0.781 / 0.793 / 0.788 / 0.910
2000 / 0.750 / 0.941 / 0.904 / 0.953 / 0.786 / 0.790 / 0.830 / 0.929
5000 / 0.772 / 0.944 / 0.972 / 0.988 / 0.822 / 0.834 / 0.853 / 0.958
10000 / 0.768 / 0.941 / 0.968 / 0.988 / 0.837 / 0.828 / 0.860 / 0.962
Estimates of Var(θ̂_t) according to Eq. (16)
100 / 1.438 / 1.464 / 1.421 / 1.126 / 1.179 / 1.079 / 1.128 / 1.119
200 / 1.443 / 1.470 / 1.397 / 1.167 / 1.160 / 1.087 / 1.136 / 1.141
500 / 1.673 / 1.638 / 1.506 / 1.369 / 1.281 / 1.284 / 1.335 / 1.381
1000 / 1.692 / 1.623 / 1.518 / 1.370 / 1.295 / 1.284 / 1.304 / 1.360
2000 / 1.705 / 1.644 / 1.505 / 1.404 / 1.291 / 1.286 / 1.345 / 1.373
5000 / 1.740 / 1.675 / 1.575 / 1.442 / 1.334 / 1.337 / 1.356 / 1.401
10000 / 1.749 / 1.667 / 1.571 / 1.440 / 1.341 / 1.328 / 1.360 / 1.405
Estimates of the error variance according to Eq. (17)
100 / 0.929 / 0.463 / 0.323 / 0.167 / 0.272 / 0.271 / 0.236 / 0.148
200 / 0.928 / 0.463 / 0.323 / 0.167 / 0.265 / 0.268 / 0.232 / 0.146
500 / 0.921 / 0.463 / 0.323 / 0.167 / 0.265 / 0.267 / 0.231 / 0.149
1000 / 0.922 / 0.462 / 0.323 / 0.167 / 0.268 / 0.267 / 0.233 / 0.150
2000 / 0.922 / 0.462 / 0.323 / 0.167 / 0.267 / 0.267 / 0.234 / 0.152
5000 / 0.921 / 0.463 / 0.323 / 0.167 / 0.266 / 0.265 / 0.231 / 0.151
10000 / 0.921 / 0.463 / 0.323 / 0.167 / 0.266 / 0.265 / 0.232 / 0.151
Actual error variances according to Eq. (19)
100 / 0.783 / 0.445 / 0.302 / 0.163 / 0.261 / 0.300 / 0.218 / 0.140
200 / 0.744 / 0.432 / 0.325 / 0.171 / 0.249 / 0.268 / 0.219 / 0.147
500 / 0.747 / 0.445 / 0.329 / 0.175 / 0.265 / 0.264 / 0.244 / 0.156
1000 / 0.748 / 0.455 / 0.331 / 0.177 / 0.270 / 0.261 / 0.239 / 0.165
2000 / 0.746 / 0.453 / 0.332 / 0.178 / 0.270 / 0.267 / 0.242 / 0.165
5000 / 0.756 / 0.468 / 0.331 / 0.173 / 0.272 / 0.264 / 0.240 / 0.161
10000 / 0.762 / 0.463 / 0.328 / 0.172 / 0.263 / 0.264 / 0.244 / 0.162
Estimates of Var(ζ) according to Eq. (18) [true value: .30]
100 / -0.009 / 0.306 / 0.355 / 0.279 / 0.277 / 0.170 / 0.252 / 0.285
200 / -0.031 / 0.232 / 0.316 / 0.275 / 0.296 / 0.199 / 0.267 / 0.304
500 / 0.056 / 0.287 / 0.300 / 0.296 / 0.284 / 0.262 / 0.332 / 0.336
1000 / 0.071 / 0.251 / 0.297 / 0.273 / 0.247 / 0.224 / 0.284 / 0.300
2000 / 0.034 / 0.240 / 0.278 / 0.284 / 0.237 / 0.229 / 0.280 / 0.292
5000 / 0.047 / 0.268 / 0.281 / 0.288 / 0.246 / 0.238 / 0.272 / 0.291
10000 / 0.060 / 0.264 / 0.279 / 0.285 / 0.239 / 0.236 / 0.268 / 0.292
Mean number of items presented until stop criterion is reached
100 / 5.1 / 9.8 / 13.8 / 25.7 / 4.6 / 4.5 / 4.9 / 7.2
200 / 5.2 / 9.7 / 13.7 / 25.6 / 4.6 / 4.5 / 4.9 / 7.2
500 / 5.3 / 9.9 / 13.8 / 25.8 / 4.7 / 4.7 / 5.1 / 7.5
1000 / 5.3 / 9.9 / 13.8 / 25.7 / 4.8 / 4.8 / 5.1 / 7.4
2000 / 5.3 / 9.9 / 13.8 / 25.8 / 4.8 / 4.8 / 5.2 / 7.4
5000 / 5.4 / 9.9 / 13.8 / 25.8 / 4.8 / 4.8 / 5.2 / 7.4
10000 / 5.4 / 9.9 / 13.8 / 25.8 / 4.8 / 4.8 / 5.2 / 7.5

[1]Address correspondence to: Prof. Dr. Rolf Steyer, FSU Jena, Dept. of Psychology, Am Steiger 3, Haus 1, D-07743 Jena, Germany. Email: . The paper will be published in the proceedings of the International Military Testing Association (IMTA) available at imta frame. htm