Identifying the Sources of Person Misfit:

Combining Quantitative and Qualitative Approaches

Alexandra Petridou

Learning, Teaching and Assessment Research & Teaching Group

University of Manchester

Seminar paper presented at a seminar of the University of Manchester, School of Education, Learning, Teaching & Assessment Research and Teaching Group, November 2004

Abstract

Person-fit statistics aim to detect aberrant response behaviour, as such behaviour may have detrimental effects on the quality and validity of measurement. Although a large number of indices have been developed to identify aberrant examinees, the reasons that lead an examinee to provide aberrant responses remain largely unknown. This is because finding an unexpected or unusual pattern does not provide an explanation for the aberrance. Many authors have suggested possible reasons that may lead to aberrant response behaviour, but these have not been systematically investigated. This study combines quantitative and qualitative approaches in order to identify potential reasons for person misfit using real data. The data come from a mathematics test containing 45 constructed-response and multiple-choice items, and from two questionnaires used to gather background information about the examinees. The quantitative part adopts two methods to analyse the performance of individuals on the test. The first examines aberrant behaviour under the Rasch model using the Infit and Outfit person-fit statistics; on the basis of the Infit and Outfit values a new variable is constructed that becomes the response variable in a two-level model. The second method adopts the multilevel logistic regression approach proposed by Reise (2000). The qualitative part follows up and interviews two examinees who provided misfitting response patterns. The main outcomes of the study to date are: (i) the multilevel methodology is promising but will need a larger data set to yield substantive results, and (ii) the case studies yielded a number of explanations of misfit in this context that are worthy of further investigation.

1 Introduction

When examinees take a test their responses are expected to conform to some standard of reasonableness (Smith, 1986): for instance, a response pattern that involves a significant number of wrong answers to `easy' questions but right answers to `hard' questions would be regarded as `aberrant'. Such an aberrant response pattern would be signalled by a high `misfit' statistic computed from the deviations of these responses from those `expected'. Detecting such unexpected response patterns has been investigated by many researchers over recent decades. The rationale behind this effort is the claim that the test scores of examinees with unexpected response patterns may fail to provide a useful and valid measure of ability.

Unexpected response patterns may result from student misconceptions, cultural differences, atypical schooling, language deficiencies, anxiety, lack of motivation, faulty test construction, ethnicity, external distractions, fatigue and many more.

A large number of indices have been developed in the literature to identify aberrant response patterns. However, statistical misfit is only an indication of a problematic examinee performance and nothing more. Person-fit statistics generally indicate whether someone has an unusual response pattern, not why such a pattern has occurred. The statistical model cannot tell us more about the reasons that led an examinee to generate these responses: even if a pattern is statistically identified as aberrant, the researcher cannot always be sure of the kind of aberrance underlying test performance. Most researchers seem to agree that the factors causing the aberrance can be very complex and that their identification requires something more than mere statistics. This is one of the main reasons why the factors that lead some examinees to generate aberrant responses remain largely unknown. Hulin, Drasgow and Parsons (1983) argued that "since the underlying causes of aberrance are usually unknown little meaning can be given to an individual's test score determined to be aberrant by an appropriateness index" (p.149).

The purpose of my study is to examine the validity of test scores from the point of view of `misfitting' or `aberrant' students and classes. Validity refers to the “degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests”. This definition of validity is given by the Standards for Educational and Psychological Testing (AERA et al., 1999). What becomes obvious is that the main emphasis is on scores, not on tests, as tests do not have reliabilities and validities. Test responses have these properties, and they are a function “not only of the items, tasks or stimulus conditions but of the persons responding and the context of measurement” (Messick, 1993, p.14). There are many different sources of evidence that could be used in evaluating a proposed interpretation of test scores, and these different sources can shed light on different aspects of validity. This study examines the validity of test scores from the point of view of misfitting response patterns, as the test scores of these examinees may be invalidly low (because of language deficiencies, alignment error etc.) or invalidly high (because of cheating, copying etc.). In order to give meaning to these scores and make valid inferences based on them, we need to know the reasons behind this unexpected responding. Otherwise no trust can be put in the inferences based on these scores.

In particular, this study will address the following question: what factors lead examinees to provide unexpected response patterns and, more specifically, to what extent is misfit attributable to class-level variables or to individual characteristics? The design of the study involves hierarchical statistical models to identify such classes and individual students, followed by case studies of particular students and classes to elicit insights into the causes of their statistical `aberrance'. The study is expected to be completed in September 2006.

This paper reports some preliminary findings from a pilot study that took place between May and July 2004. The pilot study was conducted in order to run a small-scale version of the main study and to pilot test the research instruments that will be used in the main study. Specifically, the objectives of the pilot study were to:

  • Pilot test the research instruments, i.e. the two questionnaires and the semi-structured interview schedule
  • Assess the proposed data analysis techniques (both qualitative and quantitative) in order to examine whether these are feasible and
  • Uncover any potential problems in the research process

In essence this was a feasibility study: it aimed to provide feedback on the methodology and methods before the main study, as well as some preliminary results.

2 Study Sample

Four hundred and sixty-nine Year 6 pupils, nested within 22 classes from various primary schools in the Manchester area, composed the sample for this study. However, background information was available for only 233 of these pupils. The schools from which pupils were drawn differed in terms of socio-economic background, ethnic composition and national curriculum levels in Mathematics. The study used a convenience sample, chosen for spatial proximity and ease of access.

3 Design

A two-phased methodology was used, as this pilot study combined quantitative and qualitative methods (Figure 1). Phase A was a quantitative study that aimed to identify examinees with misfitting response patterns and then to study the statistical relationships between misfit and various individual and class background variables. Phase B started one month after the first phase and comprised case studies of specific individuals. The purpose of the case studies was to build a profile of these pupils, to identify any existing relationships, patterns and mechanisms and, most importantly, to try to explain the causes of their statistical misfit.

Figure 1: The design of the methodology and methods

4 Phase A: Quantitative Study

4.1 The Research Instruments

The quantitative part of this study included the following instruments: a mathematics test (the MaLT test), a student questionnaire and a teacher questionnaire.

4.1.1 The Test

The test used to collect the data was obtained from the Mathematics for Learning and Teaching (MaLT) project at the University of Manchester, which collects diagnostic information and standardises mathematics tests for years Reception to 9. The test used in the pilot study was the Year 6 test, designed to cover the full range of levels and content of the mathematics programme of study for Year 6. It comprised 45 one-mark questions and was divided into two parts, a calculator and a non-calculator section. The pupils had approximately 45 minutes to an hour to complete the test.

4.1.2 The Questionnaires

This study involved two questionnaires, a student and a teacher questionnaire. The purpose of the questionnaires was to collect background information that would be entered as explanatory variables in a multilevel model during the analysis. The questionnaires sought to elicit background variables believed to be of significance, as they come up often in the literature as possible sources of aberrance. The student questionnaire was administered immediately after the completion of the test, while the teacher questionnaire was completed while the pupils were taking the test. Pilot testing the questionnaires was considered of paramount importance in order to check the clarity and wording of the questionnaire items, their difficulty and the time needed to complete each questionnaire.

The student questionnaire was designed to elicit information about pupils' sex, ethnicity, socio-economic background in terms of parents' education, language background (i.e. whether a language other than English was spoken at home), anxiety before taking the test, motivation and effort in taking the test, and the perceived difficulty and speededness of the test. Motivation and anxiety were measured with the help of existing scales found in the literature. The teacher questionnaire elicited information on ethnicity, gender, years of experience, teachers' education and training, and the instructional methods used in Mathematics. In both questionnaires respondents were asked to indicate their interest in participating in follow-up interviews.

In the following section the data analysis methods will be described and some preliminary findings will be presented.

4.2 Analysis

Two methods were piloted for the statistical study of person-fit. The first involved identifying examinees with misfitting response patterns using two general-purpose fit statistics, the Infit Mean Square and the Outfit Mean Square. These two person-fit statistics were used to examine whether the data, and more importantly individual pupils, were consistent with the Rasch model. In order to distinguish inconsistent performances (so that they could be examined further) a criterion had to be set. This criterion was identified using a simulation method proposed by Linacre (personal communication), while keeping in mind the limitations of setting a purely mathematical criterion.
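For dichotomous Rasch data these two statistics have a simple form: the Outfit mean square for a person is the average of the squared standardised residuals across items, while the Infit mean square is the information-weighted equivalent. The actual values in this study were produced by standard Rasch software; the short sketch below (Python, with hypothetical inputs) is included only to illustrate how the statistics are defined.

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch probability of a correct response given ability theta and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def person_fit(responses, theta, difficulties):
    """Infit and Outfit mean squares for one person's 0/1 response vector."""
    p = rasch_prob(theta, difficulties)      # expected scores under the Rasch model
    w = p * (1.0 - p)                        # model variance of each response
    sq_resid = (responses - p) ** 2          # squared residuals
    outfit = np.mean(sq_resid / w)           # unweighted mean square (sensitive to outliers)
    infit = np.sum(sq_resid) / np.sum(w)     # information-weighted mean square
    return infit, outfit

# Hypothetical example: a pupil of average ability on five items of increasing difficulty.
print(person_fit(np.array([1, 1, 0, 1, 0]), 0.0, np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))
```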

On the basis of the Infit and Outfit values, two new variables were created: for each statistic, examinees with a value above 1.3 (the criterion from the simulation method) were coded as 1, and otherwise as 0. These new variables became the response variables in a two-level logistic model in which individuals were nested within classes. Explanatory variables could then be entered into the model to examine the degree to which specific variables at different levels could account for the generation of aberrant response patterns.
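As a concrete illustration, the dichotomisation might look as follows (Python; the data frame, values and column names are hypothetical, not the actual study data):

```python
import pandas as pd

# Hypothetical person-fit output: one row per pupil (values are made up for illustration).
fit = pd.DataFrame({
    "pupil_id":   [1, 2, 3, 4],
    "infit_msq":  [0.92, 1.45, 1.10, 1.62],
    "outfit_msq": [0.88, 1.71, 1.05, 1.33],
})

CUTOFF = 1.3  # criterion obtained from the simulation method

fit["infit_misfit"] = (fit["infit_msq"] > CUTOFF).astype(int)    # 1 = flagged by Infit
fit["outfit_misfit"] = (fit["outfit_msq"] > CUTOFF).astype(int)  # 1 = flagged by Outfit

# These binary indicators become the response variables of the two-level logistic model.
```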

The second method for studying person misfit involved a new approach proposed by Reise (2000), which incorporates a multilevel approach to IRT person-misfit detection. Specifically, this method uses a multilevel logistic model to evaluate the fit of an IRT model to an individual’s item response pattern after item and person parameters have been estimated using standard IRT software. Item responses are treated as nested within individuals, and a multilevel logistic regression is used to estimate a person-response curve that relates the examinee’s response probability to item difficulty. This person-response curve models how an individual’s item endorsement rate diminishes as a function of item difficulty (Reise, 2000). The slope of the individual’s person-response curve reflects the consistency of the individual’s response pattern and is an indicator of person fit.

Before presenting the two methods in more detail, together with some preliminary results, it is important to report some descriptive statistics about the sample. Due to missing data the final sample size was reduced from 233 to 170 pupils. Of the 45 items on the test, only 44 were included in the analysis, because one item contained an error in its instructions and, in fairness to the pupils, was excluded. Table 1 presents the sample’s characteristics.

Table 1: Demographic data of the sample

Background Variables / Percentage
Gender
  Male / 51.8
  Female / 48.2
Ethnicity
  White / 57.1
  Black / 30.0
  Asian / 12.9
Other language than English spoken at home
  Yes / 32.8
  No / 68.2
Mean estimate of ability = -0.0867 logits

4.2.1 Method 1

Although the first method seemed promising, one very important limitation arose in its application, namely sample size. Multilevel modelling with small sample sizes raises questions about the accuracy of the parameter estimates. Kreft (1996) suggested the “30/30 rule” as a rule of thumb for sample sizes: to be on the safe side, researchers should try to obtain a sample of at least 30 groups with at least 30 individuals per group. For certain applications this rule of thumb can be modified; research has shown that, for accuracy and high power, it is more important to have a large number of groups than a large number of individuals per group.

When applying Method 1 the whole sample was used, i.e. 469 pupils nested within 22 classes. From the analysis outputs it was obvious that parameter estimation was very unreliable (i.e. very large standard errors), and as a result model building could not be pursued further. However, the intra-class correlation could be estimated from the null model. Table 2 reports the intra-class correlations calculated from the Infit and Outfit two-level binary models. The intra-class correlation indicates the proportion of the overall variation in misfit that is attributable to the higher-level units, in this case classes. Strictly speaking, the intra-class correlation (ρ) is not really a correlation but a measure of the strength of association.
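For a two-level model with a binary response, the intra-class correlation is usually computed on the latent-response scale, where the level-one variance is fixed at π²/3 ≈ 3.29. Assuming this standard formulation is the one used here, the quantity reported in Table 2 is

\rho = \frac{\sigma^{2}_{u0}}{\sigma^{2}_{u0} + \pi^{2}/3},

where \sigma^{2}_{u0} is the estimated between-class variance from the null model.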

Table 2: Intra-class correlations of two binary multilevel models

Model / Response variable / Intra-class correlation
Model 1 / Infit MSQ / 0.086
Model 2 / Outfit MSQ / 0.044

Table 2 shows that roughly 8.6% of the total variance in misfit, as indicated by the Infit person-fit statistic, can be explained by class membership, while 4.4% of the total variance in misfit, as indicated by the Outfit statistic, can be explained by class. These results indicate that most of the variance in misfit, as indicated by the Infit and Outfit statistics, lies at the individual level. The variation in misfit attributed to class membership as indicated by the Infit statistic is nevertheless considerable, and explanatory variables at the class level could have been entered to explain such variation. The variation in misfit attributable to class membership as indicated by the Outfit statistic, however, is quite small; consequently there is little point in entering class-level predictor variables, although variables at the individual level could still be entered. Because this study's sample was quite small, this kind of model building could not be pursued, and background information was in any case not available for all the pupils and classes in the sample. This method will be attempted in the main study, where the dataset will be much larger and background information will be available at the individual and class level for the whole sample.

4.2.2 Method 2

Method 2 treated items as nested within pupils, and an additional level was added, with pupils nested within classes. Pupils and classes thus became the level-two and level-three units respectively. In Method 2, data from only the 233 pupils for whom background information was available were used.

The basic multilevel logistic model used in Method 2 is the following.

Basic Model

Level-One

\log\left(\frac{P_{ijk}}{1 - P_{ijk}}\right) = \pi_{0jk} + \pi_{1jk} D_{ijk}    (1)

Level-Two

\pi_{0jk} = \beta_{00k} + \beta_{01k}\,\theta_{jk} + r_{0jk}    (2)

\pi_{1jk} = \beta_{10k} + \beta_{11k}\,X_{jk} + r_{1jk}    (3)

Level-Three

\beta_{00k} = \gamma_{000} + \gamma_{001}\,W_{k} + u_{00k}    (4)

\beta_{01k} = \gamma_{010} + \gamma_{011}\,W_{k} + u_{01k}    (5)

\beta_{10k} = \gamma_{100} + \gamma_{101}\,W_{k} + u_{10k}    (6)

\beta_{11k} = \gamma_{110} + \gamma_{111}\,W_{k} + u_{11k}    (7)

(with an analogous level-three equation for any further level-two coefficient)

In the level-one model the response variable is the probability of item endorsement, P_{ijk}, expressed as the natural log odds of success, which is regressed on item difficulty (D_{ijk}). The intercept term (\pi_{0jk}) is the log odds of item endorsement when item difficulty equals zero. The slope term (\pi_{1jk}) indicates how the item endorsement rate decreases as item difficulty increases. It is the empirical Bayes estimates of the person slope parameters that are used as indicators of person fit (\pi_{1jk} values near zero indicate poor person fit). The intercept and slope parameters for each individual are treated as random coefficients in this model. Thus the level-one regression coefficients (\pi_{0jk} and \pi_{1jk} for the jth pupil in the kth class) become the level-two response variables, which are regressed on the latent trait (\theta_{jk}) and on level-two variables (X_{jk}) respectively. The terms r_{0jk} and r_{1jk} represent the level-two residuals. Each of the coefficients defined at level two (\beta_{00k}, \beta_{01k}, \beta_{10k}, \beta_{11k}) in turn becomes a response variable in the level-three model. The terms W_{k} are level-three variables, while the terms u_{00k} to u_{11k} are level-three residual terms.
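The full three-level estimation requires specialised multilevel software, but the core idea of the person-response curve can be illustrated with an ordinary per-pupil logistic regression of the 0/1 responses on item difficulty. The sketch below (Python, using statsmodels) is a simplified approximation included for illustration only; the study itself relies on the empirical Bayes slope estimates from the multilevel model, and the variable names and simulated data here are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def person_response_slope(responses, difficulties):
    """Fit one pupil's 0/1 responses against item difficulty and return the slope.

    A clearly negative slope means the pupil's success rate falls as items get
    harder (a consistent pattern); a slope near zero suggests possible misfit.
    """
    X = sm.add_constant(difficulties)            # intercept plus item difficulty
    result = sm.Logit(responses, X).fit(disp=0)  # per-pupil logistic regression
    return result.params[1]                      # slope on item difficulty

# Illustrative use with made-up difficulties and responses for one pupil of average ability.
rng = np.random.default_rng(0)
difficulties = np.linspace(-2.0, 2.0, 44)
responses = (rng.random(44) < 1.0 / (1.0 + np.exp(difficulties))).astype(int)
print(person_response_slope(responses, difficulties))
```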