Massachusetts Adult Proficiency Tests Technical Manual Supplement: 2008-2009
April L. Zenisky, Stephen G. Sireci, Andrea Martone, Peter Baldwin, and Wendy Lam
Center for Educational Assessment
University of Massachusetts Amherst
Prepared under a contract awarded by the
Massachusetts Department of Elementary and Secondary Education
7/7/09
Acknowledgements
This technical manual represents an extraordinary team effort involving many individuals from the Massachusetts Department of Elementary and Secondary Education Office of Adult and Community Learning Services (especially Jane Schwerdtfeger, Anne Serino, Donna Cornellier, and Bob Bickerton), graduate students in the Research and Evaluation Methods Program at UMASS (particularly Jenna Copella, Tie Liang, and Polly Parker), additional faculty and staff at UMASS (Craig Wells, Jennifer Randall, Ron Hambleton, and Lisa Keller), staff at the UMASS Center for Educational Software Development (Stephen Battisti, David Hart, Gordon Anderson, Cindy Stein, and Gail Parsloe), and the Office Manager for the Center for Educational Assessment—Kelly Smiaroski. Most of all, we are particularly grateful for the hard work of the hundreds of adult educators and ABE learners throughout Massachusetts who contributed to the development of the MAPT in many important ways and continuously provide feedback and support so that we can best serve the assessment needs of Massachusetts adult education programs.
Massachusetts Adult Proficiency Tests Technical Manual Supplement: 2008-2009
Table of Contents
I. Introduction and Purpose of this Supplement
II. Developing the Fiscal 2009 MAPT Assessments
Item Response Theory (IRT) Calibration
Scaling and Equating
Development of MAPT Modules and Panels
Creating Parallel Panels
III. Characteristics of Operational Item Banks
Distributions of Item Difficulty
Item Exposure Rates
IV. Standard Setting: Revising Cut-Scores for the NRS Educational Functioning Levels
V. Measurement Precision
Test Information Functions
Conditional Standard Error of Estimation
Decision Consistency and Decision Accuracy
VI. Validity Studies
Test Sessions Analysis
Test and Item Response Time Analyses
References
Appendices
Appendix A
List of Center for Educational Assessment Research Reports Related to MAPT
Appendix B
MAPT Test Administrations by Month: Fiscal 2008 and Fiscal 2009
I. Introduction and Purpose of this Supplement
Since January 2003, the Center for Educational Assessment at the University of Massachusetts Amherst (UMASS), under a contract awarded by the Massachusetts Department of Elementary and Secondary Education, has worked closely with the Department’s Office of Adult and Community Learning Services (ACLS) to develop achievement tests in math and reading that are appropriate for adult learners in Massachusetts. These tests, called the Massachusetts Adult Proficiency Tests (MAPT), have been operational since September 2006. Key features of these tests are that they are (a) aligned to the National Reporting System’s (NRS) Educational Functioning Levels (EFLs), (b) aligned with the curriculum frameworks established by ACLS and the adult basic education (ABE) community in Massachusetts, (c) designed to measure gain across the EFLs within the NRS, (d) consistent with the instruction in ABE classrooms as directed by these frameworks, and (e) developed with comprehensive input from teachers and administrators from the ABE community in Massachusetts.
In accordance with the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 1999), we have published several reports (see Appendix A) and two Technical Manuals (Sireci et al., 2006; Sireci et al., 2008) to inform test users about the technical characteristics of the MAPT assessments and to facilitate the interpretation of test scores. This Technical Supplement continues the tradition of documenting important and complex information about the MAPT testing program. We created a supplement instead of a new technical manual because many characteristics of the assessments remain the same. The tests continue to be targeted to the most recent versions of the MA ABE Curriculum Frameworks and are linked to the EFLs designated in the NRS. However, several improvements were made to the MAPT during the 2008-2009 fiscal year, and these improvements require documentation. Thus, this Technical Supplement describes changes that were introduced in the MAPT system since the last Technical Manual was written (Sireci et al., 2008). The chapters in this supplement describe our current procedures for recalibrating the item banks, assembling test forms, and establishing the cut-points for the EFLs. In addition, we document the technical characteristics of the assessments that were in place for the 2008-2009 fiscal year.
Like the earlier Technical Manuals, the purpose of this Supplement is to inform current and potential users of the MAPT assessments of their technical characteristics as completely as possible. This Supplement, in conjunction with the most recent version of the Technical Manual (Sireci et al., 2008), provides interested parties with all relevant information for evaluating the basic and technical characteristics of the MAPT, including (a) the purpose of the tests and their intended uses; (b) the content of the tests; (c) the processes used to develop, validate, and set standards on the tests; (d) the processes used to ensure assessment equity across diverse learners; (e) the technical characteristics of the tests, such as measurement precision, score scale characteristics, and multistage test administration design; and (f) how to properly interpret the test scores. Given that the Standards stress the importance of publicly available technical documentation, we believe it is important to maintain the tradition of comprehensive documentation of the MAPT testing program. Additional information regarding test administration and other policies related to assessing the proficiencies of adult learners in Massachusetts can be found on the ACLS web site. That web site contains links to practice tests and other information regarding the MAPT, as well as the Assessment Policies and Procedures Manual.
Although this Supplement contains comprehensive information about the MAPT, readers unfamiliar with the basic operations and characteristics of the MAPT will want to first review the MAPT Technical Manual (Version 2) (Sireci et al., 2008), since most of the important information pertaining to the design and content of the MAPT is not repeated here. The purpose of this Supplement is to document changes in the technical characteristics that occurred since fiscal 2008 and to document statistical information regarding test and item characteristics for fiscal 2008.
This Technical Manual is intended to be informative to several audiences; however, some sections may require familiarity with statistics and other aspects of psychometrics. The types of professionals who may be interested in the material in this manual include ABE teachers, administrators, and staff; representatives from the US Department of Education and other organizations interested in evaluating the technical merits of the MAPT; representatives from ABE programs in other states who want to understand the strengths and limitations of MAPT; and members of the psychometric and educational research communities who are interested in test evaluation and computer-based testing.
Although this manual represents a comprehensive description of the technical qualities and features of the MAPT, these tests are not intended to remain static. We expect the MAPT to evolve and improve as we continue to monitor its functioning. As changes occur, additional documentation will be produced to inform interested parties.
II. Developing the Fiscal 2009 MAPT Assessments
Item Response Theory (IRT) Calibration
The initial item calibrations for the MAPT in 2006 used a one-parameter (Rasch) model to place all items on a common scale. This model was chosen primarily because the sample sizes for the items tended to be small. After piloting items in the operational context and administering operational items for a year, sample sizes for many items were considerably larger, so we explored which IRT models were most appropriate for the MAPT data. Specifically, we evaluated the fit of 1-, 2-, and 3-parameter IRT models to the data for each item and also explored modifications of these models, such as fixing one or more parameters to predetermined values or using item-specific priors. Based on analyses of model fit and measurement precision, we concluded the 3-parameter logistic (3PL) IRT model was best for estimating most item parameters and for estimating proficiency scores for examinees. For items with small sample sizes or estimation problems, a modified 3PL model was used in which the discrimination and/or lower-asymptote parameters were fixed to reasonable values. The typical (unmodified) 3PL model is defined as (Hambleton & Swaminathan, 1985):

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{D a_i (\theta - b_i)}}{1 + e^{D a_i (\theta - b_i)}}$$

where:
$P_i(\theta)$ is the probability of a correct answer on item $i$;
$\theta$ is the examinee proficiency parameter;
$a_i$ is the item discrimination parameter;
$b_i$ is the item difficulty parameter;
$c_i$ is the lower asymptote or "pseudo-guessing" parameter;
$D$ is a scaling factor equal to 1.7; and
$e$ is the base of the natural logarithm.
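To make the 3PL formula concrete, the short Python sketch below computes the modeled probability of a correct response. It is purely illustrative; the parameter values shown are hypothetical rather than actual MAPT item parameters.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model defined above.

    theta : examinee proficiency
    a     : item discrimination parameter
    b     : item difficulty parameter
    c     : lower asymptote ("pseudo-guessing") parameter
    D     : scaling factor (1.7)
    """
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# Hypothetical item: average difficulty (b = 0), a = 1.0, c = .20.
# An examinee of average proficiency (theta = 0) has a .60 chance of answering correctly.
print(round(p_correct_3pl(theta=0.0, a=1.0, b=0.0, c=0.20), 2))  # 0.6
```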
Discrimination parameters (a-parameters) and pseudo-guessing parameters (c-parameters) can be difficult to estimate when examinee samples are small, so in some cases the model was further constrained by fixing a- and/or c-parameters to a predetermined value (1.0 or .20, respectively) or by placing a relatively informative prior distribution on those parameters as needed. After fitting the initial models to the data, we comprehensively reviewed the fit of the model for each item to the examinee data and evaluated the residual plots for each item. In some cases, the priors were revised. Analysis and comparison of residuals across the various models was a key determinant in deciding the best IRT model for each item. Items judged to have poor fit were eliminated from the operational pool.
This process of calibrating and evaluating all items in the item pool for each MAPT test (Math or Reading) is repeated once each year, typically at the start of the fiscal year (July), to add new items to the MAPT item banks and reconstitute the panels. Every student who takes a MAPT test responds to five pilot items that do not count toward his or her score. All pilot items are monitored, and once an item's sample size reaches a threshold of 300 examinees, it is evaluated using classical item analysis. If its classical statistics (i.e., item difficulty and discrimination estimates) appear appropriate, the item is replaced in the pilot positions by other pretest items and is considered ready for calibration during the summer, when all items are calibrated onto the IRT scale and evaluated for inclusion in the operational tests. In this way, we can continuously try out items throughout the year.
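The classical screening of pilot items might be sketched as follows. This is not the operational CEA code; the response matrix, column names, and scoring conventions are hypothetical, and only the 300-examinee threshold comes from the text above.

```python
import numpy as np
import pandas as pd

def classical_item_stats(responses: pd.DataFrame, min_n: int = 300) -> pd.DataFrame:
    """Classical item analysis for pilot items.

    `responses` is a hypothetical examinee-by-item matrix of 0/1 item scores,
    with NaN where an examinee did not see a given pilot item.
    Items are evaluated only once at least `min_n` examinees have responded.
    """
    total_score = responses.sum(axis=1)  # simple total score per examinee
    rows = []
    for item in responses.columns:
        scores = responses[item].dropna()
        if len(scores) < min_n:
            continue  # keep piloting; not enough data yet
        p_value = scores.mean()  # classical difficulty: proportion correct
        # classical discrimination: correlation of item score with total score
        pt_biserial = np.corrcoef(scores, total_score.loc[scores.index])[0, 1]
        rows.append({"item": item, "n": len(scores),
                     "p_value": p_value, "pt_biserial": pt_biserial})
    return pd.DataFrame(rows)
```

An item-rest (rather than item-total) correlation could be substituted in the point-biserial step; the text above does not specify which variant is used.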
Steps Involved in the Calibration Process
The calibration process for each subject area involves several steps. First, a data file for each subject area is created that contains all students' responses to all MAPT items, both operational and pilot items. Next, these files are cleaned by eliminating illegitimate cases (e.g., administrative records). We then exclude any items that meet any of the following criteria (a filtering sketch follows the list):
(a) were responded to by fewer than 150 students,
(b) had point-biserial statistics less than .10, or
(c) prevented the calibration run from converging.
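Continuing the hypothetical statistics table from the earlier sketch, exclusion rules (a) and (b) could be applied as shown below; rule (c), non-convergence, can only be detected during the calibration run itself.

```python
import pandas as pd

def items_to_exclude(stats: pd.DataFrame) -> list:
    """Flag items that fail rule (a) or (b) of the cleaning step.

    `stats` is the hypothetical data frame produced by classical_item_stats(),
    with columns "item", "n", and "pt_biserial".
    """
    too_few_responses = stats["n"] < 150                 # rule (a)
    weak_discrimination = stats["pt_biserial"] < 0.10    # rule (b)
    return stats.loc[too_few_responses | weak_discrimination, "item"].tolist()
```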
Once the data are cleaned, an initial calibration is conducted. To maximize the probability that the calibration will successfully converge, we fix the slope and asymptote parameters for some items according to the following rules (a sketch of applying these constraints appears after the list):
(a) items with exposure < 400 have their slopes fixed to 1.0; and
(b) items with exposure < 700 have their lower asymptotes fixed to .20.
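The exposure-based constraints might be expressed as a small helper like the one below. The function and its return format are illustrative only; how the constraints are actually communicated to BILOG-MG is not shown.

```python
def calibration_constraints(exposure: int) -> dict:
    """Translate an item's exposure (number of examinees who saw it) into the
    constraint rules stated above. The thresholds 400 and 700 and the fixed
    values 1.0 and .20 come from the text; everything else is an illustrative
    convention."""
    return {
        "fix_slope_at": 1.0 if exposure < 400 else None,              # rule (a)
        "fix_lower_asymptote_at": 0.20 if exposure < 700 else None,   # rule (b)
    }

# Example: an item seen by 520 examinees keeps a free slope but a fixed asymptote.
print(calibration_constraints(520))  # {'fix_slope_at': None, 'fix_lower_asymptote_at': 0.2}
```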
The dataset is then calibrated using BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2002). The default BILOG priors are used, except as noted above. This initial calibration allows the item parameter estimates for the operational items to be updated based on the most recent data and places the item parameter estimates for the pilot items on the same scale.
The initial calibration is then evaluated by examining how well the model fits the data for each item. Items that are not fit well by the model may be eliminated, or, if their slopes or asymptotes were fixed, those parameters may be freely estimated in a subsequent calibration in which fit is re-evaluated. All items that "survive" the iterations following the initial calibration are entered into a final calibration, again using BILOG-MG with its default priors, except as noted above.
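The item-level decision logic described in this paragraph can be summarized in a brief sketch. The inputs would come from the residual analyses and the constraint rules above; the helper is a paraphrase of the prose, not actual CEA software.

```python
def post_calibration_action(fit_is_poor: bool, parameters_were_fixed: bool) -> str:
    """Decide what happens to an item after the initial calibration, following
    the review process described above (a sketch, not the operational rules)."""
    if not fit_is_poor:
        return "retain with current parameter estimates"
    if parameters_were_fixed:
        return "free the fixed slope/asymptote and re-evaluate fit in a new calibration"
    return "eliminate from the operational pool"
```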
Refreshing the MAPT Item Banks: Fiscal 2008
The content specifications for the MAPT Assessments are described in the current version of the MAPT Technical Manual (Sireci et al., 2008). In this Supplement we describe how the item banks were refreshed in fiscal 2008 to develop new panels for the MAPT for Reading and the MAPT for Math.
With respect to items that are refreshed or replaced each year, all items in the operational bank are calibrated and evaluated (for fit to the IRT model) along with all freshly piloted items to select the items that will become operational when new test panels are rolled out. New test panels are scheduled to be released just prior to September 1 each year. This schedule allows us to take advantage of all items that are pilot-tested during a given fiscal year. The majority of the items that comprise the new panels are items that were operational the previous year, but the modules and paths are uniquely reconfigured each year. Details regarding the numbers of rolled-over and new operational items for the current MAPT are provided in Table II-1.
For fiscal year 2008, there were 362 operational math items. Sixty-nine of these items (19%) were replaced in FY09. Only 5% of the items that were rolled over from FY08 to FY09 remained in the same module (i.e., same level, same stage). For Reading, there were 341 operational items in FY08. Of these items, 115 (34%) were replaced in FY09. Nine percent of the items that were rolled over from FY08 to FY09 remained in the same module (same level, same stage). These data are summarized in Table II-1. For fiscal year 2009, 70 new math items and 94 new reading items became operational. These new items represent 19% and 29%, respectively, of the operational items in each subject area. We also have a pool of items that are ready to be integrated into operational tests (126 math items and 219 reading items) beginning in fiscal year 2010.
Table II-1
Comparison of FY08 and FY09 Operational Items
Item Category / Math (# Items) / Reading (# Items)
FY08 Operational Items / 362 / 341
Items Rolled Over to FY09 / 293 / 226
Rollovers that remained in the same module / 16 / 21
Same item order in module / 0 / 0
New Operational Items for FY09 / 70 / 94
FY09 Operational Items / 363 / 320
Future Potential Operational Items / 126 / 219
While we have no fixed number or percentage of items that are refreshed each year, we continue to embed pilot items on the operational tests and expect to add 70-100 new items each year. At the end of fiscal 2007, 13% of the piloted math items and 6% of the piloted reading items were retired due to poor statistics. The fiscal 2008 pool of operational items includes 51% of the math items and 45% of the reading items that we piloted in the previous fiscal year. The remaining items that were piloted but did not become operational in fiscal 2009 will be recalibrated along with the items being piloted this year and could become operational in fiscal 2010 or at some later point.
Scaling and Equating
In this section, we explain how items are calibrated onto the MAPT score scale, including how new items are added to the item banks and placed on the appropriate scale. We also describe the transformation from the item response theory (theta) scale to the 200-700 MAPT score scale.
The MAPT for Reading and the MAPT for Mathematics and Numeracy each have their own unique score scale. Each score scale ranges from 200 to 700, with each 100-point interval corresponding to an EFL. Regardless of subject, the underlying IRT proficiency (theta) scale was defined by fixing the mean difficulty level of all items written for Low Intermediate to be zero (i.e., the theta scale was centered at the mean difficulty of all items written for Level 3). This occurred during the 2006 Rasch calibration, so defining the origin of the scale was sufficient to identify the model.
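Because each reporting scale is a linear transformation of the underlying theta scale, the rescaling step can be sketched as below. The slope and intercept shown are illustrative placeholders only; the actual MAPT scaling constants are not given in this section.

```python
def theta_to_mapt(theta: float, slope: float = 100.0, intercept: float = 450.0) -> int:
    """Linearly rescale an IRT proficiency estimate onto a 200-700 reporting scale.

    The slope and intercept are hypothetical values chosen for illustration,
    not the constants used operationally for the MAPT score scales.
    """
    scaled = slope * theta + intercept
    return int(round(min(700.0, max(200.0, scaled))))

# Example: an examinee at the scale origin (theta = 0) would score 450 under
# these illustrative constants.
print(theta_to_mapt(0.0))  # 450
```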
Two advantages of IRT models are that item difficulty and examinee proficiency are on a common, equal-interval scale and that the model parameters are invariant, up to a linear transformation, for a given population of examinees. The qualifier "up to a linear transformation" is necessary because linearly transforming the scale does not affect the probability of a correct response. That is,

$$c_i + (1 - c_i)\,\frac{e^{D a_i (\theta_j - b_i)}}{1 + e^{D a_i (\theta_j - b_i)}} = c_i^* + (1 - c_i^*)\,\frac{e^{D a_i^* (\theta_j^* - b_i^*)}}{1 + e^{D a_i^* (\theta_j^* - b_i^*)}},$$