Standard Setting via the Fat Anchor
Professor Jim Ridgway
School of Education, University of Durham, UK

Abstract: The paper begins by considering some criteria for the design of assessment systems, such as validity, reliability and practicality. Components of assessment systems in educational contexts are considered, emphasising the design of tasks, conceptual frameworks and calibration systems. If assessment is to be ‘authentic’ on most definitions of education, new tasks must be given on each testing occasion. This gives rise to problems comparing the performances of students on different testing occasions. A new solution to this problem is offered in the form of the Fat Anchor. This is a super test, components of which are administered to samples of students on each testing occasion. Performances on adjacent testing occasions are compared by reference to performance on the Fat Anchor. Initial work in the USA to establish the Fat Anchor in mathematics both for year to year comparability at the same age level, and also across age levels (‘vertical scaling’) is described.

Note: The ideas in the paper apply to all aspects of educational testing. Throughout this document, testing mathematical attainment will be used to exemplify principles.

1. Criteria for Assessment Design

As with most areas of design, the quality of the design can be judged in terms of 'fitness for purpose'. Tests are designed for a variety of purposes, and so the criteria for judging a particular test will shift as a function of its intended function; the same test may be well suited to one purpose and ill-suited to another. A loose set of criteria can be set out under the heading of 'educational validity'. Educational validity encompasses a number of aspects which are set out below.

Construct validity: there is a need for a clear statement of what is being assessed, which aligns with informed ideas in the field, and for supporting evidence on the conceptual match between the domain definition and the tests used to assess competence in that domain.

Reliability: tests must measure more than noise.

Practicality: few designers work in arenas where cost is irrelevant. In educational settings, a major restriction on design is the total cost of the assessment system. The key principle here is that test administration and scoring must be manageable within existing financial resources, and should be cost-effective in terms of the educational benefit to students.

Equity: equity issues must be addressed - inequitable tests are (by definition) unfair and undesirable.

2. Components of Tests in Educational Settings

Test systems have a number of components, which include:

  • tasks (the building blocks from which tests are made);
  • conceptual frameworks (descriptions of the domain, grounded in evidence);
  • tests (which are assemblies of tasks with known validity and reliability in the critical score range);
  • test administration systems;
  • calibration systems (which are ways to look at standards);
  • cultural support systems (which include the ways in which information about tests is communicated to the educational community, materials which allow students to become familiar with test content, materials which support teachers who are preparing students to take the test, and so on).

Here, we focus mainly on tasks, conceptual frameworks and calibration systems.

3. The Design and Development of Tasks

There is a need to create tasks, assemble tasks into tests which 'work', check out what they really measure, then use them. This process involves a number of different phases. For tests appropriate for educational uses, the nature of the tasks needs to be considered carefully. The emphasis on a myriad of short items which characterises many psychometric tests is quite inappropriate. Such items are inappropriate for a number of reasons. The most obvious ones derive from considerations of educational validity. A core activity in mathematics is to engage with problems which take more than a minute or so to complete, and which require the solver to engage in extended chains of reasoning. A second conceptual demand is that students make choices about which aspects of their mathematical knowledge to use, and are required to show that they can integrate knowledge from different aspects of mathematics, for example, across algebra and geometry. Short items which atomise the domain fail to address these demands. While there is a clear case to be made that short items which assess technical knowledge have a place in mathematics tests, there are considerable dangers in an over-dependence on tasks of this type. Tests which are made up exclusively of short items not only violate current conceptions of the domain they set out to assess, but also run into the problem of combinatorial explosion - if the domain of mathematics is defined as a collection of technical skills, a very large number of tasks is required to sample the different aspects of mathematical technique.

4. The Design and Development of Tests

4.1 Responding to different design briefs

Tests serve a variety of functions which are more or less easy to satisfy. Putting students into a rank order is usually easy, but setting cut points reliably (for pass/fail decisions; to identify high flyers) is more difficult. Monitoring standards over time raises considerable conceptual problems (discussed below); despite these conceptual difficulties, there are often political pressures for the educational community to determine whether standards are rising or falling over time.

4.2 The Design and Development of Conceptual Frameworks

Test development should be associated with the development of a description of the conceptual domain, and an account of the ways that the choice of tasks in any test reflects the domain being sampled. In psychometrics, the plan used to support the choice of test items is commonly referred to as the test 'blueprint'. Test developers differ a good deal in terms of the explicitness of their blueprint, its openness to public scrutiny, in terms of the explicitness of the match between specific test items and the blueprint, and in terms of the extent to which the match is tested psychometrically.

The MARS group has developed a 'Framework for Balance' for mathematics tests. This is a conceptual analysis of the domain of mathematics, which supports the design of tests, given a collection of tasks. Such frameworks also support conversations and negotiations with client groups for whom assessment systems are designed. No short test can hope to cover every aspect of a substantial domain; tests almost always sample from the domain. The key task for test constructors is to ensure reasonable sampling across core aspects of the domain.

With any conceptual framework, there is a need to see if it is grounded in any psychological reality. If one decides on theoretical grounds that a group of test items measures essentially the same aspect of mathematics, one would be dismayed to learn that student performance on one of these items was quite unrelated to performance on the other items. It follows that theoretical accounts should be validated against hard data. There are a number of ways to do this, most obviously using structural equation modelling (SEM) or confirmatory factor analysis (a relevant special case of SEM). The essential task is to force the evidence from student performance on different tasks into the conceptual structure, then to see if the fit is acceptable.
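
As a concrete illustration, the Python sketch below checks a hypothesised item grouping against simulated item-level data. It is not a full confirmatory factor analysis - a dedicated SEM package would be used for that - and simply compares average within-group and between-group item correlations; all function names, variable names and numbers are illustrative assumptions, not part of the MARS framework.

    # Crude check of a hypothesised item grouping against item-level data.
    # A full confirmatory factor analysis would test model fit formally; this
    # sketch only compares within-group and between-group item correlations.
    import numpy as np

    def grouping_support(scores, groups):
        """scores: (n_students, n_items) array of item scores.
        groups: dict mapping construct name -> list of item column indices."""
        corr = np.corrcoef(scores, rowvar=False)
        all_items = {i for idx in groups.values() for i in idx}
        within, between = [], []
        for idx in groups.values():
            for a in idx:
                within.extend(corr[a, b] for b in idx if b > a)
                between.extend(corr[a, b] for b in all_items - set(idx))
        return np.mean(within), np.mean(between)

    # Simulated data: 200 students, 6 items, two hypothesised constructs.
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(200, 2))                  # latent 'algebra' and 'geometry'
    loadings = np.array([[1., 1., 1., 0., 0., 0.],
                         [0., 0., 0., 1., 1., 1.]])
    scores = ability @ loadings + rng.normal(scale=0.8, size=(200, 6))
    w, b = grouping_support(scores, {'algebra': [0, 1, 2], 'geometry': [3, 4, 5]})
    print(f"within-group r = {w:.2f}, between-group r = {b:.2f}")

If the within-group correlations were no higher than the between-group correlations, the hypothesised grouping would receive no support from the data, and the conceptual framework would need to be revisited.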

4.3 Test Validity and Test Calibration

Validity is a complex topic which encompasses a number of dimensions (Messick, 1995; Ridgway and Passey, 1993). The focus here is on construct validity. A core problem arises from the fact that any test samples student performance from a broad spectrum of possible performances. The construct validity derives directly from the tasks used. In areas where the domain definition is rich, this is particularly problematic, because sampling is likely to be sparse.

One strategy is to design a test with a narrow construct validity - for example by focussing on mathematical technique - and then to use this narrowly defined test as a proxy for the whole domain, with an appeal to 'general mathematical ability'. One might even produce evidence that technical skills are strongly correlated with conceptual skills, in education systems unaffected by narrow testing regimes. Introducing proxy or partial measures is likely to lead students and teachers to focus on developing a narrow range of skills, rather than developing skills across the whole domain. This is likely to lead to changes in the predictive validity of the test, because some students will have developed good conceptual skills, and other students rather poor ones, depending on the extent to which teachers have chosen to teach to the test.

Another way to approach the problem is to design the tests given in successive years in such a way as to ensure a great similarity in test content. The extreme version of this is to use exactly the same tests in successive years. A related method is to define a narrow range of task types, and to create equivalent tasks by changing either the surface features of the task ('in a pole vault competition'… becomes 'in a high jump competition'…) or to change the numbers (3 builders use 12 bricks to… becomes 4 builders use 12 bricks to…). This might be appropriate if the intended construct validity is narrowly defined. An example might be a test of mathematical technique in number work designed to assess student performance after a module designed to promote good technique. The approach of parallel tests or equivalent items is deeply problematic if the intended construct validity is rich, and involves the assessment of some process skills, such as problem solving, as is commonly the case in educational assessment.

The notion of task (or item) equivalence or parallel test forms might be useful in settings where students take just one form exactly once. When students (and their teachers) are re-exposed to a task designed to assess problem solving, the task is likely to fail as a measure of problem solving simply by virtue of the fact of prior exposure. Repeated exposure changes a novel task into an exercise. Once the task becomes known in the educational community, it will be practised in class (which may well have a positive educational impact in that it might develop some aspects of problem solving behaviour) and so loses its ability to assess the fluid deployment of problem solving skills when used as part of a test. Paradoxically, the more interesting and memorable the item, the less suited it will be for repeated use in assessment (and the more useful it will be when used as a curriculum item). It follows that repeated use of the same test, and the use of parallel forms of a test, is likely to change the construct validity of a test. Therefore, if one wishes to create tests whose construct validity matches some general educational goals (such as assessing mathematical attainment), then tests need to be constructed from items sampled across the domain in a principled way; each year, new tests must be designed. If this can be done, then users can only prepare for the test by deepening their knowledge across the whole domain, not just a selective part of it.

Using new tasks each year has some benefits in terms of educational validity, because each year, tests can be exposed to public scrutiny. This openness to the educational community means that teachers and students can understand better the demands of the system, and can respond to relative success or failure in adaptive ways, notably by working in class in ways likely to improve performance on the range of tasks presented. The converse approach of having 'secure' tests encourages a conception of 'ability' which is not grounded in classroom activities, and suggests no obvious course of action in the face of poor performance.

The use of new tests each year helps solve the problem of justifying the construct validity of tests used to sample a broad domain, but poses two major problems: first that the construct validity needs to be reassessed each year; and second that it becomes difficult to make judgements about the relative difficulties of tests given in successive years, and therefore to judge the effectiveness of the educational system in terms of improving student performance year by year.

It follows that the calibration of tests must be taken seriously.

5. Anchoring

Anchoring is a process designed to allow scores from different tests to be compared with each other. There are two dominant approaches to anchoring: one relies on expert judgement (the Angoff procedure; Jaeger's Method); the other on statistical moderation (e.g. Rasch scaling). More sophisticated approaches use a combination of the two (e.g. Bookmarking; Ebel's Method). Here, we focus mainly on the ways that statistical moderation might help the awarding process.

The notion of an anchor test (or sometimes even a single task) is straightforward. If one wishes to calibrate two different tests given on different occasions, or to different groups of students on the same occasion, one might use an 'anchor' test which is taken by all students. The relative difficulties of the two tests can be compared by judging their difficulty relative to the anchor. Anchors can take a variety of forms: a single item might be included in both tests; a small anchor test might be administered alongside both the target tests, or a test of some general cognitive ability might be given on both occasions.

Consider the situation where a novel mathematics test is set each year. A problem arises because individual tasks will differ in difficulty level if taken by the same set of students. It would be unfair to set grades using raw scores alone, because of these differences. Suppose that tests are created anew each year, based on the same test blueprint. Suppose that the raw scores of all students are obtained together with a measure of IQ for two adjacent cohorts. One would expect that (other things being equal) students with the same IQ would receive identical scores on the two test administrations, and so any observed differences can be accounted for in terms of the differences in task difficulty, not student attainment. It follows that the scores on the second test can be adjusted in the light of the scores on the first test, and in the light of the relative IQs of students in both samples. This anchoring method is directly analogous to norm referencing, using a covariate (IQ) to adjust the norms.
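
The Python sketch below illustrates one simple way such a covariate-based adjustment might be implemented; the linear regression form, the variable names and the simulated numbers are illustrative assumptions rather than a prescribed procedure.

    # A minimal sketch of anchoring via a covariate (here labelled 'iq'),
    # assuming two cohorts sit different tests built from the same blueprint.
    # Students with the same anchor score are expected to obtain the same test
    # score; any systematic gap is read as a shift in test difficulty and is
    # removed from the second cohort's scores.
    import numpy as np

    def anchor_adjust(score1, iq1, score2, iq2):
        """Re-express cohort-2 scores on the cohort-1 scale."""
        slope, intercept = np.polyfit(iq1, score1, 1)    # expected score at a given IQ
        expected2 = intercept + slope * np.asarray(iq2)
        shift = np.mean(score2) - np.mean(expected2)     # attributed to task difficulty
        return np.asarray(score2) - shift

    rng = np.random.default_rng(1)
    iq1 = rng.normal(100, 15, 500); score1 = 0.5 * iq1 + rng.normal(0, 5, 500)
    iq2 = rng.normal(100, 15, 500); score2 = 0.5 * iq2 - 6 + rng.normal(0, 5, 500)  # harder test
    adjusted2 = anchor_adjust(score1, iq1, score2, iq2)
    print(round(np.mean(score2), 1), round(np.mean(adjusted2), 1))

In this simulation the second test is six points harder for students of the same IQ, and the adjustment restores the second cohort's scores to the scale of the first test.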

The method depends critically on three assumptions:

  • first, that the two tests have the same construct validity (this might be achieved, for example, by using the same test blueprint, and comparing the resulting psychometric structure of each test);
  • second that the anchor test is plausibly related to this construct validity (if IQ were used, there would be an appeal to some generalised cognitive functioning);
  • third that no improvements have occurred in the education system which lead to genuine improvements in student performance.

This use of anchoring is plausible when applied to the situation described above, and is increasingly implausible as the usage drifts further and further from the model described. Consider two examples which violate the assumptions above. First, suppose we wish to compare the relative prowess of athletes who compete in different events. In particular, we want to compare performance in the 400 metres sprint and in shot putting (as might be the case in the decathlon). We use performance in the 100 metres sprint as an anchor. Let us assume (plausibly) that people who excel in sprinting 400 metres also excel in sprinting 100 metres, and that there is little relation between performance on the shot put and in sprinting. If we use the 100 metres as an anchor, we will find that the median performer in the 400 metres is given far more credit than the median performer in the shot put, because the scores on the anchor test are far higher for the sprinters. In this example, the procedure is invalid because the two performances being compared are not in the same domain, and because the construct validity of the anchor test is aligned with just one of the behaviours being tested. This situation will apply in educational settings where the construct validity of the tests being anchored is different; it will also apply to situations where the anchor test is not aligned with one or both of the tests being anchored.
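
Reusing the anchor_adjust sketch from above, the short simulation below makes the athletics point numerically: an anchor aligned with only one of the two performances marks down the other group wholesale. The numbers are arbitrary and purely illustrative.

    # Continues the earlier sketch (anchor_adjust and rng already defined).
    # The anchor (100 m form, scaled so higher = better) tracks 400 m running
    # but is unrelated to shot-put prowess, so equating shot-put scores onto
    # the 400 m scale via this anchor penalises the throwers as a group.
    sprint_form = rng.normal(0, 1, 300)
    anchor_sprinters = sprint_form + rng.normal(0, 0.3, 300)   # sprinters' 100 m
    scores_400m = sprint_form + rng.normal(0, 0.3, 300)        # their 400 m 'test'
    throw_strength = rng.normal(0, 1, 300)
    anchor_throwers = rng.normal(-1, 1, 300)                   # throwers run a slower 100 m
    scores_shot = throw_strength + rng.normal(0, 0.3, 300)     # their shot-put 'test'
    equated = anchor_adjust(scores_400m, anchor_sprinters, scores_shot, anchor_throwers)
    print(round(np.mean(scores_shot), 2), round(np.mean(equated), 2))  # throwers lose about 1 SD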

For the second example, we return to the use of IQ to anchor mathematics tests. IQ measures are designed to be relatively stable over time. Suppose that the tests in adjacent years are 'parallel' in the sense of having identical construct validity and identical score distributions. Suppose a new curriculum is introduced which produces large gains in mathematical performance, reflected in much higher scores for the second cohort of students than the first. The anchoring model will assign the same grade distribution to both sets of students, and will not reflect or reward the genuine gains which have been made. This second example shows that the simple anchoring model is insensitive to real improvements in teaching and learning; it attributes all such gains to changes in task difficulty.
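
Continuing the same sketch, a genuine gain in the second cohort is indistinguishable, to this model, from a drop in test difficulty, so the adjustment simply removes it; again, the numbers are illustrative.

    # Continues the anchor_adjust sketch (definitions above still in scope).
    # A real six-point gain in cohort 2, e.g. from a new curriculum, is
    # stripped out because the model attributes it to an easier test.
    iq2 = rng.normal(100, 15, 500)
    score2_improved = 0.5 * iq2 + 6 + rng.normal(0, 5, 500)      # genuine learning gain
    adjusted = anchor_adjust(score1, iq1, score2_improved, iq2)
    print(round(np.mean(score2_improved) - np.mean(score1), 1))  # raw gain, about +6
    print(round(np.mean(adjusted) - np.mean(score1), 1))         # about 0 after anchoring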

An alternative approach, in which the anchor itself is used to make judgements about learning gains, can be problematic for two sorts of reasons. First, answers to the key question about rising or falling standards are provided by a surrogate measure - and if one were ill-advised enough to use IQ, by a measure that has been designed to be resistant to educational effects. Second, in the case of domain-relevant, short anchors, changes are judged against what is likely to be the least reliable measure to hand, because of its brevity.