Trends (1995-2000) in the TIMSS Mathematics

Performance Assessment in the Netherlands

Pauline Vos and Wilmad Kuiper

University of Twente, Faculty of Educational Science and Technology, Netherlands

Paper presented at ECER 2001, 5-8 September 2001, Lille, France

Abstract

In TIMSS-95, participating countries could administer the TIMSS Performance Assessment as an addition to the standard written test. This innovative test consisted of practical, investigative tasks. Dutch curriculum experts considered this test to fit well with the Dutch mathematics curriculum, which is based on Realistic Mathematics Education (RME). This new curriculum stressed applicability, skills and coherence.

But Dutch students did not score as expected on the practical test. Instead of being at top level (as they were on the theoretical written test in TIMSS), they scored near the international average. This result gave rise to the assumption that teachers were still partly following the old curriculum, which had a more theoretical basis than the new one. Probably teachers were still inexperienced with the new curriculum. It was assumed that students would do better after a few years, once the new curriculum had gained impact.

In 2000, the TIMSS Performance Assessment was repeated in the Netherlands. The project included data collection at the levels of the intended, implemented and attained curriculum. Trend data (1995–2000) at the intended curriculum level show a continued high approval of the test by curriculum experts. Nevertheless, Dutch students’ achievements on the TIMSS Performance Assessment show little gain. Their score on the mathematics tasks did not improve.

What remains to be noted is teachers' increased acceptance of the test. In 2000, more mathematics teachers than in 1995 stated that they would include tasks from the Performance Assessment in their own testing.

Introduction

TIMSS (Third International Mathematics and Science Study) is a large-scale, international comparative study of mathematics and science education. It is conducted under the auspices of the IEA (International Association for the Evaluation of Educational Achievement). TIMSS aims to provide nations with comparative data to gain insight into the achievements of their educational systems.

In 1995, the Netherlands participated in TIMSS with its grade 8 students, although mathematics curriculum experts objected to the nature of the written mathematics test. This test was considered too traditional and not well compatible with the Dutch mathematics curriculum. Only approximately 70% of the TIMSS items were covered by the Dutch intended curriculum, with the remaining 30% being considered inappropriate. These figures contrast with the coverage by the intended curricula of other countries: many countries reported coverage of more than 90% of the TIMSS items by their curriculum, and a few (e.g. the USA and Hungary) even reported a 100% match of all TIMSS items with their curriculum (Bos & Vos, 2000; Vos & Bos, 2001).

These figures give a first indication of how the Dutch mathematics curriculum differs from curricula in other countries. The Dutch mathematics curriculum is based on the principles of Realistic Mathematics Education (RME), introduced by Hans Freudenthal (Freudenthal, 1973). Apart from differences in content, the conceptual approach of this RME-based curriculum differs from the curricula of other countries. Dutch students learn mathematics starting from contexts. In RME, the mathematical activities are referred to as mathematising (Treffers, 1987). The contexts and the mathematics are connected through mathematising activities such as modelling and interpreting. Contexts can be taken from the physical world, but also from imagined reality (cartoons, games, etc.). Connected to this curriculum is a pedagogy of authentic learning (Roelofs, Franssen, et al., 1999).

The TIMSS Performance Assessment in 1995

Besides the standard written test in TIMSS, countries could also carry out an optional international Performance Assessment for grades 4 and 8. The practical, investigative tasks of this Performance Assessment were considered to complement the written test, with a stronger focus on practical skills and a weaker focus on knowledge reproduction. This test was developed from the vision in science education that seeks coherence between procedural, declarative and conditional knowledge. Practicals are no longer seen as mere illustrations of taught concepts; students are expected to investigate systematically, in contrast to cookbook practicals. Seeking explanations for the explored phenomena is then part of the assessment. As for mathematics, the Performance Assessment is associated with Gal’perin’s view of learning by doing, in which mental acts (manipulating objects in the mind) develop from material acts (manipulating tangible objects) (Harmon, Smith et al., 1997; Kind, 1999; Garden, 1999).

Table 1. Tasks in the TIMSS Performance Assessment.

Task / Investigation
M1 Dice / Identifying a pattern
M2 Calculator / Finding a number pattern
M3 Folding / Producing a given figure with just one cut
M4 Around the bend / Investigating furniture to pass through a corridor
M5 Packaging / Creating a box holding 4 table tennis balls
S1 Pulse / Measuring change in heartbeat when exercising
S2 Magnets / Identifying the stronger one of 2 magnets
S3 Batteries / Identifying weak and good batteries
S4 Rubberband / Extrapolation of stretching
S5 Solutions / Relation between temperature and dissolving
G1 Shadows / Investigating the size of the shadow of an object
G2 Plasticine / Creating lumps of a certain weight with an uncalibrated balance

The tasks of the Performance Assessment are listed in Table 1. There are 12 tasks: five mathematics tasks, five science tasks and two combined mathematics/science tasks. These tasks cover educational aims such as carrying out small investigations, measuring, representing data, finding patterns, and reasoning. Dutch curriculum experts, who were consulted in 1995, valued this additional test because it matched well with the Dutch intended curriculum. Starting from experiments and investigations as contexts, students would show the mathematising skills they had acquired through the new curriculum. Thus, the Netherlands participated in the Performance Assessment, but only for grade 8 (Harmon, Smith et al., 1997).

In 1995, 19 countries participated both in the written TIMSS test and in the Performance Assessment for grade 8. Table 2 shows the resulting rankings on both tests. Most countries reached a position on the practical Performance Assessment comparable to their position on the more theoretical written test. The Netherlands was a marked exception. Despite the fit of the TIMSS Performance Assessment with the Dutch intended curriculum, Dutch grade 8 students did not score as expected. Unlike on the written test, their achievement was at the level of the international average and not significantly above it.

Therefore, an explanation of Dutch students' achievements was needed. These achievements were in line with the previous, abandoned curriculum, which was more theoretical. Probably, mathematics teachers were still oriented towards this former curriculum. It was therefore argued that TIMSS in 1995 had come too early: the new RME-based curriculum had only been introduced in 1993, at some schools the textbooks had not yet been replaced, and teachers may not yet have had enough time to adapt their instruction to the new curriculum. Thus, if the TIMSS Performance Assessment could be replicated at a later stage, trend data could establish whether the new curriculum was starting to settle.

Table 2. Ranking of 19 countries in 1995 on two TIMSS mathematics tests, copied from Bos, Kuiper & Plomp (2001).

TIMSS Written Test (score on mathematics items) / TIMSS Performance Assessment (score on 7 mathematical tasks)
Rank / Country / Score (points) / Rank / Country / Avg % correct
1 / Singapore / 643 / 1 / Singapore / 70
2 / Czech Rep / 564 / 2 / Switzerland / 66
3 / Switzerland / 545 / 3 / Australia / 66
4 / Netherlands / 541 / 4 / Romania / 66
5 / Slovenia / 541 / 5 / Sweden / 65
6 / Australia / 530 / 6 / Norway / 65
7 / Canada / 527 / 7 / England / 64
8 / Sweden / 519 / 8 / Slovenia / 64
Intl average / 509 / 9 / Czech Rep / 62
9 / New Zealand / 508 / 10 / Canada / 62
10 / England / 506 / 11 / New Zealand / 62
11 / Norway / 503 / 12 / Netherlands / 62
12 / USA / 502 / 13 / Scotland / 61
13 / Scotland / 498 / Intl average / 59
14 / Spain / 487 / 14 / Iran / 54
15 / Romania / 482 / 15 / USA / 54
16 / Cyprus / 474 / 16 / Spain / 52
17 / Portugal / 454 / 17 / Portugal / 48
18 / Iran / 428 / 18 / Cyprus / 44
19 / Colombia / 385 / 19 / Colombia / 37

Design of the study

The research was carried out using IEA's conceptual curriculum framework, which distinguishes three manifestations of the curriculum, as stated by Robitaille, Schmidt, et al. (1993):

- the intended curriculum, at the system level

- the implemented curriculum, at the classroom level

- the attained curriculum, at the students' level.

The repeat of the Performance Assessment was planned for 2000. Unfortunately, no other countries were interested in participating. The study was guided by two research questions:

- To what extent will a repeat of the TIMSS Performance Assessment in grade 8 in 2000 result in the same students' mathematics achievements as in 1995?

- To what extent can trends in the achievements on the Performance Assessment be ascribed to the fit (or lack thereof) with the intended mathematics curriculum and the implemented mathematics curriculum (the instruction actually delivered in the classrooms)?

To operationalise the attained curriculum, students’ achievement on the TIMSS Performance Assessment was taken. The test was repeated by copying all protocols of 1995. Testing circumstances were kept identical, and the five science tasks were retained even though they were not needed for the analysis. Also, coding of students’ answers and data processing followed the 1995 procedures.

To operationalise the two other curriculum levels, two instruments were developed. For the intended curriculum, mathematics curriculum experts (curriculum developers, researchers) were asked to indicate for each test item whether its content was covered by the educational aims and objectives as stated in the official educational documents for grade 8. In TIMSS, this Test Curriculum Matching Analysis (TCMA) is carried out in all participating countries for the written test. For the Netherlands, a similar instrument was developed for the tasks of the Performance Assessment.

For the implemented curriculum, the mathematics teachers of the tested classes were asked to judge each item. The questions were twofold:

  1. Has the content of this item been taught to the class?
  2. Independent of the answer to question 1, assuming that you would set a Performance Assessment, would you include this item?

This instrument to measure Opportunity to Learn (OTL) was based on research by De Haan (1992).

Data reliability and comparability

Considering the reliability of results, coding students’ answers to open items proves to be an ambiguous task. As already pointed out in the international 1995 TIMSS Performance Assessment report (Beaton, Mullis et al., 1997), inter-scorer agreement can vary considerably. Especially the coding of borderline answers (which are neither totally correct nor totally incorrect) depends on the coders' background (e.g. coding experience, subject matter knowledge, teaching experience, etc.) (Zukovsky, 1999).

The comparability of testing circumstances proved problematic for some tasks in the Performance Assessment. Although the test instruments of 1995 were copied in 2000, there were minor changes in the laboratory equipment that students used. Two examples illustrate the effect:

- In the task “Shadows” a torch is used. The torch used in 1995 gave a blurred shadow, while the torch of 2000 gave a sharper edge to the shadow. The latter made students’ measurements easier, giving them more time for the remaining items in the task.

- In the task “Plasticine”, a delicate metal balance was used in 1995, while a handier plastic balance was used in 2000. Again, the latter made the task easier to carry out, giving students more time for reflective activities.

To eliminate unreliable and incomparable results, two checks were carried out. First, for each task, Cronbach's alpha was calculated for 1995 and 2000 separately; values higher than 0.6 were considered acceptable. Second, a chi-squared test was carried out to establish whether answer patterns differed between 1995 and 2000 as a result of these altered testing circumstances; here, values higher than 5 were considered acceptable. The results are shown in Table 3.
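The paper does not report how these checks were computed. Below is a minimal sketch (in Python, with simulated data) of how per-task Cronbach's alpha and a chi-squared comparison of the 1995 and 2000 score distributions could be obtained. Reading the χ²-column of Table 3 as a significance probability expressed as a percentage is an assumption made here; it would explain why values above 5 count as acceptable.

```python
# Hypothetical sketch of the two checks described above; not the authors' original code.
# Assumes per-task item-score matrices (rows = students, columns = items) for 1995 and 2000.
import numpy as np
from scipy.stats import chi2_contingency

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a students x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items in the task
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the task total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def comparability_p(scores_95: np.ndarray, scores_00: np.ndarray) -> float:
    """Chi-squared test on the distribution of task total scores in 1995 vs 2000.
    Returns the significance probability in percent (values > 5 read as 'comparable')."""
    totals_95 = scores_95.sum(axis=1).astype(int)
    totals_00 = scores_00.sum(axis=1).astype(int)
    levels = np.arange(0, max(totals_95.max(), totals_00.max()) + 1)
    table = np.array([[np.sum(totals_95 == s) for s in levels],
                      [np.sum(totals_00 == s) for s in levels]])
    table = table[:, table.sum(axis=0) > 0]     # drop score levels observed in neither year
    _, p, _, _ = chi2_contingency(table)
    return 100 * p

# Example with random 0/1 data for one task (e.g. "M1 Dice", 6 items):
rng = np.random.default_rng(0)
task_95 = rng.integers(0, 2, size=(437, 6))
task_00 = rng.integers(0, 2, size=(234, 6))
print(f"alpha 1995: {cronbach_alpha(task_95):.2f}, alpha 2000: {cronbach_alpha(task_00):.2f}")
print(f"comparability (p in %): {comparability_p(task_95, task_00):.0f}")
```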

Table 3. TIMSS Performance Assessment 2000, reliability (Cronbach's alpha) and comparability (χ²-test) of students' scores per task.

Task / Number of items / Alpha in 1995 / Alpha in 2000 / χ²-test on scores 1995-2000
M1 Dice / 6 / 0.30 / 0.64 / 77
M2 Calculator / 7 / 0.71 / 0.68 / 99
M3 Folding / 4 / 0.83 / 0.76 / 53
M4 Around the bend / 8 / 0.59 / 0.62 / 100
M5 Packaging / 3 / 0.61 / 0.65 / 28
S1 Pulse / 4 / 0.61 / 0.59 / 33
S2 Magnets / 2 / 0.65 / 0.36 / 0
S3 Batteries / 4 / 0.54 / 0.68 / 18
S4 Rubberband / 7 / 0.58 / 0.39 / 0
S5 Solutions / 7 / 0.63 / 0.63 / 57
G1 Shadows / 6 / 0.64 / 0.61 / 1
G2 Plasticine / 8 / 0.85 / 0.78 / 0

To avoid distortions of the trend measurement, the tasks with questionable results were omitted. All five mathematics tasks (consisting of 28 items) remained suitable for analysis.

Based on these five mathematics tasks, the initial Dutch concern about the disappointing test results diminished slightly. In particular, the results of the “Plasticine” task had been pulling down the overall score. Omitting the unreliable results raised the Dutch position to slightly above the international average, illustrating a lack of stability in the international results of 1995. Still, this did not alter the fact that the achievements remained in discrepancy with the intended curriculum.

Research results

For the Performance Assessment of 2000 in the Netherlands, a sample of n=234 students at 27 schools was realised. In 1995, n=437 students at 48 schools had been tested. At the implemented curriculum level, 20 mathematics teachers (a response rate of 74%) completed the OTL instruments. At the intended curriculum level, three independent mathematics curriculum experts, working at an institute for curriculum development, an institute for educational testing and measurement, and a national centre for school improvement, completed the TCMA instruments.

The achievement of Dutch students on the repeat of the Performance Assessment in 2000 is given in Table 4. For each task, the average percentage of correct scores on the items was calculated. Compared to 1995, the scores do not show any significant changes on the mathematics tasks. The average score of 66% correct on the five mathematics tasks in 1995 had become 67% in 2000 (this answers research question 1). For each task separately, the shifts were statistically insignificant.

Table 4. Performance Assessment 2000, average score percentage correct per task, comparison 1995-2000

Task / 1995 (n=437) / 2000 (n=234)
M1 Dice / 77 / 74
M2 Calculator / 62 / 59
M3 Folding / 73 / 77
M4 Around the bend / 68 / 69
M5 Packaging / 52 / 54
Subtotal mathematics / 66 / 67
S1 Pulse / 45 / 51
S3 Batteries / 64 / 70
S5 Solutions / 51 / 60*
Subtotal science (3 tasks) / 53 / 62*
Total (8 tasks) / 61 / 64*

* significant trend difference
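The paper does not state which significance test underlies the trend comparison. As a rough, hypothetical illustration, the sketch below applies a simple two-proportion z-test to the percentages in Table 4, using the 1995 and 2000 sample sizes and ignoring the clustered (school-based) sampling design and the item structure; the actual TIMSS analysis will have used more refined procedures.

```python
# Hypothetical illustration of a 1995-2000 trend check on the task percentages in Table 4.
# Simple two-proportion z-test; not the test actually used in the study.
from math import sqrt
from scipy.stats import norm

def trend_p(p95: float, p00: float, n95: int = 437, n00: int = 234) -> float:
    """Two-sided p-value for the difference between two percentages correct."""
    p1, p2 = p95 / 100, p00 / 100
    pooled = (p1 * n95 + p2 * n00) / (n95 + n00)
    se = sqrt(pooled * (1 - pooled) * (1 / n95 + 1 / n00))
    z = (p2 - p1) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Mathematics subtotal (66% -> 67%) versus the chemistry task "Solutions" (51% -> 60%):
print(f"Subtotal mathematics: p = {trend_p(66, 67):.2f}")   # large p: no significant shift
print(f"S5 Solutions: p = {trend_p(51, 60):.3f}")           # small p: significant improvement
```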

For comparison, the results for the three science tasks are also presented. Here, a statistically significant improvement is shown on the chemistry task "Solutions". Chemistry was introduced as a new topic in the Dutch core curriculum for junior secondary school in 1993, and this move appears to have had a considerable effect. The other two science tasks also show improvement in students’ achievement. Against this background, it is remarkable that there is no progress on the mathematics tasks.

As for the curricular appropriateness of the tests, curriculum experts stated that on average 83% of the items in the Performance Assessment were covered by the intended curriculum (confirming the 1995 results). As in 1995, the Performance Assessment items received a higher rate of approval than the TIMSS written test, which had a coverage of 71% by the intended curriculum.

These results confirm the data of the 1995 research and do not reveal a trend. Students’ achievements on the practical mathematics tasks have not changed, nor has the judgement of the curriculum experts.

Table 5. TIMSS Performance Assessment 1995-2000, mathematics teachers' average judgement on the items of the mathematics tasks M1 to M5 (percentage of items).

Judgement / 1995 (n=19) / 2000 (n=20)
Content covered / 39 / 63
Would include in a test / 51 / 79

Data on teacher judgement at the implemented curriculum level are given in Table 5. Given the small number of teachers, these data have limited value. Still, they show a notable trend. On the question of whether the content of the items had been covered in the lessons, the average coverage of the items of the five mathematics tasks rose from 39% in 1995 to 63% in 2000. Still, this figure is low when compared to the 83% coverage of the same test by the intended curriculum. Mathematics teachers’ inclination to incorporate tasks from the Performance Assessment into a test of their own making has also increased: in 1995, teachers accepted on average 51% of the items of the practical mathematics tasks; by 2000, this had increased to 79%. It therefore seems that the implemented curriculum is approaching the intended curriculum. However, this increased approval by mathematics teachers of the investigative tasks of the TIMSS Performance Assessment has not yet translated into higher student scores. The above data at the implemented curriculum level may reflect teachers' ideas rather than their actual classroom practice, which could mean that it takes more time for teachers' intentions to be put into practice (Fullan & Stiegelbauer, 1991).

The TIMSS Performance Assessment as exemplary practice

Although the mathematics curriculum for junior secondary schools in the Netherlands was completely overhauled in 1993, national assessment stuck with time-restricted, written tests. Only the content of the tests was adjusted to the new curriculum, with each test item being linked to a context. De Lange (1987) explained that alternative testing methods (interview, portfolio, observation, essay or take-home tasks) had been considered but were rejected because they could not readily be carried out in class. This situation illustrates the observation by Niss (1993) that assessment in mathematics education in many countries lags behind instructional and other reforms.

The TIMSS Performance Assessment proved to be an eye-opener to many Dutch mathematics teachers. During the testing sessions, they observed the tasks and how their students coped with them. Some teachers admitted that they had never thought mathematics could be tested through practical tasks. As such, the TIMSS Performance Assessment could prove to be part of the exemplary material that is needed to support curriculum reform (Van den Akker, 1998).

Conclusion

In the repeat of the TIMSS Performance Assessment in the Netherlands at grade 8 level, data were collected at the levels of the intended, implemented and attained curriculum. Trend data (1995–2000) at the intended curriculum level show a continued high approval of the test by curriculum experts. The test appears to match well with the aims of the innovative RME-based intended curriculum.