08/11/01

The Influence of Test Title on Performance in a Test of Developed Abilities in Mathematics

William Hodgson

Paper presented to the British Educational Research Association Annual Conference, University of Leeds, 13-15 September 2001

Abstract

The effect of different test titles (mathematics versus problem solving, and hard versus easy) on performance, and on gender differences in performance, was investigated in a randomised field experiment. The sample consisted of 4600 students, aged 16 to 18, studying for the GCE A level examination in the United Kingdom. Test title did appear to affect performance. Students generally achieved higher scores when the test was described as hard rather than easy. The effect of test title differed with gender and student ability. More able female students scored higher when the test was described as hard and problem solving, but not when the test was described as hard and mathematics. These more able females actually achieved the same score on the test described as hard and problem solving as that achieved by males on the test described as easy and problem solving, a gender difference of zero.

A Design of the Experiment

The instrument used to measure developed ability in mathematics was the ‘International Test of Developed Ability in Mathematics’ (ITDAM). The ITDAM test consisted of 35 questions, and 25 minutes were allowed for its completion. All of the questions were multiple choice, and many were similar in presentation, style, and content to some GCE A-level mathematics questions. The ITDAM test is considered in more detail in Section F.

The design of the experiment involved the modification of the title and description of the ITDAM test to give four different versions. The administration and content of the test remained unchanged. The front page of the test was modified to give the impression that the test was, respectively, mathematical and hard (MH), mathematical and easy (ME), problem solving and hard (PH), and problem solving and easy (PE). For example, the rubric was amended to that shown below for the test described as mathematical and hard:

Test Description

TEST OF DEVELOPED ABILITY

Mathematical Test

The questions are meant to be quite HARD, but you are not expected to finish the test. Please read the questions carefully.

The words mathematical and problem solving were exchanged and the words hard and easy were exchanged to give the four different versions of the test.
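The two-by-two structure of the design can be sketched in a few lines of Python. This is an illustration only: the function and variable names are hypothetical, and the wording of the three versions other than MH is assumed to be a simple word swap, as the description above implies.

```python
from itertools import product

# The two title dimensions described above.
CONTENT = {"M": "Mathematical", "P": "Problem Solving"}
DIFFICULTY = {"H": "HARD", "E": "EASY"}


def front_page(content_key: str, difficulty_key: str) -> str:
    """Build the front-page text for one version.

    Only the MH wording is quoted in the paper; the other three
    versions are assumed to differ by word substitution alone.
    """
    return (
        "TEST OF DEVELOPED ABILITY\n"
        f"{CONTENT[content_key]} Test\n"
        f"The questions are meant to be quite {DIFFICULTY[difficulty_key]}, "
        "but you are not expected to finish the test. "
        "Please read the questions carefully."
    )


# Crossing the two dimensions gives the four versions MH, ME, PH and PE.
versions = {c + d: front_page(c, d) for c, d in product(CONTENT, DIFFICULTY)}

for code, text in versions.items():
    # Print the version code and the title line of its front page.
    print(code, "->", text.splitlines()[1])
```

Because the two dimensions are fully crossed, each can later be treated as a separate factor in the analysis of variance, along with their interaction.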

It was anticipated that describing the tests as mathematical and hard would depress achievement of both sexes, but more so for females, and hence that the gender difference would increase. It was also expected that this description might suppress the use of iterative/estimation approaches to solving the problems. This was based on the assumption that students might assume that such processes would be inappropriate to a ‘hard test of mathematics’. Iterative/estimation techniques might be more likely when the test was described as problem solving.

The experiment took place in 1993. Students were administered one of the four versions of the same test as part of a process of data collection for the A Level Information System.

B Sample

The following analysis involved about 5000 students who sat A-level examinations in 1993. After listwise deletion of cases with missing values, the sample size was reduced to 4606. The four alternative versions of the test were randomly administered to the sample. The numbers of students taking each version of the test are shown in Table 1a.

Table 1a: Sample Composition

Title Key Words / Sample
Maths/Hard / 1148
Maths/Easy / 1290
Problem Solving/Hard / 1024
Problem Solving/Easy / 1144

C Results

Table 2a summarises the scores achieved on each of the four versions of the test. The overall average score on the test was 16.1 with scores varying between 15.8 and 16.5 for the different versions of the test. Males achieved higher scores than females on average. The effect of different titles on performance on the ITDAM test was examined by carrying out an ANOVA analysis on test performance, with sex and title of the test as main effects. The results of the experiment are presented in Table 2b.

Table 2a: Results on the ITDAM by Test Title and Sex

Title Key Words / Population / Males / Females
Maths/Hard / 16.1 / 17.8 / 14.5
Maths/Easy / 16.0 / 17.8 / 14.6
Problem Solving/Easy / 15.8 / 17.6 / 14.3
Problem Solving/Hard / 16.5 / 18.1 / 15.1

Table 2b: Test Performance by Title and Sex

Source of Variation / Sum of Squares / DF / Mean Square / F / p
Hard-Easy (H) / 79.27 / 1 / 79.27 / 2.57 / 0.11
Math-Prob (I) / 9.69 / 1 / 9.69 / 0.31 / 0.56
Sex (S) / 11450 / 1 / 11450 / 370.75 / <0.01
Interaction (H*I) / 126.90 / 1 / 126.90 / 4.10 / 0.04
Interaction (H*S) / 1.89 / 1 / 1.89 / 0.06 / 0.81
Interaction (I*S) / 0.92 / 1 / 0.92 / 0.03 / 0.86
Interaction (H*I*S) / 16.60 / 1 / 16.60 / 0.54 / 0.46
Error / 141998 / 4598 / 30.88
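The arithmetic behind Table 2b can be checked with a short sketch. It assumes the usual ANOVA conventions (mean square = sum of squares / degrees of freedom, and F = effect mean square / error mean square); the variable names are hypothetical, and the figures are copied from the table. Small differences in the last digit are to be expected because the published values are rounded.

```python
# Sums of squares and degrees of freedom copied from Table 2b.
effects = {
    "H": (79.27, 1),
    "I": (9.69, 1),
    "S": (11450.0, 1),
    "H*I": (126.90, 1),
    "H*S": (1.89, 1),
    "I*S": (0.92, 1),
    "H*I*S": (16.60, 1),
}
error_ss, error_df = 141998.0, 4598


def mean_square(sum_of_squares: float, df: int) -> float:
    """Mean square is the sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


ms_error = mean_square(error_ss, error_df)

# Each F ratio compares an effect's mean square with the error mean square.
f_ratios = {
    name: mean_square(ss, df) / ms_error for name, (ss, df) in effects.items()
}

for name, f_value in f_ratios.items():
    print(f"{name:6s} F = {f_value:.2f}")
```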

There was no three-way interaction effect between sex and the two title dimensions on the front of the test, but both the effect of sex and the interaction of the test title dimensions were significant (p<0.01 and p=0.04 respectively). Males achieved higher scores, on average, than females (17.8 and 14.6 respectively, corresponding to an effect size of -0.58 and a raw score difference of 3.2 or about 20%). The magnitude of the gender difference was similar to that reported previously for the ITDAM test (see Hodgson, 1995). Figure 1, below, shows the scores on each of the four tests for the sample as a whole.

Figure 1: Scores on the ITDAM Test by Title


Both sexes scored highest on the version of the ITDAM test that was described as testing problem solving and being hard. Performance on the other three versions of the test was about half a mark lower than on the hard-problem solving test, equivalent to an effect size of about 0.1, a small but significant difference, corresponding to about 20% of the gender difference.
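The effect sizes quoted in this section can be reproduced approximately from the reported figures. The sketch below assumes that an effect size here is a difference in means divided by a standard deviation, and that the standard deviation is estimated by the square root of the error mean square in Table 2b. The paper does not state the formula used, so this is an assumption, and the variable names are hypothetical.

```python
import math

# Figures taken from the text and from Table 2b.
ms_error = 30.88                      # error mean square, Table 2b
male_mean, female_mean = 17.8, 14.6   # mean scores quoted in the text
overall_mean = 16.1                   # overall average score

# Assumed standardiser: the within-cell standard deviation.
sd = math.sqrt(ms_error)

# Gender difference (females minus males) in raw and standardised form.
gender_gap = female_mean - male_mean
gender_effect_size = gender_gap / sd
gap_as_percent_of_mean = 100 * abs(gender_gap) / overall_mean

# Title effect: "about half a mark" between the hard-problem solving
# version and the other three versions.
title_gap = 0.5
title_effect_size = title_gap / sd

print(f"SD estimate: {sd:.2f}")
print(f"Gender effect size: {gender_effect_size:.2f}")
print(f"Gender gap as % of overall mean: {gap_as_percent_of_mean:.0f}%")
print(f"Title effect size: {title_effect_size:.1f}")
```

Under these assumptions the gender effect size rounds to -0.58 and the title effect size to 0.1, in line with the values given in the text.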

Student Ability and Test Performance

The results of the ITDAM test were used to band students into three categories: less able (those scoring less than 13), intermediate ability (those scoring 13-18) and more able (those scoring 19 or more). The effect of the two test title dimensions was then considered by ability level. The rationale for this was to consider whether the effect of test title might differ for students with different levels of developed mathematical ability. Table 3a gives a breakdown of the sample composition and Table 3b summarises the scores achieved by each of the three ability groups.
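The banding rule can be written as a small function. This is a minimal sketch using the cut-offs stated above; the function name and the example scores are hypothetical and are not data from the study.

```python
from collections import Counter


def ability_band(score: int) -> str:
    """Assign an ITDAM score (0 to 35) to one of the three ability bands."""
    if score < 13:
        return "less able"
    if score <= 18:
        return "intermediate"
    return "more able"


# Hypothetical scores, used only to illustrate the cut-offs.
example_scores = [5, 12, 13, 16, 18, 19, 27, 35]
band_counts = Counter(ability_band(s) for s in example_scores)

for band, count in band_counts.items():
    print(band, count)
```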

Table 3a: Sample Composition by Ability

Title Key Words / Less Able / Middle 30% / More Able* / Totals
Maths/Hard / 331 / 450 / 367 (248/119) / 1148
Maths/Easy / 380 / 511 / 399 (248/151) / 1290
Problem Solving/Easy / 304 / 428 / 292 (191/101) / 1024
Problem Solving/Hard / 293 / 475 / 376 (242/134) / 1144
Totals / 1308 / 1864 / 1434 / 4606

* The numbers in brackets refer to the more able males and females respectively.

Table 3b: Results on the ITDAM by Test Title and Ability

Title Key Words / Less Able / Middle 30% / More Able*
Maths/Hard / 9.5 / 15.4 / 23.0 (23.4/22.1)
Maths/Easy / 9.6 / 15.4 / 22.9 (23.2/22.5)
Problem Solving/Easy / 9.7 / 15.4 / 22.6 (22.9/21.9)
Problem Solving/Hard / 9.6 / 15.4 / 23.3 (23.5/22.9)

* The numbers in brackets refer to the more able males and females respectively.

The effect of the test title was more pronounced for the most able students, corresponding to about 30% of the sample, and test title made essentially no difference to the performance of the other two groups of students.

Table 3c: Test Performance by Title, Sex and Ability Level

Source of Variation / Sum of Squares / DF / Mean Square / F / p
Hard-Easy (H) / 8.99 / 1 / 8.99 / 1.41 / 0.24
Math-Prob (I) / 3.61 / 1 / 3.61 / 0.57 / 0.45
Sex (S) / 196.96 / 1 / 196.96 / 30.85 / <0.01
Ability Band (G) / 104910 / 2 / 52455 / 8215 / <0.01
Interaction (H*I) / 18.53 / 1 / 18.53 / 2.90 / 0.09
Interaction (H*S) / 19.64 / 1 / 19.64 / 3.08 / 0.08
Interaction (H*G) / 26.40 / 2 / 13.20 / 2.07 / 0.13
Interaction (I*S) / 1.87 / 1 / 1.87 / 0.29 / 0.59
Interaction (I*G) / 2.11 / 2 / 1.06 / 0.17 / 0.85
Interaction (S*G) / 95.53 / 2 / 47.77 / 7.48 / <0.01
Interaction (H*I*S) / 3.74 / 1 / 3.74 / 0.59 / 0.44
Interaction (H*I*G) / 37.84 / 2 / 18.92 / 2.96 / 0.05
Interaction (H*S*G) / 8.86 / 2 / 4.43 / 0.69 / 0.50
Interaction (I*S*G) / 10.78 / 2 / 5.39 / 0.84 / 0.43
Interaction (H*I*S*G) / 17.26 / 2 / 8.63 / 1.35 / 0.26

More detailed consideration of the results for the most able 30% of students showed that performance was greatest on the hard-problem solving test (23.3) and least on the easy-problem solving test (22.6) (see Table 3b). A four-way analysis of variance, comprising the two test title dimensions, sex and ability level, was carried out and these results are given in Table 3c. There was no four-way interaction effect. The three-way interaction effect between the two test title dimensions and ability was significant (p=0.05).

The final analysis considered gender differences in test performance amongst the most able 30% of students. Test results for this group of students, broken down by sex, were also presented in Table 3b. Inspection of these results suggested that for the two test title dimensions there was a main effect for ‘hard’ for males, but for females there was an interaction effect between the hard-easy dimension and the maths-problem solving dimension.

Table 4a gives the results of a three-way ANOVA, with the two test title dimensions and sex as factors, for only the most able 30% of students. The interaction effect between the two test title dimensions was significant (p=0.04), but the interaction with sex was not (p=0.23). Table 4b gives, for the same group of students, the results of two separate ANOVAs for males and females respectively. For males there was a main effect for the hard-easy dimension that approached, but did not reach, significance at the 5% level (p=0.08), and for females the interaction effect between the two test title dimensions was significant (p=0.03).

Table 4a: Test Performance for the Most Able 30%

Source of Variation / Sum of Squares / DF / Mean Square / F / p
Hard-Easy (H) / 34.19 / 1 / 34.19 / 2.75 / 0.09
Math-Prob (I) / 0.04 / 1 / 0.04 / 0.003 / 0.96
Sex (S) / 250.76 / 1 / 250.76 / 20.14 / <0.01
Interaction (H*I) / 53.87 / 1 / 53.87 / 4.33 / 0.04
Interaction (H*S) / 3.14 / 1 / 3.14 / 0.25 / 0.61
Interaction (I*S) / 3.04 / 1 / 3.04 / 0.24 / 0.62
Interaction (H*I*S) / 18.27 / 1 / 18.27 / 1.47 / 0.22
Error / 17753 / 1426 / 12.45

Table 4b: Test Performance by Title and Sex

Males / Sum of Squares / DF / Mean Square / F / P
Hard-Easy (H) / 41.47 / 1 / 41.47 / 3.11 / 0.08
Math-Prob (I) / 1.70 / 1 / 1.70 / 0.13 / 0.72
Interaction (H*I) / 6.72 / 1 / 6.72 / 0.51 / 0.48
Error / 12309 / 925 / 13.31
Females / Sum of Squares / DF / Mean Square / F / P
Hard-Easy (H) / 6.40 / 1 / 6.40 / 0.59 / 0.44
Math-Prob (I) / 1.45 / 1 / 1.45 / 0.13 / 0.72
Interaction (H*I) / 51.88 / 1 / 51.88 / 4.77 / 0.03
Error / 5444 / 501 / 10.87

Figures 2 and 3 illustrate the gender difference just described for the most able 30% of the sample as raw scores and effect sizes respectively. The effect sizes are relative to the mean score of the most able group of students (23.0) and use 5.0 as a conservative estimate of the standard deviation.
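The effect sizes described above can be recomputed from the bracketed male and female means in Table 3b. The sketch below applies the stated reference mean (23.0) and standard deviation (5.0); the names are hypothetical, and the output is an approximation based on the rounded table values rather than a reproduction of Figure 3.

```python
REFERENCE_MEAN = 23.0   # mean score of the most able group, as stated
ASSUMED_SD = 5.0        # the conservative standard deviation, as stated

# (male mean, female mean) for the more able students, from Table 3b.
more_able_means = {
    "Maths/Hard": (23.4, 22.1),
    "Maths/Easy": (23.2, 22.5),
    "Problem Solving/Easy": (22.9, 21.9),
    "Problem Solving/Hard": (23.5, 22.9),
}


def effect_size(mean: float) -> float:
    """Standardised distance of a group mean from the reference mean."""
    return (mean - REFERENCE_MEAN) / ASSUMED_SD


# Round to two decimals, matching the precision of the source means.
effect_sizes = {
    title: (round(effect_size(males), 2), round(effect_size(females), 2))
    for title, (males, females) in more_able_means.items()
}

for title, (males, females) in effect_sizes.items():
    print(f"{title:22s} males {males:+.2f}  females {females:+.2f}")
```

On these figures, males on the problem solving-easy version and females on the problem solving-hard version have the same effect size, since both groups averaged 22.9.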

If the main effect for the hard-easy dimension, seen for males, is taken as the general pattern, then the anomalous result would appear to be the performance of females on the test described as hard and mathematics. Describing a test as ‘hard’ might encourage able students to ‘give of their best’; if so, this effect was lost for females when the test was described as mathematical.

From Figure 2 it can be seen that the scores achieved by males on the test described as ‘problem solving-easy’ were the same as the scores achieved by females on the test described as ‘problem solving-hard’. It would appear that the difference of one word has had an effect of similar magnitude to the gender effect. The use of the word hard linked to problem solving appeared to increase the performance of the more able females by about 1 point. However, when problem solving-hard is changed to mathematics-hard, performance drops back.


Figure 2: ITDAM Test Score by Title and Gender for More Able Students


Figure 3: Effect Size by Title and Gender for More Able Students

D Conclusions

These results suggest that the performance of males and females is susceptible to factors such as test presentation. In particular, the mere use of words such as hard and easy, or mathematics and problem solving, might influence test performance.

The main conclusions from this experiment are:

  1. That for some groups of students performance on the test of developed ability in mathematics varied with the words used to describe the test. The effect differed with gender and student ability.
  2. That the students performed slightly better when the test was described as being ‘hard’ as opposed to being ‘easy’. The magnitude of the effect was of the order of 0.7 points or 4% between the tests described as ‘problem solving easy’ and ‘problem solving hard’.
  3. For both male and female students the impact of test title was greater for the more able students.
  4. For these more able students the overall magnitude of the effect of test title was greater for female students than for male students. The magnitude of the effect was about 0.6 points, or 3%, for males and about 1 point or 5% for females.
  5. There appeared to be a general effect whereby students achieved higher scores when the test was described as hard as opposed to easy. This was not the case for the more able female students. Females showed the same general trend except that performance appeared to be depressed if the test was described as hard and mathematics.

E Discussion

The effect of test title could have been greater for the more able students because the test was relatively difficult. These more able students would be better equipped to tackle the questions and the test title might influence the students’ motivation. However, weaker students might not have the necessary skills and hence test title might not have such a marked effect on performance in the test for these students.

It may be that use of the word ‘hard’ rather than ‘easy’ had a general motivating effect. If so, this appeared to be equivalent to about 1 point for the more able females. The positive effect of describing the test as hard was almost negated for the females when the test was described as mathematics rather than problem solving. This suggests that the term mathematics had a de-motivating effect, on the females in particular.

Educationally this finding might be important; however, it should be stressed that the magnitude of the effect was relatively small. Nonetheless, if mind-set can affect performance on a 25-minute test, what might be its cumulative effect over years of study? The ways in which teachers present and describe mathematical problems, including the language they use, might have a profound effect on the performance of students on such problems.

The findings of this experiment would suggest that outcomes on any ability test must be considered in relation to the students’ perception of the test. In particular, interpretation of the results on tests of (developed) ability should consider the possible effect of the test itself on student achievement behaviours. This might be of particular importance for tests of ‘developed mathematical ability’ given to older students, such as the ITDAM test.

References

Hodgson, W. (1994) Gender Differences in Mathematics and Science: A Study of GCE Advanced Level Examinations in the United Kingdom. PhD thesis, School of Education, University of Newcastle upon Tyne.
