The ESRC Researcher Development Initiative: promise and pitfalls of pragmatic trials in education

Stephen Gorard and Carole Torgerson

Department of Educational Studies

University of York

Paper presented at the British Educational Research Association Annual Conference, University of Warwick, 6-9 September 2006

Introduction

Governments have become increasingly interested in the quality of evidence regarding the effectiveness of alternative practices, programmes and policies across the public arena (Blunkett 2000). The US Congress's Committee on Education and the Work Force, for example, has been 'concerned about the wide dissemination of flawed, untested educational initiatives that can be detrimental to children' (Boruch and Mosteller 2002, p.1). History suggests that this concern is not misplaced, since there are many examples of public policy interventions that have been widely disseminated on the grounds that they are driven by good intentions, seem plausible and are unlikely to do any harm, yet when rigorously evaluated have been found to be ineffective or positively harmful (Petrosino et al. 2000, Carlin et al. 2000). Ineffective interventions may also be expensive. For example, an evaluation (using a randomised controlled trial design) of a computer package commonly used in the USA to teach literacy skills found that it had no effect when compared with ordinary teaching (Rouse et al. 2004). The cost of this programme was estimated to be in the region of $92 million, which could have paid for many other interventions. A similar rigorous experiment in the UK found no effect of a software package on literacy learning (Brooks et al. 2006). In fact, of course, most public policy innovations are not rigorously tested at all.

The randomised controlled trial (RCT) is often considered a very valuable research design for assessing the effectiveness of interventions (Torgerson and Torgerson 2001), but it is also widely regarded as problematic for evaluating complex interventions of the kind often encountered in education, criminal justice and the wider social sciences. The response to these concerns has been to support the development of a model for complex interventions, calling for the consideration of a fuller research cycle involving theory and in-depth study as well as the trial itself. This model has been used successfully as a basis for trials in education and health promotion, and has important similarities to the more recent 'design experiment' methodology applied to educational innovation (Bannan-Ritland et al. 2006).

A group of researchers based at the University of York has recognised the growing need for wider understanding of the use of trials in public policy, and has instituted a supportive collaboration (the York Trials Methods Group) among departments that are undertaking trials. The group is currently undertaking trials in education, psychology, crime, social work, health studies, and economics. The collaboration provides mutual support and expertise via meetings and workshops, and has a training function to support the academic development of junior staff within the university. For example, a trials methods course is run annually for postgraduate students and researchers. The same group is behind the planned setting up of the £20 million 'Institute for Effective Education' at York. In addition, the group has been tasked, as part of the ESRC Researcher Development Initiative, with offering similar support to social science and public policy researchers on a national basis. Also, by widening our collaboration we increase the likelihood that very large trials involving more than one centre can be undertaken.

The chief purpose of this paper/presentation is to publicise this Initiative and the activities within it, leading to a fruitful discussion both of the concepts involved and of plans for the future. The authors hope that the session will be attended both by students and new researchers seeking freely available professional guidance, and by more senior researchers interested in contributing to an understanding of the promise and pitfalls of conducting trials in public policy. The term 'trials' is understood very widely here, and includes randomised controlled trials, natural experiments, and design studies. Future activities include face-to-face and residential workshops, debates, internet discussions, web-based resources, published protocols in downloadable form, and methodological papers. The intended audience includes research students, novice researchers, mid-career researchers, policy-makers, and research trainers. In order to create and provide these resources, we have assembled a team of experts in the conduct of public policy interventions, based across the UK and abroad.

Across a range of social science fields, we wish to contribute to a growth in the number of researchers who hold mature and reasonable views on the value of rigorous interventions, and who can be appropriately critical and appropriately appreciative of progress in this area. This would be achieved directly by addressing new researchers and their mentors, and indirectly by affecting the future training of new researchers in cooperation with existing research trainers.

Hypothetical illustration

Imagine a study investigating the determinants of a particular illness among humans, which focused exclusively on a large number of patients diagnosed with the illness. The study might collect further information, and discover that most of these patients had also grazed their knees as infants. Thus, the study results could be summarised as being that most people with the illness had previously grazed their knees. But, clearly, such a finding is only of any value if it is not also true that most people without the illness had previously grazed their knees. What is missing is a control or comparison group of equivalent people not diagnosed with the illness. A fair comparison group is an essential component of such a claim to scientific knowledge. Despite this, literature searches reveal that a very high proportion of research studies in education focus exclusively on one group, and still try to make substantive comparative claims (Gorard and Smith 2006).

Now imagine that the same study does have a comparison group, and finds that the group with the illness is substantially more likely to report having grazed their knees badly as children. This is a comparative finding for the two groups, but even it does not mean that there is any coherent relationship between the two characteristics. If the group with the most knee grazes were older, this might be because they were also less likely to wear long trousers as children, and because they are older they have had more time to develop the illness. The apparent relationship would be spurious. Alternatively, if the group with the most knee grazes were younger, this could be because they were more likely than the older group to remember such an early episode. Age (or time) is here a confounding variable, whose influence must be eliminated in this example. And, of course, age is only one of a very large number of possible confounds, including diet, poverty, sex, or ethnic differences between the two groups. In order to make a fair claim about any tentative link between the illness and knee grazing, we would need to make sure that the comparison group matched the illness group as fully as possible. But even in the rare piece of education research that has a comparison group, little attempt is made to match it to the focus group, and almost no research reports attempt to consider the impact of confounds when drawing conclusions from evidence.
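To make the role of the confound concrete, the following minimal simulation (written in Python purely for illustration; the probabilities and cohort sizes are invented, not drawn from any real study) generates an older and a younger cohort in which age alone drives both knee grazing and the illness, so that the two characteristics appear related in a pooled comparison even though neither influences the other:

    import random

    random.seed(1)

    def person(age_group):
        # Age drives both characteristics; the probabilities are invented for illustration only.
        p_graze = 0.7 if age_group == "older" else 0.3   # older cohort grazed knees more as children
        p_ill = 0.4 if age_group == "older" else 0.1     # older cohort has had more time to develop the illness
        return {"age": age_group,
                "grazed": random.random() < p_graze,
                "ill": random.random() < p_ill}

    sample = [person("older") for _ in range(5000)] + [person("younger") for _ in range(5000)]

    def graze_rate(people):
        return sum(p["grazed"] for p in people) / len(people)

    ill = [p for p in sample if p["ill"]]
    well = [p for p in sample if not p["ill"]]

    # Pooled comparison: grazing looks 'related' to the illness...
    print("grazed, with illness   :", round(graze_rate(ill), 2))
    print("grazed, without illness:", round(graze_rate(well), 2))

    # ...but within each age group the apparent relationship disappears.
    for group in ("older", "younger"):
        ill_g = [p for p in ill if p["age"] == group]
        well_g = [p for p in well if p["age"] == group]
        print(group, round(graze_rate(ill_g), 2), round(graze_rate(well_g), 2))

Within each age group the grazing rates of the ill and the well are essentially identical; only when the two cohorts are pooled does the spurious relationship appear.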

Now imagine that the same study has managed to match the two groups, perhaps by identifying, for each person with the illness, a comparator of the same age, socio-economic status, diet, sex and ethnicity. And suppose it is still the case that, across the pairs, the member with the illness is more likely to report knee grazing. We still face three major problems before we can validly claim a relationship between grazing and the illness. First, if the characteristics of all the people who could have been chosen for the two groups differ, then the comparison is still confounded. For example, if people with the illness tend to be older than people without it, then matching individuals by age across the two groups disguises this confound but does not eliminate it. Second, we have no valid formal way of deciding how strong the pattern or relationship has to be before it becomes of interest (Gorard 2006a).

Third, even if we accept that there is a relationship, we do not know what form it takes. We certainly have no evidence to conclude that knee grazing leads to (i.e. causes) the illness (Gorard 2002a). In order to make a causal claim we would need a stronger form of evidence, of the kind provided by an intervention. Thus, we have outlined the basis for a fair test: an intervention and a control, applied to two equivalent groups in a way that reduces or eliminates possible confounds. The basic design of a trial is a necessary, but not sufficient, requirement for making even simple claims to knowledge.
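By way of contrast, the sketch below (again purely illustrative, with invented group sizes, outcome scores and effect) shows the skeleton of such a fair test: a pool of participants is randomly allocated to an intervention or a control group, so that characteristics such as age are balanced in expectation without being measured or matched, and the effect is estimated as a simple difference in mean outcomes:

    import random

    random.seed(2)

    # Invented pool of participants, each with a pre-existing characteristic (age)
    # that does not need to be measured for the allocation to be fair.
    pool = [{"id": i, "age": random.randint(20, 80)} for i in range(200)]

    random.shuffle(pool)
    intervention = pool[:100]
    control = pool[100:]

    def outcome(person, treated):
        # Invented outcome: a small benefit of treatment plus noise.
        return 50 + (5 if treated else 0) + random.gauss(0, 10)

    treated_scores = [outcome(p, True) for p in intervention]
    control_scores = [outcome(p, False) for p in control]

    mean = lambda xs: sum(xs) / len(xs)
    print("mean age, intervention:", round(mean([p["age"] for p in intervention]), 1))
    print("mean age, control     :", round(mean([p["age"] for p in control]), 1))
    print("difference in mean outcome:", round(mean(treated_scores) - mean(control_scores), 1))

Random allocation, rather than matching, is what handles the unmeasured confounds; the analysis itself need be no more than a comparison of two means.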

All of these points are obvious, and would be patronising to discuss (for even many primary school children know this much about a 'fair test' from their science learning), except for the fact that any review of the education research literature in the UK will show that very few reports attend to all of these points, and that most ignore all of them. Repeated analyses show how few trial-based designs are reported even in the most prestigious outlets (Gorard et al. 2004a, Slavin 2004, Abrami and Bernard 2006). In fact, most studies making comparative claims do not have a comparison group on which to base their claim (Gorard et al. 2006). 'Education is in fact a particularly depressing example [of public policy]. There are thousands of academics working in what is a very highly policy relevant field yet they exercise almost no role in holding government to account either directly or through democratic debate… With a few honourable exceptions the huge academic community has little to contribute' (Johnson 2004, p.21). The National Strategies (e.g. for numeracy and literacy) have been particularly deficient here (Ruthven 2005).

This unwillingness to challenge policy and practice initiatives through definitive testing means that governments tend to be flippant in introducing expensive policies with no clear evidence base, while education researchers tend to be conservative and very cautious in theory development. In fact, the situation should be the other way around (Swann 2005). It matters, because different research methods give different results (Gorard with Taylor 2004). Attempting the 'definitive test' phase of the full cycle of research (Figure 1) using a completely different method from a trial can give very different answers. Using post hoc propensity score matching, for example, would lead to the 'wrong' result in comparison to a fair test nearly half of the time (Ty Wilde and Hollister 2002), and 'qualitative' impressions, such as the reports of teachers, can be misleading in isolation (Connolly and Hosken 2006).

Figure 1 – Outline of a full cycle of education research

Adapted from Gorard with Taylor (2004) and Bannan-Ritland et al. (2006), forming the basis for the new Institute for Effective Education at the University of York.

The promise and problems of educational trials

This is why the ESRC has funded a two-year project on the promise of trials in public policy as part of its Researcher Development Initiative. As in US education before the No Child Left Behind legislation, a higher proportion of policy-makers and advisers in the UK want (or claim to want) rigorously tested evidence than the proportion of public policy researchers willing and able to provide it. Unlike in US education, a growth in trials is, quite rightly, being attempted through encouragement rather than legislation.

Education is a particularly appropriate area of public policy for creating trial evidence. Trials are easier to conduct than in medicine (where they have been largely, though not universally, successful, most particularly in eliminating ineffective treatments). Quite often the treatment (such as a new teaching strategy, admissions code, or funding formula) is compulsory; there are fixed attendance patterns, with students conveniently organised into schools and classes; and there are excellent official databases at pupil level (Styles 2006). The variation and interaction found in the social settings of schools, often presented as a hurdle, does not mean that we cannot say things about the average effects of treatment (Slavin 2004). In fact, large-scale trials with random allocation to treatments are necessary in education precisely because of that variation, which is not found in some areas of natural science (Fitz-Gibbon 2003). Educational experiments, and teaching experiments or design studies, are relatively easy to conduct, even for novice and practitioner researchers (Smith and Gorard 2005, Gorard et al. 2004b). Their results are, if anything, easier to analyse than those of other research designs, often eschewing complex calculations in favour of transparency, rigour and qualitative judgement (Gorard 2006a, 2006b). As Ernest Rutherford pointed out: 'If your experiment needs any statistics, you ought to have done a better experiment' (in Bailey 1967).

Above all, perhaps, such approaches are more ethical than using an untested intervention (Gorard 2002b). This realisation is now embodied in the BERA (2004) ethical guidelines, with requirements that education researchers take a disinterested approach to research design, analysis and interpretation, and 'must employ methods that are fit for the purpose of the research' (p.10). As the hypothetical illustration above makes clear, a causal claim requires an intervention and a comparison (Gorard 2002a).

Of course, there are opponents of this conclusion, such as Barrow (2005), who calls any form of scientific research an 'experiment', and uses the field of school effectiveness as an extended example of experimental research. Yet this field represents perhaps the antithesis of experimental approaches: the post hoc dredging of sullen datasets. He also rails against the dominance of experimental work in education (see above)! The confusion evident in his argument is clear in his setting up as adversaries, rather than allies, those 'who believe that our educational practice needs to be based on more empirical research… and those who are sceptical of the value of such research' (p.13). It is, more than anyone, the sceptics of existing work who are calling for more and better empirical research (Gorard 2002c). According to some commentators, this antagonism to science (really just a synonym for research) is growing in HE (Steuer 2002), covered by the pretence of those involved that they are being social scientists. Such pretend social science is overly concerned with social theory and all of the post- and -ism terms (Gorard 2004), and with nebulous buzzwords such as globalisation, risk, and the network society. It may fool students and those responsible for funding, and thus undercut, through the mechanisms of peer review, the efforts of those trying to do better. This is a major ethical issue for the public funding of policy research in the UK.

Our initiative

Our project, with its website, hopes to play a small part in encouraging a higher level of discussion on these matters through its debates forum, Wiki item development, working parties of social science trainers across disciplines/HEIs, and our free face-to-face events. We are also generating a register of UK trials, and reporting guidelines for public policy adapted from the CONSORT statement. We are holding an annual conference open to all interested parties. There has been considerable interest from policy-makers, including the Department for Work and Pensions, the Home Office, and the Cabinet Office – but not, so far, from the DfES. We have a growing mailing list for our quarterly newsletter 'Trials in Public Policy', which includes a brief column of practical hints and solutions to research problems. There are, of course, problems to be faced in conducting trials and pitfalls to be avoided (such as naïve interventions, insufficient participants, poor randomisation procedures, and multiple significance testing with numerous and inappropriate subgroup analyses); a rough sample-size check of the kind sketched below shows, for example, how quickly 'insufficient participants' becomes a problem. But the situation will be better improved by rational engagement and critique than by outright rejection.[1] Even where practical problems are insuperable, and a trial is not possible or ethical, research can still be improved from where it is now, in general, through a consideration of trial design (a thought experiment). For example, if an intervention is not possible then any causal claims are weakened, but matching and comparison groups are still possible.
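The calculation below uses the standard normal-approximation formula for a two-group comparison; the significance level, power and effect sizes are conventional illustrative choices, not recommendations for any particular trial:

    # Approximate participants needed per group for a two-group trial,
    # using the standard normal-approximation formula:
    #   n per group ~= 2 * (z_alpha/2 + z_beta)^2 / d^2
    # where d is the expected effect size in standard deviation units.
    import math

    Z_ALPHA = 1.96  # two-sided 5% significance
    Z_BETA = 0.84   # 80% power

    def n_per_group(effect_size_d):
        return math.ceil(2 * (Z_ALPHA + Z_BETA) ** 2 / effect_size_d ** 2)

    for d in (0.5, 0.3, 0.2, 0.1):
        print(f"effect size {d}: about {n_per_group(d)} per group")

At an expected effect of half a standard deviation a modest trial will do, but reliably detecting an effect of a tenth of a standard deviation needs well over a thousand participants per group.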

References

Abrami, P. and Bernard, R. (2006) Research on distance education: in defense of field experiments, Distance Education, 27, 1, 5-26

Bailey, N. (1967) The mathematical approach to biology and medicine, New York: Wiley

Bannan-Ritland, B., Gorard, S., Middleton, J. and Taylor, C. (2006) The ‘compleat’ design experiment: from soup to nuts, in Kelly, E. and Lesh, R. (Eds.) Design research: investigating and assessing complex systems in mathematics, science and technology education, Mahwah: Lawrence Erlbaum

Barrow, R. (2005) The case against empirical research in education, Impact, 12, Philosophy of Education Society of Great Britain