Joint funding bodies’ review of research assessment
Submission from the Royal Economic Society
1 Introduction and motivation
The aim of the research assessment exercise (RAE) is to assist with the efficient allocation of resources to support research activity. Achieving such an aim raises six main issues, as follows:
(1) Accurate measurement of research quality at the level of specific contributions (section 2);
(2) Cumulating such contributions for individual researchers (section 3);
(3) Appropriate aggregation of the resulting measures within departments (section 4);
(4) Selecting suitable panels (section 5);
(5) Efficient allocation of research funding between units (section 6);
(6) Providing incentives to sustain and improve research performance (section 7).
Each of these issues is open to analysis, taking account of their important interactions. Submitting individuals and their institutions will seek to optimize their anticipated outcomes given any rules that the RAE promulgates. Panels will be concerned both to achieve cross-institution comparability and to promote the health of their discipline. The HEFCE is concerned to achieve an equitable and efficient allocation mechanism that enhances future prospects. Thus, any methods of assessment must be guided by an analysis of the incentives facing every party, the resulting (possibly distorting) feedbacks, game playing by all parties within the letter of the rules, and the conditional optimization by each party given the behaviour of all the others. It is crucial that the rules of the exercise—as a whole, and for each panel—should be clear, published in advance, and adhered to. A key aspect of this analysis is noting, and suggesting ways of avoiding, inadvertent distortions of outcomes.
We assume that:
(1) any future RAE will be restricted to the assessment of research quality (as against volume), using a broad and inclusive definition of research;
(2) the RAE will be based on peer review, as in previous exercises, without pre-judging how any panels might be selected or their composition;
(3) RAE outcomes will be based on the individual research contributions of each faculty member entered, aggregated to provide an overall score for each department.
2 Measuring research quality of specific contributions
The ‘quality’ of any specific research output can be judged in terms of its:
creativity, technical merits, substantive contribution, empirical relevance, scholarship, and stimulus to further developments.
Panels are presumed to have the professional competence to judge such attributes (see section 5), although some aspects (such as the last) may take considerable time to show through, and genuine creativity is often not well received initially. Expert ‘peer group’ reading of the putative contribution is our only present system, applied either at the output source (publication in a reputable international journal being deemed sufficient to define ‘international quality’) or by a suitable subset of panel members. It is important not to prejudge quality as low merely because the outlet is unimpressive, so the input and expertise of the panel are crucial.
3 Cumulating research contributions of individuals
The ‘quality’ of a researcher is a latent signal to be extracted from observable outcomes in the noisy environment that constitutes creative research. The general theory of signal extraction seems broadly applicable, and indicates averaging across both time and items. Thus, a long evaluation period seems desirable, perhaps with a minimum of five years (and possibly including one item – presumably the best – from the previous RAE), and potentially longer for some units of assessment (UOAs) to reflect the circumstances of their subject area (the Humanities already use seven years). Indeed, even longer evaluation periods might be used specifically to give some idea of the impact of an individual’s writings. Doing so would also reduce the current distortion against long-gestation fundamental research. The opportunity cost would be a more ‘historical’ allocation, which might drift for some institutions over the horizon from the initial rating. This difficulty could be slightly offset by re-investigating the possibility of using forward-looking indicators (e.g., final acceptance letters, or evidence that the relevant item was being ‘set’ for publication). In any case, even with four- to five-year gaps, the results of one exercise are barely announced before institutions start preparations for the next, since publication lags are long.
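A minimal formalization of this signal-extraction argument (our illustration, assuming independent and identically distributed noise across assessed items) shows why averaging over more items and years sharpens the estimate of latent quality:

```latex
% Stylized model (an assumption for illustration, not part of the
% submission): each assessed item i gives a noisy reading x_i of a
% researcher's latent quality q.
x_i = q + \varepsilon_i, \qquad \varepsilon_i \ \text{i.i.d.}, \quad
\mathrm{E}[\varepsilon_i] = 0, \quad \operatorname{Var}(\varepsilon_i) = \sigma^2 .
% Averaging n items shrinks the noise in the quality estimate:
\hat{q} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\operatorname{Var}(\hat{q}) = \frac{\sigma^2}{n} .
```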
Equally, a balance must be struck between an adequate number of items to be evaluated and an inadvertent emphasis on volume. One major publication per annum seems a demanding, if reasonable, benchmark in many subjects. Different subjects could be asked to define their own criteria, mediated by expert views of the appropriate learned societies and official institutions within their disciplines, at the possible cost of less ‘comparability’ across UOAs. Four items already seems close to imposing insuperable burdens on panel members. Panels might need persuading that ‘up to four’ means precisely that: we are aware that some panels deemed fewer than four items to indicate a lack of research productivity.
The choice of the ‘averaging procedure’ across the items submitted by any one researcher also merits careful consideration, since ‘high quality’ is an upper bound: specifically, much more weight should be attributed to the best of the items considered, rather than weighting all items equally. Applied to a researcher’s total output, therefore, a maximum score could be as high as 20 for 4 items submitted—where ‘international’ corresponds to 12—using (say) the scale: outstanding (5); excellent (4); international (3); high national (2); national (1); sub-national (0). Such judgements are already fairly widespread in the research community, and may ease rather than complicate a panel’s task. In addition, panels may wish to make use of citation counts for those with more than ten years’ experience, to obtain some measure of each individual’s previous writings. This would provide additional information, particularly since a significant proportion of papers, even in the top journals, are never cited and appear to make no impact whatever. We return to the issue of aggregating across individuals in section 4 below.
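A minimal sketch of such a scoring rule (illustrative Python; the function and the choice of counting only the best four items are our assumptions, not a proposed implementation):

```python
# Illustrative only: score a researcher's submitted items on the
# six-point scale suggested above and sum the best four (maximum
# 4 x 5 = 20; four 'international' items give 4 x 3 = 12).
SCALE = {
    "outstanding": 5, "excellent": 4, "international": 3,
    "high national": 2, "national": 1, "sub-national": 0,
}

def researcher_score(item_grades, max_items=4):
    """Sum the grades of up to `max_items` items, counting the best first."""
    points = sorted((SCALE[g] for g in item_grades), reverse=True)
    return sum(points[:max_items])

assert researcher_score(["outstanding"] * 4) == 20
assert researcher_score(["international"] * 4) == 12
```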
Other professional activities that enhance the quality of research output, such as refereeing and editing journals, should logically be included within the definition of research, again to avoid disincentives to undertake such work. The problems of measurement here are serious, although editors could be asked for information on referee inputs, and panels could use their professional knowledge on the value-added by editors of major international journals. However, the market seems to be responding to the present exclusion by raising honoraria for editors, and although folklore suggests refereeing may be suffering, some journals now pay for that service as well.
If quantitative assessments of quality (such as ‘bibliographic measures’ like citation counts) are to be combined with peer review in some systematic way, precise definitions are required with criteria that are not open to distortion once used: controlled testing of their potential impact is essential. Presently, indicators of ‘peer esteem’ are reported, but panels appear to vary greatly in the weight accorded to these in their final judgements of a department’s rating.
4 Quality aggregation within departments
A graduated scale, avoiding significant funding thresholds, has many advantages:
(1) reducing the ‘hiring frenzy’ just prior to RAE exercises, followed by the doldrums just after, to the detriment of that specific ‘generation’ of young scholars;
(2) reducing the probability that small classification mistakes by submitters, or panels, would have large financial consequences;
(3) lowering the chances of litigation when a difficult decision close to a boundary was misjudged;
(4) reducing panel debates on boundary cases;
(5) avoiding dubious calculations by departments on which faculty members to enter so as to maximize their rating.
Such a scale could build on the 20-point scale for individual faculty discussed in section 3 above, reflecting work ranging from outstanding down to zero value. These scores would be cumulated across each department, and the total would determine its funding. For example, 10 individuals with a cumulative total of 120 points would be accorded the same funding however that total were reached (e.g., 6 at 20 with 4 at zero; or 10 at 12; etc.). Thus, all staff could be costlessly submitted (avoiding one of the present distortions), with research-inactive staff still not funded. Such a process would also properly reward the highest quality, most innovative and valuable research far more than minor contributions. Departments with equal scores would receive equal funding: at present, they may receive divergent amounts, since an ‘international’ rating is based on the median-quality researcher, and a slight mistake by a department (or even by a panel) could lead to a major rating change. Within such a system, the HEFCE could still offer whatever premium it chose to the very best researchers by changing the basic scale; a more extreme illustration is: outstanding (10); excellent (8); international (5); high national (3); national (1); sub-national (0).
Alternatively, but with incentive effects on who is entered, a related notion would be to compute the average score per faculty member entered. A department of 6 ‘stars’ and 4 inactive staff could then be funded at a higher (or even lower) level, as the HEFCE decided was most desirable, with consequential impacts on how individuals devoted effort to research (see section 7). Inputs (such as a department’s success at fund raising, for example) should not be used in judging output quality, except in so far as they reveal ‘peer esteem’. They would in any case be otiose in a system based on individual outputs like that above.
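To illustrate the two aggregation rules side by side (a sketch under our own assumptions; the numbers reproduce the examples above):

```python
# Illustrative only: cumulating individual scores within a department.
def department_total(scores):
    """Total-score rule: funding driven by the sum of individual scores."""
    return sum(scores)

def department_average(scores):
    """Average rule: score per faculty member entered, which alters
    the incentive to enter research-inactive staff."""
    return sum(scores) / len(scores)

# Under the total rule, 6 staff at 20 plus 4 at zero and 10 staff
# at 12 both yield 120 points, and hence equal funding.
assert department_total([20] * 6 + [0] * 4) == department_total([12] * 10) == 120

# Under the average rule, entering only the six 'stars' gives 20.0,
# whereas entering all ten staff gives 12.0.
print(department_average([20] * 6))            # 20.0
print(department_average([20] * 6 + [0] * 4))  # 12.0
```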
5 Composition of panels
The selection and composition of panels is crucial to the success of the research evaluation exercise. Panels must have the professional competence to judge research-quality attributes, and high standing to be deemed credible. They must obviously be independent.
Overseas membership of panels may be helpful, to provide an element of international comparison within UOAs. There are three options: completely international panels; international assessors on panels; and an ‘external international examiner’ system to look at top, bottom and marginal cases. The third option balances cost and coverage, but the use of external assessors in the latest RAE does not seem to have markedly affected the ratings awarded by panels.
All cited research items should be available, to be provided to assessors on request. There must be adequate resources at the HEFCE to check both staff entered and citation lists, and while financial disincentives for attempted fraud could be introduced, reputation alone seems to provide strong incentives for accurate entries. Any increase in the monitoring load would need to be accompanied by a more generous scheme for recompensing assessors for the very considerable amount of time involved (e.g. by providing buy-outs), especially in order to attract assessors of the highest calibre.
6 The allocation of research funding
An agreed score for each department within each discipline provides a basis for the allocation of research funding within subjects. This has been addressed, in part, in the previous sections, and depends on how basic scores are mapped to final funding levels. The greater the ‘premium’ placed on the most outstanding research, the higher the incentive for departments to attract the best researchers. We are not aware of empirical studies of returns to scale or scope in academic research, but most academics believe there are advantages to being part of high-quality groups. The HEFCE could choose an allocation system that sustains this by a more than proportionate increase in funding for higher average scores (howsoever calculated), as it does at present. The potential distortion is ‘game playing’ by departments (e.g., creating shadow departments or ‘teaching-only’ contracts for non-performers), which may reduce actual research effort while increasing the (weighted) amount measured. Conversely, ‘critical mass’ arguments point towards also (or instead) using the total score.
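One hypothetical functional form for such a more-than-proportionate premium (our illustration; the unit cost and exponent are assumptions, not the HEFCE’s actual formula):

```python
# Hypothetical convex funding rule: an exponent gamma > 1 rewards
# higher average quality more than proportionately; gamma = 1 would
# make funding simply proportional to the total score.
def funding(total_score, n_staff, unit=1000.0, gamma=1.5):
    avg = total_score / n_staff
    return unit * n_staff * avg ** gamma

# A department averaging 16 receives more than twice the funding per
# head of one averaging 8, even with identical staff numbers.
print(funding(160, 10) / funding(80, 10))  # 2 ** 1.5, approx. 2.83
```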
The question of allocating research funding across UOAs is much more difficult. The basic problem is that, whether or not panels contain international assessors, evidence from past RAEs indicates that comparisons of either levels or changes of panel scores across subjects have no validity. For example, the dramatic differences in grade inflation across subjects, and the impact of that inflation on across-subject funding allocations, have undermined confidence both within and outside academia. Furthermore, no amount of increased involvement of international experts will help with this, since there is an inevitable tendency for them to ‘go native’ and to want to support ‘their subject’. So if the RAE is to make any contribution to a fair and sensible allocation of funds across subjects, some objective system of cross-subject comparison must be found.
A possible approach is the following. First, survey all UK heads of department and ask them to name the top ten non-UK departments in the world. From this survey, derive a list of the top ten non-UK departments. Second, generate an objective mechanism for rating the top ten UK departments against the world top ten. This rating must be numerical and comparable across subjects. For example, we might take the ratio of citations per capita referring to the last ten years of publications. Thus we take citations per capita in the UK top ten divided by citations per capita in the non-UK top ten. This number can then be used to adjust the RAE ratings in each subject to make them more comparable across subjects. Note, we are not comparing citations per capita across subjects, which would not be valid for obvious reasons.
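Written out in our own notation (a formalization of the proposal above, not part of the original text):

```latex
% For subject s: C and N denote total citations (to the last ten
% years of publications) and staff numbers, for the UK top ten and
% the non-UK ('world') top ten departments respectively.
A_s = \frac{C^{\mathrm{UK}}_s / N^{\mathrm{UK}}_s}
           {C^{\mathrm{W}}_s / N^{\mathrm{W}}_s},
\qquad
\tilde{R}_{d,s} = A_s \, R_{d,s},
% where R_{d,s} is the RAE rating of department d in subject s and
% \tilde{R}_{d,s} its adjusted value. Because A_s is a within-subject
% ratio, subject-specific citation conventions cancel out.
```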
A simple alternative would be to make use of the ‘top hundred’ in each subject, which the company that runs the citation indices produces. The proportion of UK academics in the world top hundred gives some indication of the quality of research in each subject area in world terms. There are other possible mechanisms, but some sensible procedure is urgently required. At present, the RAE scores are more or less worthless for cross-subject comparisons.
Turning to the general issue of the efficient allocation of resources between UOAs: in principle, the marginal ‘research quality productivity’ should be equated across all units, so that the impact of the next £1 of funding is the same everywhere. In practice, unfortunately, it is almost impossible to undertake the relevant calculations, which must reflect both the ‘wealth creation’ and the ‘social welfare’ aspects of research findings. Nevertheless, one direct implication is that the cost of research should not determine the funding allocation, except perhaps inversely, where high cost is the result of inefficiency and hence low marginal value. Cost and funding should only be positively related coincidentally, when the highest-cost disciplines are also the most productive: a productivity-based formula could address that case directly without needing a cost element. Otherwise, tying funding to costs risks allocating most resources to the least value for money, with the perverse incentive that costing more now brings in greater funding later. Yet the present system uses explicit cost bands to determine its allocation. There may be grounds for such a decision, but they are not obvious; moreover, cost bands lock in historical anomalies in funding, since UOAs better funded in the past have since undertaken more expensive research, creating a positive feedback. Changing such a link should, of course, be done slowly, so as not to waste past investments, but a detailed reconsideration of ‘cost-band funding’ is overdue. Society may value high-quality research per se, and hence be willing to pay whatever costs are needed to sustain each area, but that seems unlikely in view of past real-value reductions in research funding.
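The equalization condition can be stated formally (a textbook formulation, under the assumption that each UOA u converts funding f_u into research quality via some function Q_u):

```latex
% Stylized planner's problem: allocate the research budget F across
% UOAs to maximize total research quality.
\max_{\{f_u\}} \ \sum_{u} Q_u(f_u)
\quad \text{subject to} \quad \sum_{u} f_u = F .
% First-order condition: marginal quality productivity is equalized,
Q_u'(f_u) = \lambda \quad \text{for all } u,
% so the next pound of funding raises quality by the same amount
% everywhere; nothing in this condition ties funding to costs.
```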
Equity in relation to quality across units of assessment within broadly adjacent fields is essential. Greater interaction and comparability between panels would be welcome, to ensure parity of quality assessment and to reduce incentives to move staff into units where competition is less intense. (There is strong evidence of weaker Economics Departments moving into Management and Business: the number of departments entered under Economics has fallen precipitously since the first RAE.)
Depending on the allocation system between UOAs, panels may face inappropriate incentives in deciding on the average rating they award within their discipline. The present system appears to entail different funding levels for UOAs in the same cost band when some panels award ratings different from the overall average. An example best highlights the possible anomalies: Economics in Scotland suffered a drastic cut when the Economics panel, almost alone nationally, awarded a slightly lower national average rating than in the previous RAE, despite its written commentary praising the improvement in research quality, whereas the average grade across all panels rose by 0.6.
In the previous RAE, disciplines in the same cost band received the same basic funding unit (determined by the number of faculty in departments rated 3 or above). Such a method entailed that if all departments in a UOA were awarded a 5, they received the same funding per faculty member as if they had all received a 3. Thus, departments in a UOA with overall high ratings could receive less finance than a lower-rated department in another UOA (e.g., a 4-rated department when all others were 3). Finally, for disciplines as a whole, departments dropping numbers to achieve higher grades had perverse effects: in a UOA with two departments of 20, each potentially rated 3a, if each omits several faculty to achieve a 4, they receive exactly the same funding for the remaining faculty submitted, and hence less in total. Such an allocation system induces perverse outcomes, and the next inter-UOA allocation system needs careful consideration to avoid such distortions.
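A stylized version of that arithmetic (the flat unit, the rating-3 threshold, and the choice of omitting five faculty are our assumptions, following the description above):

```python
# Stylized model of the previous allocation rule as described above:
# a flat funding unit per faculty member entered, paid only at rating 3+.
def dept_funding(n_entered, rating, unit=1.0):
    return unit * n_entered if rating >= 3 else 0.0

# Two departments of 20, each rated 3: 20 units apiece.
print(dept_funding(20, 3))  # 20.0
# Each omits 5 faculty to secure a 4: the per-head unit is unchanged,
# so each now receives only 15 units in total -- a perverse outcome.
print(dept_funding(15, 4))  # 15.0
```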