Confidence-Based Marking - towards deeper learning and better exams
A.R. Gardner-Medwin,
Dept. Physiology, University College London, London WC1E 6BT
Abstract [ submitted to ALT-J 2004 ]
Experiences with confidence-based marking for formative self-assessment over 10 years, and for exams over 3 years, are surveyed from the perspectives of pedagogy, motivation, design, implementation and statistics. Recent developments to encourage dissemination and collaboration are described. The system employed, in which students rate confidence in each answer on a 3-point scale by computer, via the web or on an optical mark reader card, is simple, popular, fair and readily understood. In order to gain the most marks, students must reflect and attempt to justify confidence in an answer, and report their confidence honestly and accurately. This places a premium on careful thinking, and on checks and the tying together of different facets of knowledge, thereby encouraging deeper learning. In exams it generates higher quality data than conventional scores, with greater statistical reliability, arguably greater validity as a measure of knowledge (and from even the most sceptical perspective, no loss of validity) and less contamination from chance factors associated with weak and uncertain knowledge.
Introduction
Confidence-based marking, in which a confidence rating is taken into account in the marking of each answer, has been in use at University College London (UCL) for 10 years now (Gardner-Medwin, 1995). It was set up as a tool to improve students' study habits, in a protocol known as LAPT (London Agreed Protocol for Teaching) devised for several medical schools, now mainly subsumed within UCL and Imperial College London (ICL). For study and revision purposes we employ both dedicated software for MS-Windows and a browser-based resource (LAPT-lite) for easy access outside the original institutions. The LAPT-lite system now offers equivalent flexibility and more material, in more subject areas and for different levels of education. It is the main platform for future development and collaboration, and for use in other institutions and schools. Since 2001, we have used confidence-based marking for the True/False components of exams in the first two years of the medical curriculum at UCL, with the same marking scheme but using optical mark reader technology.
This paper briefly surveys the main pedagogic, motivational, design, implementation and statistical issues, referring extensively to previous publications where points are covered in more detail (Gardner-Medwin, 1995; Gardner-Medwin & Gahan, 2003). Readers are encouraged, before proceeding far, to try out confidence-based marking (CBM) for themselves on the LAPT-lite website, using any of the diverse exercises in different fields. Experience shows that students approaching CBM as learners tend to understand its logic instinctively through application, much more readily than through exposition and discussion.
Confidence-based marking has been studied quite extensively, mostly before computer-aided assessment was readily available (Ahlgren, 1969). Our experience at UCL and ICL is probably the largest scale project where it has been used for routine teaching, learning and assessment. The mark scheme under LAPT asks for a confidence judgement on a 3-point scale (C=1, 2 or 3). This judgement is made independently for each answer in an exercise. When the answer is correct, this is the mark (M) that is received (M=1, 2 or 3). If the answer is wrong and expressed with low confidence (C=1), then there is no penalty (M=0). However, with C=2 and C=3 there is a progressively higher penalty for incorrect answers. For True/False questions, LAPT uses penalties of -2 for C=2 and -6 for C=3. A slightly moderated scale of penalties is used for more open question styles (MCQ, text or numeric answers), where unconfident though nevertheless significant knowledge may be accompanied by quite low estimates of the probability of being correct (<50%). The mark schemes are set out in Table 1. Note that confidence levels are defined simply by the scaling of marks associated with a correct answer (C=1, 2 or 3). We deliberately avoid descriptive terms for the different degrees of confidence (certain, very sure, unsure, unconfident, guess, etc.), because they tend to be relative terms, meaning different things to different people and in different contexts. Appropriate use of the CBM scheme is defined not by linguistic norms but by the rewards and penalties, and by the ability of the student to justify taking the risk of a penalty if a wrong answer is given.
Confidence level: / C=1 / C=2 / C=3 / No reply
Mark if correct: / 1 / 2 / 3 / 0
Penalty if wrong (T/F Q): / 0 / -2 / -6 / 0
Penalty if wrong (open Q): / 0 / -1 / -4 / 0
Table 1: Mark schemes employed by LAPT
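For readers who find code clearer than tables, the scheme reduces to a few lines. The following is a minimal sketch in Javascript (the language of LAPT-lite), illustrative only and not the actual LAPT code:

    // Sketch of the Table 1 mark scheme; not the actual LAPT code.
    // confidence: 1, 2 or 3 (0 for no reply); correct: boolean;
    // openQuestion: true selects the moderated penalty scale.
    function cbmMark(confidence, correct, openQuestion) {
      if (confidence === 0) return 0;                // no reply scores 0
      if (correct) return confidence;                // mark = confidence level
      var penalty = openQuestion ? [0, -1, -4]       // MCQ, text, numeric answers
                                 : [0, -2, -6];      // True/False questions
      return penalty[confidence - 1];                // penalty for a wrong answer
    }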
The Rationale of the Mark Scheme
Several qualitative features of pedagogic importance are immediately clear to a student thinking about how to answer a question with CBM. The fundamental points are as follows:
- To get full credit for a correct answer you must be able to justify the answer to the point that you are prepared to take the risk that if you are in fact wrong, you will lose marks. This encourages careful thinking and checking. It makes it harder to rely on rote learned facts, which are often remembered with a degree of uncertainty.
- If you acknowledge that an answer is based on incomplete or uncertain knowledge, then it is fair and correct that you should get some credit if the answer turns out to be correct, but less than for an answer based on sound knowledge and argument. Likewise, it is fair that you should be penalised less if the answer turns out to be wrong, because you have acknowledged the unreliability.
- With a properly motivating CBM scheme (transparent and correctly graded) it is clear that it is in a student's best interest to be honest about expressing his/her actual confidence, whether high or low. Setting C=3 while believing that an answer is quite likely wrong, or C=1 when sure of being correct, are both obviously poor strategies. Only certain marking schemes achieve this, and it is counter-productive to use schemes that do not (Gardner-Medwin, 1999; Gardner-Medwin & Gahan, 2003). The key issue for the student is that it is always worth trying to establish whether an answer is trustworthy, even if the result is to conclude that it is less trustworthy than initially thought.
These points encapsulate the initial reasons for introducing CBM as a means to improve students' study habits, particularly their inclination to think carefully. Unreliable knowledge of the basics in a subject, or - even worse - lack of awareness of which parts of one's knowledge are sound and which not, can be a huge handicap to further learning (Gardner-Medwin, 1995). Furthermore, the ability to make good confidence judgements and to communicate them to others, either explicitly or through inflection or body language, is a valued and necessary skill in every academic discipline. It is a skill that nevertheless remains largely untaught and untested in most forms of assessment before final undergraduate or graduate years, when it can be crucial in a viva or in demanding forms of critical writing.
Alongside the rationale for CBM, it is worth addressing some misconceptions that have emerged in discussions and correspondence with teachers. CBM does not appear to favour unfairly or encourage particular personality traits. It is sometimes suggested that CBM may favour one or other sex, usually based on the notion that it might disadvantage diffident or risk-averse personalities, supposedly more common amongst females. Two points can be made about this. One is empirical: careful analysis of comparative data from exams and formative self-assessment at UCL reveals small but statistically very clear differences in risk-aversion under these two conditions (with more cautious use of C=3 in exams), though there are no significant gender differences, even small ones, under either condition (Gardner-Medwin & Gahan, 2003). Secondly, if a student does have a tendency to be under- or over-confident, then this is an objective mismatch between expectation and performance that the student can usefully become aware of, and may be able to correct through the feedback provided by CBM. This is not to say that outwardly diffident or confident personalities are undesirable or unattractive, but rather that it is a serious handicap, especially in decision-rich occupations such as medicine, if internal inferences are not correctly calibrated for reliability.
A second, less serious, misconception is that the aim of CBM is to boost students' confidence. Of course self-confidence is something that ought to be increased by the use of any effective learning tool, and students often say, in evaluation questionnaires, that use of the LAPT system with CBM has helped to improve their confidence in the subject (Issroff & Gardner-Medwin, 1998). But they also say, and this is probably pedagogically more important, that it has helped to reveal points of weakness in their knowledge. One of the problems experienced with university entrants is that they often fail to identify these weaknesses for themselves. Often it is possible to pass both school and university exams with knowledge only half learned or poorly understood, but nevertheless sufficient in most circumstances to generate correct answers. CBM places more of a premium on the reliability of such knowledge, and on the ability to understand, link and cross-check pieces of information. The net effect is often initially to reduce students' confidence as they come to realise that sound knowledge cannot just be based on hunches.
Confidence and Probability
The choice of an appropriate confidence level (C=1, 2 or 3) is governed by two judgements that a student must make. First is the estimated probability that the chosen answer will be correct. Second is the impact of the reward or penalty when the answer turns out to be right or wrong. The objective rewards and penalties are of course set out in Table 1, and insofar as a student is trying to maximise his/her expected overall score it is easy to see the basis for an optimal decision, as shown in Fig. 1.
Fig. 1. The average mark expected on the basis of a student's estimated probability of an answer being correct, plotted as a function of that estimate, for each of the three confidence levels and for a blank reply. The two graphs are for the mark schemes shown in Table 1, for (a) True/False questions and (b) open questions for which the estimated probability of being correct may be less than 50%.
For a given estimate of the probability of being correct, one expects to do best on average, building up a total score, by choosing the confidence level with the highest line above the corresponding point on the horizontal axis. Thus for True/False questions, C=3 is appropriate if the probability (P) of being correct is >80%, C=2 if P is between 67% and 80%, and C=1 if P<67%. It is of course not possible (or at least not rational) to arrive at an estimate <50% for the probability that a preferred T/F answer is correct, since then the opposite answer would by definition be the preferred answer. However, with open questions the preferred answer may have such a low probability of being correct (P<50%), and in these circumstances C=1 is the best choice, with C=2 best for P between 50% and 75% and C=3 for P>75%.
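These thresholds follow directly from the expected value of the mark at each confidence level: reward times the probability of being right, plus penalty times the probability of being wrong. A minimal illustrative sketch in Javascript (not part of LAPT):

    // Expected mark at each confidence level C=1,2,3 for an estimated
    // probability p of being correct, under the Table 1 schemes.
    function expectedMarks(p, openQuestion) {
      var penalty = openQuestion ? [0, -1, -4] : [0, -2, -6];
      return [1, 2, 3].map(function (c) {
        return c * p + penalty[c - 1] * (1 - p);  // reward*P(right) + penalty*P(wrong)
      });
    }
    // For a True/False answer judged 70% likely to be correct:
    // expectedMarks(0.7, false) -> [0.7, 0.8, 0.3], so C=2 is optimal,
    // consistent with the 67%-80% band described above.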
The weighing up of confidence and risk is generally a much more intuitive matter than its game theory analysis here might suggest. In discussion, students rarely discuss their decisions in terms of explicit probabilities, though the relationship has been explained to them. It is of course possible that students might, consciously or unconsciously, weigh factors other than the average expected mark in making confidence decisions. Watching students (and particularly staff) in their first encounter with CBM, it is not uncommon to see some who at first regard anything less than C=3 as a diminution of their ego. But such considerations do not long survive a few negative marks. Despite an intuitive approach, students on average use the confidence bands in a nearly optimal fashion to maximise scores, with few showing proportions of their answers correct in the three bands that are outside the correct probability ranges (Gardner-Medwin & Gahan, 2003). This is consistent with the general finding that though people are rather poor at handling the abstract concept of probability correctly, they make good judgements when information is expressed in terms of concrete risks, outcomes and frequencies (Gigerenzer, 2003).
It is sometimes suggested that other mark schemes or systems for entering confidence might be preferred, perhaps with word descriptions or explicit probability judgements rather than the mark-related confidence scale employed in LAPT. This issue was extensively discussed in a previous paper (Gardner-Medwin & Gahan, 2003), where the limitations of some alternative schemes (Davies, 2002; Khan et al., 2001) are examined. The key criteria for a good marking scheme, it was argued there, are simplicity, motivation, flexibility and validity. Simplicity essentially means that the scheme must be transparent and easily remembered and understood. Motivation means (as discussed above) that students do best by honestly reporting their confidence, though of course they also do better if they can justify higher confidence. Flexibility means that the scheme can work in the same way for different question types, while validity (discussed below) means that the scores should correlate well with what one wishes to measure with an assessment system.
The modified penalty scheme adopted for open question styles (Table 1), introduced in April 2004, compromises the principles of simplicity and flexibility somewhat. The reason for this change was not any fundamental problem with the original single scheme (equivalent to that shown for T/F questions), but an issue of apportionment of the large range of confidence that people may experience for open questions (from certainty, P=100% of being correct, down to almost zero for a text answer that one might describe as "a wild guess"). With the original scheme C=1 was appropriate for all values below P=67%. It was found that new users tended to reserve C=1 for rather lower levels of confidence and as a result were failing to match their proportion correct at higher confidence levels to the optimum (averaging only 59% correct at C=2 and 82% at C=3). The best solution to this problem seemed to be to adjust the penalties to even up the ranges of probability for the three levels (Fig. 1b), so that optimal usage matches better the way people were instinctively using the three levels for these types of question, rather than to rely on practice to ensure that they learn to achieve the best marks.
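The "evening up" can be seen by computing where the expected-mark lines of Fig. 1 cross. Equating the expected marks of adjacent levels, c*P + w(c)*(1-P) = (c+1)*P + w(c+1)*(1-P), where w(c) is the wrong-answer penalty at level c, gives the indifference point P = (w(c) - w(c+1)) / (1 + w(c) - w(c+1)). A short illustrative Javascript check (not LAPT code):

    // Indifference probability between confidence level c (wrong-answer
    // penalty wLow) and level c+1 (penalty wHigh), from equating expected marks.
    function crossover(wLow, wHigh) {
      return (wLow - wHigh) / (1 + wLow - wHigh);
    }
    // True/False:    crossover(0, -2) = 0.667, crossover(-2, -6) = 0.8
    //                -> bands 50-67%, 67-80%, 80-100%
    // Open question: crossover(0, -1) = 0.5,  crossover(-1, -4) = 0.75
    //                -> bands 0-50%, 50-75%, 75-100%, a more even apportionment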
Implementation of CBM
The LAPT-lite implementation of CBM was written in Javascript, by M. Gahan and the author. Since we already had a working implementation under MS-Windows, one can ask why a browser implementation was necessary. The prime reason was to provide a platform that is easily used by anybody, anywhere, to experience CBM without downloading software. However, it also seems that with escalating security problems, browser protocols are increasingly the sole acceptable way to transfer information across firewalls. This is a pity, because browser-based software imposes severe restrictions, such as the inability to save information on the local computer. LAPT-lite is now the main development and dissemination platform.
The principal requirements for our system were as follows:
- Clear and simple entry of confidence levels following answers
- Immediate marking of answers without web delays
- Good feedback on CBM performance
- Ability to function both online and offline, from CDROM or saved files
- Facilities to submit results for analysis on a server
- Facilities to submit comments on questions, with automatic contextual information
- Simple and transparent authoring, adaptation and editing of exercise files
- Sophisticated question and answer formats, equivalent to the existing PC system
The solution adopted has been to use the server principally as a file source equivalent to a CD-ROM, from which HTML, Javascript and graphics files are obtained. Presentation and marking are done entirely on the student's computer, using information in these files without involving the server. Submission of results and comments is then carried out by sending information to a .PHP file on a server (or one of a selection of alternative server sites). This solution gives rise to an important limitation, which is that the system is intrinsically insecure: it cannot be used for high stakes summative assessment, because the downloaded files contain all the information about correct answers, which could in principle be accessed by other software to ensure that correct answers are entered. However, this is of no consequence for formative assessment, which is the main purpose. A secure system would require that the marking all be done on a server, with the attendant problems of web delays and loading issues. A second minor problem resulting from the use of client-based computations is that the collation of data about usage and performance is dependent on voluntary submission of results over the internet, and it turns out that only a small fraction of users loading exercises have been submitting this data.
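As an illustration of this architecture, the submission step might look something like the following sketch. The endpoint name (results.php) and field names are invented for illustration and are not the actual LAPT-lite interface:

    // Hypothetical sketch of client-side result submission; the endpoint
    // and field names are invented, not the real LAPT-lite interface.
    // Marking has already happened on the client; only a summary is sent.
    function submitResults(serverUrl, exerciseName, totalMark) {
      var req = new XMLHttpRequest();
      req.open("POST", serverUrl, true);   // e.g. a results.php script on one of the servers
      req.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
      req.send("exercise=" + encodeURIComponent(exerciseName) +
               "&mark=" + encodeURIComponent(totalMark));
    }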
The functions of presentation, marking and data handling are separated from those of exercise definition by dividing the script into two files - a program and an exercise file, both technically Javascript files. The program is the same for all exercises and can be independently updated. The exercise files are mostly plain or html text, easily readable and easily adapted from a wordprocessor file defining questions and answers. Files can also be created in an authoring package, or adapted from WebCT formats with a tool developed by D. Stowell. They consist of what are technically Javascript function calls, using syntax described in an authoring manual and illustrated by a simple sketch below:
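The following is purely illustrative: the function names and arguments are hypothetical, standing in for the real calls documented in the authoring manual.

    // Hypothetical exercise file: function names and arguments are
    // invented for illustration; the real syntax is in the authoring manual.
    exercise("Demonstration: basic physiology");
    question("The resting membrane potential of a typical neuron is nearer "
           + "to -70 mV than to +70 mV.", "T");           // True/False item
    question("Which ion's equilibrium potential lies closest to the "
           + "resting membrane potential?", "potassium"); // text-answer item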