Evidence-based Dentistry as Bayesian Updating
Draft of 7 October 2015
At this stage of a manuscript I place the full reference citation in the text. References will be converted to the style of the journal to which it is submitted.
I flag areas needing more work in red.
I must decide whether to redact references to my own publications as it would be obvious from them who wrote the paper. JADA does not, however, blind reviewers to the author’s identity.
The American Dental Association defines evidence-based dentistry as “the integration of best research evidence with clinical expertise and patient values” [J Am Col Dent 2010;77(4):entire issue]. There is very little empirical work on the influence of patient (and practitioner) values. There is also virtually no literature showing which experiences practitioners attend to or learn from, or how they integrate clinical expertise with best research evidence.
This problem has been studied extensively and comprehensively summarized for decision science generally [Kahneman D, Slovic P, Tversky A, Eds. Judgment under uncertainty: Heuristics and biases. Cambridge, United Kingdom: Cambridge University Press, 1982; Kahneman D, Tversky A, Eds. Choices, values and frames. Cambridge, United Kingdom: Cambridge University Press, 2000; Gilovich T, Griffin D, Kahneman D, Eds. Heuristics and biases: The psychology of intuitive judgment. Cambridge, United Kingdom: Cambridge University Press, 2002]. At the theoretical level, the solution has been known since Thomas Bayes pointed it out in the mid-eighteenth century. The likelihood that a condition exists given supporting evidence is called the posterior probability, or probability after weighing the evidence. The general situation is known as the baseline: the general probability before looking at the evidence. For example, there is about a 20 percent chance that a school-aged child selected at random will have at least one carious lesion. The prior probability is determined by looking at evidence for a particular case. An oral inspection of a child’s mouth by a dental student at a health screening, for example, provides evidence that helps refine the probability, but it would not be definitive. A lay person guessing “no caries” on every case just on principle would be right most of the time.
Clinicians are interested in posterior probabilities, the likelihood that a condition exists given the evidence and the baseline, Pr(C | E). That value can be estimated using a formula: Pr(C | E) = Pr(E | C) × Pr(C) / [Pr(E | C) × Pr(C) + Pr(E | ~C) × Pr(~C)]. The likelihood of the condition is equal to the probability that the condition will produce the evidence in groups with this kind of baseline divided by the probability of true positive evidence plus false positive evidence. As a general rule, very strong evidence is needed to decide about rare conditions; we can be more certain of common conditions, even in the face of modest evidence. In the absence of the strong baselines that a robust literature can provide, and when diagnostic confidence is compromised, probability estimates regarding clinical matters drift toward the mid-point of p = .50. This is an “I’d rather not say based on what I know now” position [Einhorn HL. Learning from experience and suboptimal rules in decision making. In TS Wallsten, Ed. Cognitive processes in choice and decision behavior. Hillsdale, NJ: Lawrence Erlbaum, 1980, pp. xx-xx].
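The arithmetic can be made concrete with the caries example. In the sketch below, only the 20 percent baseline comes from the text; the 0.70 true-positive and 0.15 false-positive rates for an oral inspection are invented for illustration.

```python
def posterior(p_e_given_c, p_e_given_not_c, p_c):
    """Bayes formula: Pr(C|E) = Pr(E|C)Pr(C) / [Pr(E|C)Pr(C) + Pr(E|~C)Pr(~C)]."""
    true_positive = p_e_given_c * p_c
    false_positive = p_e_given_not_c * (1.0 - p_c)
    return true_positive / (true_positive + false_positive)

# Baseline: about 20% of school-aged children have at least one carious lesion.
# Assume (hypothetically) the inspection flags 70% of true cases
# and 15% of caries-free children.
print(round(posterior(0.70, 0.15, 0.20), 3))  # → 0.538
```

Even after a positive finding, the posterior stays near 0.50: modest evidence against a low baseline does not justify a confident conclusion, which is the point made above about rare conditions.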
An ample medical literature and at least one paper in dentistry report that practitioners are poor at performing such estimates, even when supplied with precise figures for priors and baselines. For a summary, see Chambers et al [Chambers DW, Mirchel R, Lundergan W. An investigation of dentists’ and dental students’ estimates of diagnostic probabilities. JADA 2010;141(6):656-666]. No alternative has been identified as a consistent decision strategy depicting how practitioners integrate clinical information. Virtually all previous research has used theoretical prompts (drawing colored balls from an urn) [Kahneman D, Slovic P, Tversky A, Eds. Judgment under uncertainty: Heuristics and biases. Cambridge, United Kingdom: Cambridge University Press, 1982; Kahneman D, Tversky A, Eds. Choices, values and frames. Cambridge, United Kingdom: Cambridge University Press, 2000; Gilovich T, Griffin D, Kahneman D, Eds. Heuristics and biases: The psychology of intuitive judgment. Cambridge, United Kingdom: Cambridge University Press, 2002] or has asked for one-off estimates from physicians or dentists in response to scenarios prepared by the experimenters [Chambers DW, Mirchel R, Lundergan W. An investigation of dentists’ and dental students’ estimates of diagnostic probabilities. JADA 2010;141(6):656-666].
In the research reported here, respondents made sequential judgments of a dental matter given additional information at each point while being free to form their own estimates of prior and baseline probabilities. This was an extension into the dental context of a design published by Ward Edwards [Edwards W. Conservatism in human information processing. In B. Kleinmuntz, Ed. Formal representation of human judgment. New York: John Wiley and Sons, 1968; Griffin D, Tversky A. The weight of evidence and the determinants of confidence. Cog Psych 1992;24(x):411-435]. The respondents’ estimates were chained to test the hypothesis that clinical learning is a form of Bayesian updating.
Materials and Methods
The experimental task consisted of a paper-and-pencil exercise in which dentists estimated what proportion of the general dentists who have attempted full-mouth reconstruction cases are competent to do so. This is a personal judgment; there is no known “right” answer, and respondents are presumed to hold a range of opinions and may take action based on such judgments. Various information was supplied to respondents in a staged fashion, and the principal outcome variable was how respondents’ judgments changed as new evidence became available. This was a study of how dentists combine personal clinical opinions (baselines) with evidence (prior probabilities).
The exercise consisted of 12 questions in a paper-and-pencil format. Two global measures were taken at the beginning. Respondents were asked to report their confidence in judging a dentist’s competence based on examining samples of the dentist’s work. A score of 1.00 was defined to mean that respondents were absolutely certain of the generalizations they drew from this kind of evidence. A score of 0.50 meant that respondents felt they learned nothing from examining what a colleague had done: half of the time they would be wrong. Respondents who indicated a global confidence lower than 0.50 were removed from the study, as that would be tantamount to systematic bias. A global baseline for the probability that general dentists are competent to perform full-mouth reconstructions was also reported at the beginning of the exercise. Global competence was indicated on a scale from 0.00 to 1.00. Zero meant that no dentists who had ever attempted such cases could do them successfully; 1.00 meant that all general dentists who have done any of this work are capable of producing satisfactory results.
Next, 4 questions were posed with respect to a hypothetical Dr. X. Respondents were asked to imagine that they were serving as judges in a clinical trial where patients of this dentist presented with completed full-mouth cases. Dr. X had been randomly selected from among all general dentists who had performed such procedures. Evidence about the success or failure of each patient was presented, and the challenge was to judge the competence of Dr. X based on the evidence. Three patients were presented in succession, and respondents recorded their judgment of Dr. X’s competence after each patient. An initial judgment of Dr. X’s competence was reported after the global rating of all dentists and before seeing the first patient. A related exercise involving the impact of evidence from the literature bearing on such judgments was presented next. Finally, the pattern used for judging Dr. X’s competence from three patients was repeated with a new dentist, Dr. Y.
The judgment of the expected probability of success for Dr. X and Dr. Y taken before seeing the first patient in each case served as the baseline probability. Evidence of success or failure on patients was used in determining the prior. Because a prior is the estimate that evidence is a true positive, an adjustment was made for the information about success or failure of the case. That adjustment came from the respondents’ personal reports of the confidence they had in making such judgments based on examining a patient. In order to avoid the potential that presentation of the evidence from patients would artificially constrict respondents’ judgments, the evidence was presented as self-validated. For example, the description of the first patient contained this wording: “By your personal definition [the case is] an unambiguous failure.” All evidence was framed in terms of the respondent’s own standards, such as “one of the nicest cases you have ever seen – beautiful” and “an outstanding success by your standards.” Thus all prior probabilities used in the study were self-normed, being based on personal estimates of probable outcomes, personal confidence in evaluating evidence, and personal standards for successful or unsuccessful outcomes.
The research paper presented as evidence in the exercise was placed between the examinations of Dr. X and Dr. Y. The hypothetical paper was described as a meta-analysis of 27 large-sample RCTs, involving thousands of subjects, good sampling schemes, and tiny standard errors, analyzed by three respected scientists. It was said to exceed all conventional standards for research rigor. The paper reported that 45 percent of general dentists, in samples of those who have attempted to perform full-mouth reconstructions, were judged competent in this procedure. Subjects were asked to report the “believability” of such research, considering their own ability to comprehend such studies, on a scale of 0.40 to 1.00. Values below 0.50 were characterized as indicating misleading research.
Seventy-one fully and appropriately completed responses were obtained from clinical faculty members and orthodontic and endodontic residents at [name of institutions redacted during review]. The project was approved in the exempt category by the IRB at [name of institution redacted during review], #15-117.
Respondents’ probability judgments before seeing the first patient and then after each of the three successive patients were recorded. Bayesian values corresponding to these respondent judgments were computed from the prior and baseline data supplied by each respondent; the Bayesian calculations were thus performed individually. Bayesian estimates were chained (Bayesian updating), with the outcome judgment on one trial becoming the baseline probability for the next trial. Two estimates of priors were used in calculating the Bayesian estimates. In one analysis, the global confidence reported at the beginning of the exercise was carried throughout. In the other analysis, confidence in priors was updated along with the baselines. This involved solving the Bayesian equation backwards to determine what the immediately previous level of confidence must have been, given the known baseline and estimated outcome, in order to support the judgment given.
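The chaining and the backward solution can be sketched as follows. The baseline of 0.559 is the average initial value attributed to Dr. X in the results; the confidence of 0.75 is an invented value, and the function names are shorthand for this sketch, not part of the study materials.

```python
def update(baseline, confidence, success):
    """One Bayesian update in the study's terms: 'baseline' is the current
    judged probability the dentist is competent; 'confidence' is the judged
    probability that the evidence is a true positive. A failure is treated
    as evidence of incompetence carrying the same confidence."""
    if success:
        p_e_c, p_e_not_c = confidence, 1.0 - confidence
    else:
        p_e_c, p_e_not_c = 1.0 - confidence, confidence
    num = p_e_c * baseline
    return num / (num + p_e_not_c * (1.0 - baseline))

def chain(baseline, confidence, outcomes):
    """Bayesian updating: each posterior becomes the next trial's baseline."""
    estimates = []
    for success in outcomes:
        baseline = update(baseline, confidence, success)
        estimates.append(baseline)
    return estimates

def implied_confidence(baseline, judged):
    """Solve the update equation backwards (in odds form) for the confidence
    that would make the judged posterior follow from the known baseline,
    assuming a success was observed."""
    odds = (judged / (1.0 - judged)) * ((1.0 - baseline) / baseline)
    return odds / (1.0 + odds)

# Dr. X pattern from the exercise: failure, success, success.
print([round(p, 3) for p in chain(0.559, 0.75, [False, True, True])])
# → [0.297, 0.559, 0.792]
```

Note the symmetry: with constant confidence, a failure followed by a success returns the estimate exactly to the starting baseline.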
Differences between respondents’ judgments and calculated Bayesian estimates were tested with paired-comparison t-tests. These relationships were also graphed in scatterplots to inspect for linearity and to determine the slope of the regression line relating calculated and judged values. Statistical tests were also performed on correlation coefficients, the slopes of regression lines, and goodness of fit to a normal curve. All analysis was performed on an Excel spreadsheet. The paper-and-pencil exercise, dataset, and Bayesian formulas used are available at [Web site redacted during review].
Results
Table 1 summarizes the major findings of this study. Cumulative probability estimates of the two hypothetical dentists’ capabilities could be taken, in the manner of incidence figures, as the simple ratio of successes to attempts at each observation. Dr. X had one failure followed by two successes (0.00, 0.50, 0.67); Dr. Y had two successes followed by a failure (1.00, 1.00, 0.67). This metric is shown on the top row of the table. Respondents’ estimated scores reveal a more nuanced pattern (second row). Roughly the same temporal pattern was observed for values calculated using Bayes’ formula (third and fourth rows), except that the Bayesian estimates were always more extreme (further from initial global estimates). The Bayesian estimates using updated confidence values were slightly closer to the respondents’ judgments, suggesting that respondents also updated their confidence in prior values to some extent. The paired-comparison t-test values for differences between judged values and calculated Bayesian values were all significant at p < .001.
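The incidence-style figures in the top row are simple running ratios of successes to attempts; a minimal check of the arithmetic:

```python
def running_success_ratio(outcomes):
    """Cumulative successes / attempts after each observed patient."""
    ratios, successes = [], 0
    for attempts, success in enumerate(outcomes, start=1):
        successes += int(success)
        ratios.append(round(successes / attempts, 2))
    return ratios

print(running_success_ratio([False, True, True]))  # Dr. X → [0.0, 0.5, 0.67]
print(running_success_ratio([True, True, False]))  # Dr. Y → [1.0, 1.0, 0.67]
```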
Figure 1 shows the overall updating on successive estimates as well as the conservative nature of the responses. In all cases, respondents’ estimates were between the Bayesian calculation and the initial values attributed to Dr. X (0.559) and Dr. Y (0.519). Respondents “hedged their bets.”
Figure 2 shows the first and third evaluations of Dr. X’s capability based on observation of patients and prior and baseline values. The right-hand panel is what one might expect. Low values calculated with Bayes’ formula generally go with low estimates and vice versa. There was a tendency for respondents to underestimate, as represented by the majority of the data points falling below the diagonal green line of perfect agreement. The slope of the regression line is less than 1.00. If respondents used the baseline and prior information, even inaccurately on average, the slope of the regression of estimate on calculated value should be 1.00. A slope greater than 1.00 would indicate overuse of the available evidence; a slope less than 1.00 indicates that low estimates were not as low as they should be and that high estimates were not as high as they should be. All 8 regression lines in this study had slopes significantly lower than 1.00 at the p < .001 level.
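The slope criterion can be illustrated with a small sketch. The data below are invented to show the conservative pattern described above, in which judged values hug the middle while calculated Bayesian values are more extreme; the ordinary least-squares slope formula is standard.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of judged estimates (ys)
    regressed on calculated Bayesian values (xs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Invented data: calculated values span 0.10-0.90, judged values only 0.30-0.70.
calculated = [0.10, 0.30, 0.50, 0.70, 0.90]
judged     = [0.30, 0.40, 0.50, 0.60, 0.70]
print(round(slope(calculated, judged), 3))  # → 0.5, i.e., well under 1.00
```

A slope of 1.00 would mean estimates track the calculated values point for point; the compressed judged range drags the slope below 1.00, the signature of conservatism.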
The linear r-values for the correlation between reported estimates and calculated Bayesian values can be taken as a measure of respondents’ accuracy. These correlations are all modest, although significantly greater than 0.00 at p < .001. When the gap between reports and Bayesian calculations for X3 and Y3 is calculated using the initial global confidence figure, the gaps are large. Using the priors from the personalized judgment for each dentist, then from the first trial, and then the second, the size of the discrepancy diminishes when updating is permitted. More current estimates of priors contribute to more accurate judgments. It is also obvious that there is a wide range of opinion regarding the capability of general practitioners to perform full-mouth reconstruction, even in the face of concrete evidence.
The left-hand panel of Figure 2 is unusual. A quadratic regression line is a significantly better fit than a linear one. The “inverted U” shape was observed for both cases described as failures (X1 and Y3), while all four positive outcomes were best described by a linear relationship. The negative outcome appeared to confirm initially skeptical respondents’ opinions and to provide compelling contrary evidence for those initially favorable, while those in the middle reserved judgment.
Figure 3 casts further light on the tendency for some respondents to persevere in their initial opinions in the face of contrary evidence. This “dish-shaped” non-linear regression displays responsiveness to evidence on the vertical axis and the global estimate of competency to perform full-mouth reconstructions on the horizontal axis. The vertical axis is the average absolute value of changes from one case to the next as new information is presented. Those taking a non-committal position at the start of the exercise were also less willing to adjust their views given disconfirming evidence.
Further support that individual differences exist in responsiveness to evidence is shown in Figure 4. Average absolute changes in estimated probability are shown on the horizontal axis of this frequency distribution. The display is right-skewed and bimodal; goodness of fit to a normal curve is rejected at the p < .001 level. Some respondents were unprepared to change their initial opinions based on evidence.
Figure 5 displays two unusual relationships involving respondents’ estimates. The left-hand panel shows, on the horizontal axis, the early global estimate that any patient of any dentist who performs full-mouth reconstruction will be successfully treated and, on the vertical axis, the estimate that the first patient seen from Dr. X will be a successful case. This is a “null” association where only baseline data are available and there is no evidence. Logically, the probability that any patient at random is successfully treated and the probability that any patient randomly drawn from a randomly drawn dentist is successfully treated are identical. All data points should line up on the diagonal. Instead, there is large, unexplained random variation.
The right-hand plot displays the Bayesian value for a new baseline as described in the summary of the article in the literature, while the vertical axis shows the estimates made by respondents. Two things are obvious. First, there is no tight clustering around the 0.45 point reported in the literature. This means that these respondents retained much of their previous estimates regarding the capability of general dentists in the face of a description of a hypothetical research paper of supreme rigor. Second, the best-fit line for the relationship between calculated and estimated values is quadratic. In this case, it is a “J-shaped” curve, with a tendency to overestimate the value of the literature among those who have low initial estimates of the probability that general dentists are capable of this work. Respondents were asked to consider both the authoritativeness of the paper and their own ability to comprehend such literature as elements in the “believability” or relevance of the study. The average score was 0.713 (SD = 0.145), where 1.00 represented a definitive anchor point for knowledge and anything below 0.50 was “misleading.”