Construction and Interpretation of ‘Predictiveness’ Diagrams

Statistical methods for diagnostic test evaluation are catching up with those for therapeutic trials. Margaret Pepe and her group have contributed the predictiveness diagram (1). In what follows, the construction and use of this diagram are explained in terms as simple as possible, including the special steps that must be taken when the empirical data come from separate samples of the diseased and the non-diseased (‘two-gate’ studies). Towards the end, some more advanced topics are mentioned.

The clinical situation envisaged is that of a stream of cases (or screenees) that may or may not be suffering from disease D. One wants to characterize the diagnostic power of a quantitative diagnostic test or a diagnostic score calculated from several clinical variables. The variable is taken to be continuous; if it is not, a few fairly obvious, minor modifications are needed in the present text to allow the user to handle tied observations.

The square region in the plane within which both coordinates (x, y) are between zero and 1.00 is called the unit square. The predictiveness diagram uses the unit square to represent the entire population of persons that undergo testing (its area = 1 = 100%). The predictiveness curve is based on sorting the members of the population by their ‘post-test’ disease risks: given a point on the curve with ordinate y = a posttest risk, the associated x shows the fraction or percentage of the population whose risks are less than that; so x is a cumulative fraction, based on increasing risks.

One may note that the diagram thus shows the cumulative distribution function (cdf) of the posttest risk with the axes interchanged relative to usual plotting.
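For readers who want to construct the curve from data, the recipe is just a sort followed by a cumulative count. A minimal sketch in Python, using invented posttest risks (the mid-slot plotting convention anticipates equation (5) below):

import numpy as np

# Invented posttest risks for ten tested subjects.
risks = np.array([0.02, 0.05, 0.08, 0.10, 0.15, 0.22, 0.35, 0.60, 0.80, 0.90])

# Ordinates: the risks in increasing order.
y = np.sort(risks)

# Abscissas: cumulative population fractions (each subject plotted
# midway in his/her slot of width 1/N).
N = len(y)
x = (np.arange(N) + 0.5) / N

for xi, yi in zip(x, y):
    print(f"x = {xi:.2f}   posttest risk = {yi:.2f}")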

Figure. Interpretation of predictiveness diagrams.

The pretest risk of disease here equals 0.2. First ignore the dashed lines.

The mark on the vertical dividing line shows that when those who have posttest {posterior} risks > 0.15 are selected for treatment (regarded as ‘positive’), then 72% will be negative, 28% positive. The areas indicate the resulting percentage true-positive, etc. Sundry calculations: The sensitivity becomes 0.184/0.2 = 92%, the specificity 0.704/0.8 = 88%. The predictive value of a positive result (PVpos) is always > 15% (on average it amounts to 0.184/0.28 = 66%). And PVneg is always > 85% (average PVneg = 0.704/0.72 = 97.8%).

Dashed reference curves: a completely uninformative test invariably reproduces the pretest {prior} risk (horizontal line); a perfectly discriminating test (all subjects diagnosable without error) would produce the step curve.

Suppose the curve passes through (x, y) = (0.72, 0.15), as in the figure. If you choose a posttest risk threshold of 0.15, then the diagram tells you that 72% of those tested have a risk below 0.15; the remaining 28% will have higher risks. Conversely, if you can afford to treat only 28% of the population, the diagram shows that you should treat those with a risk above 0.15; the 72% you cannot afford to treat will have smaller risks – though some may clearly have risk levels pretty close to 0.15.

The predictiveness curve runs across the unit square from a left-most point (x = 0, y = smallest risk in population) and reaches the right-hand wall at (x = 1, y = largest risk). It thereby divides the unit square into an upper and a lower region, the areas of which have a simple interpretation: The area of the region below the predictiveness curve is the mean ordinate value and thus portrays the mean posttest probability of disease. The area therefore necessarily equals the corresponding pretest probability (by virtue of the general theorem: mean of conditional mean = unconditional mean). Similarly, the area of the region above the curve is the pretest probability of non-disease (which is 80% in the figure). In fact, by virtue of the way the abscissa, x, is defined, it is ensured that each member of the population contributes equally to the total area; so the area of any subregion represents a fraction of the population – and hence a probability statement concerning a randomly chosen member thereof. In particular, the probabilities that the next case will be true (false) positive (negative) are visualized: the four regions demarcated by the curve and the vertical dividing line through the cut point portray these four probabilities; see figure. E.g., the region below the curve and right of the cut has an area of 0.184, which is the probability that the next case will be true positive. (As just mentioned, the region below the curve represents the diseased, and the vertical dividing line splits that region into a false-negative contingent and a true-positive contingent, the size of each contingent being the area of the region concerned.)
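To make the area bookkeeping concrete, the following Python sketch (continuing the invented risks above; the numbers are not those of the figure) verifies that the area below the curve equals the pretest prevalence and computes the four regions for a chosen risk cut:

import numpy as np

risks = np.array([0.02, 0.05, 0.08, 0.10, 0.15, 0.22, 0.35, 0.60, 0.80, 0.90])
N = len(risks)
cut = 0.15

# Area below the curve = mean posttest risk = pretest prevalence.
pretest = risks.mean()

positive = risks > cut                    # selected for treatment
TP = risks[positive].sum() / N            # below curve, right of cut
FP = (1 - risks[positive]).sum() / N      # above curve, right of cut
FN = risks[~positive].sum() / N           # below curve, left of cut
TN = (1 - risks[~positive]).sum() / N     # above curve, left of cut

sensitivity = TP / (TP + FN)              # note TP + FN = pretest
specificity = TN / (TN + FP)
PVpos = TP / (TP + FP)                    # averaged predictive values
PVneg = TN / (TN + FN)
print(pretest, sensitivity, specificity, PVpos, PVneg)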

The legend to the figure gives the resulting sensitivity and specificity. As regards predictive values, note that the posttest risk is itself the pertinent predictive value, conditional as it is on the exact test result (or diagnostic score) obtained in any given case. However, for what they are worth, averaged predictive values based on the chosen dichotomization can also be calculated as indicated in the legend.

Included in the figure are, for comparison, the predictiveness curves of an uninformative test and a perfect test, both based on the pretest risk of 0.2. (Theoretically, any non-decreasing curve with area below it = 0.2 might arise.)

For other aspects of predictiveness diagrams, see the original article by Pepe et al.

Predictiveness curves based on specified pretest prevalences

Suppose your dataset comes from a ‘two-gate’ study (separate-sample design), and the disease prevalence in real clinical clienteles differs from the artificial (investigator-chosen) composition of the dataset. When a logistic model or another statistical model is fitted to the given dataset, it produces, for each subject, a predicted posttest probability, P, obtained by fitting the dataset as it is; the estimate therefore reflects an artificial situation. To remedy that, one may proceed as follows. The critical assumption is, of course, that the diseased and the non-diseased samples are each representative of the real populations of those with and without the disease.

For a moment let us stick to the logistic model. Here, P is defined via the corresponding log-odds value, S:

P = exp(S) / (1 + exp(S)) , S = ln(P / (1 – P)), (1)

P/(1 – P) = exp(S) being the posttest odds of disease. (Software normally allows the user to extract each individual subject’s P or S, so no separate program need be written for that purpose.)
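As a trivial illustration, the two directions of (1) in Python:

import numpy as np

def risk_from_logodds(S):
    # P = exp(S) / (1 + exp(S)), cf. (1)
    return np.exp(S) / (1.0 + np.exp(S))

def logodds_from_risk(P):
    # S = ln(P / (1 - P)), the inverse of (1)
    return np.log(P / (1.0 - P))

P = risk_from_logodds(1.2)       # an invented log-odds value
print(P, logodds_from_risk(P))   # recovers 1.2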

Suppose the dataset contains n patients with disease D and m non-D cases, so that the (artificial) pretest odds equal n/m. Consider a new subject being tested. By the rule that (posttest odds) = (pretest odds)∙L, where L stands for the likelihood ratio occasioned by the subject’s test results, we know that L must satisfy the following equation:

P/(1 – P) = (n/m)∙L . (2a)

In passing, one may note that, if L > 1 (< 1), the test result speaks for D (against D). If L = 1, the data have made D neither more nor less probable than it was a priori; incidentally, this corresponds to the shoulder of the ROC curve where the local slope is 1.00 (45° angle).

Now suppose you know from separate sources that the pretest prevalence is p'. That changes the subject’s posttest risk to P', the relation now being:

P'/(1 – P') = (p'/(1 – p'))∙L . (2b)

To determine what P' level the subject’s test result would imply were the pretest prevalence p', we may isolate the likelihood ratio L implicit in (2a) and insert it in (2b). The first step in constructing a p'-based predictiveness diagram is therefore this: for each of the m+n records, a likelihood ratio L is calculated using the score S from the logistic model,

L = exp(S) / (n/m) . (2c)
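In code, (2c) is a one-liner applied to each record’s fitted S. A sketch with invented counts and log-odds values:

import numpy as np

n, m = 40, 160                          # diseased / non-diseased counts (invented)
S = np.array([-2.1, -0.4, 0.8, 1.5])    # fitted log-odds of four records (invented)

# Equation (2c): strip the artificial pretest odds n/m out of exp(S).
L = np.exp(S) / (n / m)
print(L)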

Let q' = (1 – p') denote the pretest probability of non-D. From (2b) and (2c), the adjusted posttest odds can be calculated for each of the m+n records:

P'/(1 – P') = (p'/q')∙L = (p'/q')∙exp(S) / (n/m) , (3)

or, when solved for the adjusted risk of disease:

P' = p'L / (p'L + q') . (4)

This defines the subject’s ordinate in the prevalence-adjusted diagram we are constructing. Incidentally, it all boils down to adding a fixed adjustment, ln((p'/q')/(n/m)), to the logistic S, as can be seen by comparing the logarithm of (3) with (1); in other words, the ‘intercept’ term of the model output changes by that amount.
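Continuing the sketch above, equations (3)-(4) and the intercept-shift shortcut can be checked numerically:

import numpy as np

n, m = 40, 160
S = np.array([-2.1, -0.4, 0.8, 1.5])
L = np.exp(S) / (n / m)                  # equation (2c)

p_adj = 0.2                              # the specified pretest prevalence p'
q_adj = 1.0 - p_adj                      # q' = 1 - p'

# Equation (4): adjusted posttest risk P'.
P_adj = p_adj * L / (p_adj * L + q_adj)

# The shortcut: add ln((p'/q')/(n/m)) to the logistic S.
shift = np.log((p_adj / q_adj) / (n / m))
P_check = np.exp(S + shift) / (1.0 + np.exp(S + shift))

print(np.allclose(P_adj, P_check))       # True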

The same steps (2a)-(4) are applicable with models other than the logistic: simply ignore the formulae that involve S and obtain each subject’s L directly from (2a), using the model’s fitted P.

The corresponding horizontal position in the adjusted diagram (call it x') is calculated as the cumulative fraction with (a lower S value and hence) a lower P' value, adjusting individual contributions so as to produce the chosen pretest prevalence. This simple operation looks complicated when formally expressed: Along the horizontal axis, each non-D or D subject is allotted a slot of width [q'/m] or [p'/n], respectively. Thus, the total allotments become m∙[q'/m] + n∙[p'/n] = q' + p' = 1, as intended. Let A be the number of non-D cases with a lower value than the current subject, and B the analogous number of cases of D. When the current subject’s diagnosis is scored as D = 0 or 1, respectively,

x' = (A + (1 – D)/2)∙[q' / m] + (B + D/2)∙[p' / n] . (5)

The prevalence adjustment is brought about by the factors in square brackets (in an unadjusted, hence probably not so useful, predictiveness diagram these factors would all be [1/(m+n)]). The small shifts, (1 – D)/2 and D/2, simply serve to position a subject’s plotting mark midway (centrally) in his/her slot.
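A direct transcription of equation (5), again a sketch with invented data (continuous scores, so no ties):

import numpy as np

def adjusted_abscissas(P_adj, D, p_prev):
    # Mid-slot abscissas x' of equation (5).
    # P_adj: adjusted posttest risks P'; D: 0/1 disease status; p_prev: p'.
    q_prev = 1.0 - p_prev
    n = int(D.sum())                     # diseased records
    m = len(D) - n                       # non-diseased records
    order = np.argsort(P_adj)
    P_sorted, D_sorted = P_adj[order], D[order]
    x = np.empty(len(D))
    A = B = 0                            # non-D / D records with lower P'
    for i, d in enumerate(D_sorted):
        x[i] = (A + (1 - d) / 2) * (q_prev / m) + (B + d / 2) * (p_prev / n)
        A += 1 - d
        B += d
    return P_sorted, x

P_adj = np.array([0.05, 0.30, 0.12, 0.60])   # invented
D = np.array([0, 1, 0, 1])
print(adjusted_abscissas(P_adj, D, p_prev=0.2))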

When m or n is small, an academically more correct step curve should be used. In particular, the area rules above apply to the step-curve version. The current subject’s slot extends from

A∙[q' / m] + B∙[p' / n]   to   (A + (1 – D))∙[q' / m] + (B + D)∙[p' / n] ; (6)

cf. expression (5), which represents the mid-slot abscissa. At the left and right edges of the current subject’s slot, the ordinate jumps from the preceding P' to the current subject’s P', and from that P' to the next higher P' in the dataset, respectively. A step curve results.
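The slot end-points of expression (6) follow from the same counters; a sketch, assuming the records have already been sorted on P' as above:

import numpy as np

def slot_edges(D_sorted, p_prev):
    # Left and right slot edges per record, expression (6);
    # the step curve is the constant ordinate P' over each slot.
    q_prev = 1.0 - p_prev
    n = int(D_sorted.sum())
    m = len(D_sorted) - n
    left = np.empty(len(D_sorted))
    right = np.empty(len(D_sorted))
    A = B = 0
    for i, d in enumerate(D_sorted):
        left[i] = A * (q_prev / m) + B * (p_prev / n)
        A += 1 - d                       # advancing the counters gives
        B += d                           # the right-hand edge directly
        right[i] = A * (q_prev / m) + B * (p_prev / n)
    return left, right

print(slot_edges(np.array([0, 0, 1, 1]), p_prev=0.2))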

Tests and confidence limits

Much logistic software allows confidence limits for the individual S, and hence for P or P', to be computed. These limits may, of course, be added to the individual points. In principle, a distinction should be made here between limits that take p' as known (random uncertainty in the estimation of L only) and limits that also include the estimation error in p'; in practice, the difference will hardly be noticeable. Supplying confidence limits for a point on the curve as such (say, around the posttest risk of 0.15 given abscissa 0.72, as in the figure, and vice versa) is a much tougher question, possibly addressable by means of a bootstrapping procedure.

Returning to the individual, e.g., upper, confidence limits for estimated P' values, their distribution may be inspected. In particular, when they are plotted the way the variable P' was plotted, one gets what may be called a benefit-of-doubt predictiveness diagram. The interpretation of a point (x', y) on this curve may read as follows: In consideration of the statistical uncertainty inherent in concluding from a dataset of the limited size available today, one will have to treat an estimated fraction (1 – x') of new cases if one wants to treat all patients who have a disease risk that cannot at present confidently be declared to be < y. This offers a benefit of doubt to those patients whose test results are atypical and whose P' estimation is therefore less certain than in other cases. (For the fraction (1 – x') itself no confidence warranty is given: we are inspecting the raw distribution of individual upper confidence limits and cannot provide any ‘second-order’ confidence statements, except possibly via some bootstrap route.)
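If the software supplies an upper confidence limit for each subject’s P', the benefit-of-doubt diagram is obtained by running the same plotting routine on those limits. A sketch under that assumption (invented values; equal weights shown for brevity, whereas in a two-gate setting the slots of equation (5) should be used):

import numpy as np

U = np.array([0.11, 0.20, 0.45, 0.75])   # invented upper confidence limits for P'

# Plot the upper limits exactly the way P' was plotted.
y = np.sort(U)
x = (np.arange(len(y)) + 0.5) / len(y)

# Fraction (1 - x') that must be treated if no one whose risk cannot
# confidently be declared < 0.15 is to be left untreated.
print((U > 0.15).mean())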

(1) Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM et al. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol 2008;167:362-8.