Comment November 2009 by Jørgen Hilden

“Criteria for Evaluation of Novel Risk Markers of Cardiovascular Risk. A Scientific Statement From the American Heart Association,”

by a panel under the AHA, chaired by Mark A. Hlatky & Philip Greenland.

Circulation 2009; 119: 2408-2416.

This Statement from a prominent expert panel under the American Heart Association has undoubtedly triggered a debate in many circles. My own inspiration stems from the study group responsible for the prognostic follow-up of the nearly 4400 participants in the CLARICOR Trial. Further stimuli have come from internal discussions within my department, notably with Thomas Gerds.

The task undertaken by the Hlatky panel is important, and the text makes all the good points expected and offers an excellent overview. It is in the details that one finds elements which, if canonized by the journals, as is no doubt the intention, will create a dubious precedent, will frustrate serious investigators and may ultimately hamper progress.

The matter is not without urgency. Recently we have experienced that editors and reviewers of manuscripts adopt the ideas and phraseology of the Statement without realizing that some of its proposals have been too rashly endorsed, are too sketchy to serve as a guideline, or are only partly applicable in the early research phases (cf. Statement, Table 3, “Phases of Evaluation of a Novel Risk Marker”).

In the points that follow, I have taken pains to ensure that each item can be read independently. So the reader may jump around depending on his or her interest.

Clinical matters

[A] The psychology of prognostic counselling. In the Statement a scientifically correct risk assessment is the aim. Questions of fear and hope, trust and distrust, are not mentioned. Bypassed are even physical reactions to receiving an updated risk verdict (self-treatment, lifestyle decisions, etc.), although their existence is one of the reasons why risk assessment is more than a discretized matter of threshold crossing.

[B] Weighing benefit against loss. As explained in the Statement, clinical benefits accrue when improved predictions enable physicians to refine the sieves that guide patients to individualized treatment and counselling. The ensuing clinical benefits must be weighed against costs and human losses. In other words, marker evaluations necessarily have a decision-analytic flavour. A decision-analytic approach was also insisted on in Sander Greenland’s commentary (1) on the paper in which many of the Statement’s key ideas were first aired (2). Eventually, the Statement does cover this kind of thinking, esp. in the sections on “Cost-Efficiency of Novel Risk Markers” and “Phase 6” research, but I would much have preferred decision theory to be integrated into the preceding general sections. Amongst other things, this might have led to a downplaying of the roles of the IDI (integrated discrimination improvement) and c statistics.

[C] More is not necessarily better. In the section on “Additional Practical Issues” it is pointed out that a new marker may offer practical advantages that will allow it to oust one or more markers in current use, even if prognostic information may thereby be lost. However, the subsequent, more statistical, sections seem to take it for granted that, in diagnostics and prognostics, more is better: what is gained by introducing a new marker into your risk assessments cannot be negative. From a certain theoretical point of view this is true: you cannot lose anything by collecting more data. In statistical terms, you gain information unless the outcome and the new marker are conditionally independent given the existing markers. In that case the information gain is zero. It is never negative.

In the real world, things are not that simple: more data may mean less information – or added ignorance, if you prefer. Confusion increases when doctors are without secure background information for interpretation of a new test. Biased background information makes things worse. Add the effects of cognitive overloading. For such reasons, using more markers may contribute negatively to healthcare even if they are painless and free.

[D] Risk is management-dependent. The Statement is about ‘Prognosis’ and its daughter concept named ‘Risk.’ A prognosis is a prediction of a patient diary; a risk is a summary thereof, as exemplified by a probability statement about the occurrence of a stroke within the next year. These are strangely passive concepts here in the sense that their dependence on active intervention is not acknowledged (except in a passage on procedural risk). The text does speak of intervening to reduce the risk of high-risk patients. But otherwise its main framework leaves the fact unheeded that a prognosis is really a ‘bundle’ of predicted diaries, one for (i.e., conditional on) each possible course of management and treatment ((3), section 2.). It is as if there is only one possible course of action, which can then only be ‘wait and see’ or ‘treatment as usual.’ Modern interventional cardiology calls for more than that. A difficult aspect perhaps, but why leave the difficulty unmentioned?

[E] Therapy-guiding markers. Prognosis and risk cannot be divorced from therapeutic options. At one point the Statement authors mention a different kind of (possibly genetic) marker that does not affect the risk assessment as such but tells the cardiologist whether a particular intervention is likely to be successful or to fail. Such a marker would have to be evaluated by methods parts of which are not covered by the present text. No criticism! But the text appears blind to the fact that any marker of the kind it does cover may possess a bit of this therapy-guiding feature (as may be unravelled by a randomized trial) on top of its serving as a risk refiner. Imagine a binary marker which, when ‘high,’ signals both an increased risk of arrhythmia and an increased sensitivity to preventive administration of a particular type of antiarrhythmic (thus lowering the “treatment threshold” of Fig. 2A). Here the marker influences not only the risk but also the treatment threshold with which the risk is going to be compared.

In sum, the Statement makes a tacit independence assumption: markers do not affect treatment choice except by shifting the risk and thereby sometimes tipping the risk-cost balance.

Statistics

[F] Population-attributable risk. In the otherwise very appropriate passage on this topic I find that one assumption is swept under the carpet, viz. that the lowest-risk level of the marker, or marker-based risk score, must offer a realistically attainable low risk. If just 1/1000 of the population has an extremely low risk it would not make sense to use that level as baseline for a PAR: one cannot prescribe a risk-lowering measure to 99.9% of the population.
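The dependence of the PAR on the chosen baseline can be illustrated with a small sketch. The three risk strata and their sizes below are purely hypothetical numbers invented for illustration, and the PAR is taken in its simplest form, (population risk minus baseline risk) divided by population risk.

```python
# Hypothetical population: (fraction of population, event risk in that stratum).
# These figures are invented for illustration only.
strata = [
    (0.001, 0.001),  # tiny group with an extremely low risk
    (0.499, 0.050),  # realistically attainable low-risk group
    (0.500, 0.150),  # high-risk group
]

def population_risk(strata):
    """Overall event risk as the size-weighted mean of stratum risks."""
    return sum(fraction * risk for fraction, risk in strata)

def par(strata, baseline_risk):
    """Population-attributable risk fraction relative to a chosen baseline."""
    overall = population_risk(strata)
    return (overall - baseline_risk) / overall

par_extreme = par(strata, baseline_risk=0.001)    # baseline held by 1/1000 only
par_realistic = par(strata, baseline_risk=0.050)  # baseline held by about half

print(f"PAR with extreme baseline:   {par_extreme:.3f}")
print(f"PAR with realistic baseline: {par_realistic:.3f}")
```

With the extreme baseline nearly all events appear ‘attributable,’ although no intervention could plausibly move 99.9% of the population to that level.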

[G] The c-index – operational? The ROC (receiver operating characteristic) is an indispensable tool in the evaluation of diagnostic tests, and the AuROC (area under the ROC), which is the equivalent of Harrell’s c-statistic, is a popular summary measure. An investigator who plans to follow the Statement’s strong recommendation of this family of statistics will discover that three issues are bypassed:

(i) Most prognostic datasets involve censoring. According to my colleague, Thomas Gerds, who is working hard on the censoring problem, it is still not completely solved (e.g., (4)), and the appropriate software is still in the making.

(ii) Which version of the c-statistic should be used? Dichotomized time, continuous risk; or dichotomized risk, continuous time; or both continuous (as briefly outlined on p. 2411)?

(iii) With dichotomization, how should the cutpoint or cutpoints be chosen; and if several cutpoints are adopted, how should the resulting c-statistics be combined?
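To make issues (i) and (ii) concrete, here is a minimal sketch of one of the possible versions: a Harrell-type concordance with both risk and time kept continuous. The function name and the toy data are assumptions made for illustration. Censoring is handled in the naive way, by using only those pairs in which the earlier time is an observed event, which is exactly why the result depends on the censoring pattern.

```python
def harrell_c(times, events, risks):
    """Harrell-type concordance for right-censored data (naive pair counting).

    times  : observed follow-up times
    events : 1 if the event was observed at that time, 0 if censored
    risks  : predicted risks (higher = earlier event expected)
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable only when subject i is known to fail before j:
            # i has an observed event and j is still under observation later.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count one half
    return concordant / usable if usable else float("nan")

# Toy data (hypothetical): five patients, two of them censored.
times = [2, 4, 6, 8, 10]
risks = [0.9, 0.4, 0.6, 0.3, 0.1]
c_censored = harrell_c(times, [1, 1, 0, 1, 0], risks)

# Same times and risks, but pretending every time is an observed event.
c_complete = harrell_c(times, [1, 1, 1, 1, 1], risks)

print(c_censored, c_complete)
```

The two calls give different values for identical predictions and times, because censoring removes some pairs from the comparison. The dichotomized versions mentioned under (ii) and (iii) would require further choices that this sketch does not attempt.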

[H] The c-index – a rational choice? The user of a prognostic statement is obviously looking forward in time. The risks involved are forward conditional probabilities, describing future outcome conditionally on present marker data. Aren’t we using the wrong (backward) direction of the conditionalization when we construct an ROC that compares marker distributions conditionally on short vs. long survival and use that for a c calculation? Sander Greenland (1) has all the good arguments against outcome-conditional analyses. Add to this the decision-theoretic shortcomings of the AuROC (e.g., (5)), which carry over to c.

Overall, the c-index does not, I fear, give an interpretable answer to the question one would like it to answer. In fact, if it did, the authors of the Statement would no doubt have included one or more examples in which a c-value is interpreted in a clinical context.

[I] On sensitivity to change. The Statement has one piece of warning, though: “the c-index is relatively insensitive to change.” Well, by what standard? My own calculations indicate that any rational standard (in my universe it must be a decision-analytic measure of gain and loss) will confirm that new markers offer very little clinical gain when added to an already well-performing prognostic rule. I.e., although the c-index is not of the decision-theoretic family, it does behave correctly in this respect. The fault is not, it seems, with the c-index or its decision-oriented rivals but with the users, who have unrealistic expectations: ‘when the new marker produces a hazard ratio of 2.5 and a P-value < 0.001, why doesn’t it make our prognostic assessments considerably sharper than before?’

It is fairly easy to demonstrate why it is so. The situation is governed by Pythagorean addition of information sources. The hypotenuse, representing the new prognostic rule, is never much longer than the longer cathetus that represents the old rule. (In a Gaussian case this is more than a metaphor: On a risk-determining scale, suppose Δ and δ are the means of the old marker score and the new marker, respectively, and σ and τ the associated SDs. Assume for simplicity that old and new markers are independent. The information-governing ratio, Δ/σ, then changes to √[(Δ/σ)² + (δ/τ)²].) The small wave created by the new marker – even if it is much more than a ripple – drowns in the random billows of prior variation.

Incidentally, counting threshold crossings (e.g., (2)) cannot circumvent this problem: the small wave is simply invisible, and discretization can only add noise. It would be nice if it weren’t so, but Monte Carlo studies confirm that it is.
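The Pythagorean addition can be checked numerically. The sketch below uses illustrative values (old ratio 2.0, new ratio 1.0) that are not taken from any dataset. The translation into an AuROC additionally assumes an equal-variance binormal model in which the ratio is read as the standardized separation between patients with and without the event; that reading is an assumption of this sketch.

```python
import math

def combined_ratio(old_ratio, new_ratio):
    """Pythagorean addition of two independent information sources."""
    return math.hypot(old_ratio, new_ratio)

def binormal_auc(ratio):
    """AuROC under an equal-variance binormal model: Phi(ratio / sqrt(2)).

    Uses Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), so Phi(ratio / sqrt(2))
    becomes 0.5 * (1 + erf(ratio / 2)).
    """
    return 0.5 * (1.0 + math.erf(ratio / 2.0))

old = 2.0   # Delta / sigma for the established rule (illustrative)
new = 1.0   # delta / tau for the new marker (illustrative)

both = combined_ratio(old, new)          # sqrt(2.0**2 + 1.0**2) = sqrt(5)
auc_old = binormal_auc(old)
auc_new_alone = binormal_auc(new)
auc_both = binormal_auc(both)

print(f"ratio: {old:.3f} -> {both:.3f}")
print(f"AuROC, new marker alone: {auc_new_alone:.3f}")
print(f"AuROC, old rule -> old rule plus new marker: {auc_old:.3f} -> {auc_both:.3f}")
```

A marker that is respectable on its own (ratio 1.0, half that of the old rule) lengthens the hypotenuse by only about 12%, and the AuROC moves by roughly two percentage points.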

[J] Valuation of risk shifts. To examine the essence of the ‘gain-by-shift’ idea (see Table 2 and Figure 3), assume for the moment that we have a new, inexpensive marker which is exclusively risk-modifying in the sense that it does not affect treatment choice except to the extent that its value may shift risk levels up or down (in the context cited, markers are meant to satisfy this condition; see [E]). Then the investigator will have to ask: what have we gained by observing that the risk changed from p to q, considering that the event concerned did not (or, did in fact) materialize? If you allow me to be a bit more formal for a moment, the decisive measure of prognostic improvement must rest on the average over the clientele of a patient-wise ‘reward function’ of the general form,

U(p→q, (entry data Α, outcome Ω)),

designed to capture the (positive or negative) advantage one would attach to observing that, with addition of the new marker, a patient’s risk changed from p to q, the patient’s entry data and outcome, (Α, Ω), being taken into account as appropriate. Why not list the desiderata that the four-place function U(p, q, Α, Ω) should preferably possess? Why not grasp the nettle?

Or, by the way, do we have a gardener’s glove? Whether the p and the q systems are well calibrated or not, the use of proper scoring rules has much to recommend it. In that case, the gain is simply the increase in the score quantity from p to q, averaged over patients. Unfortunately, proper scoring rules for continuous outcomes, such as survival time, are a fairly difficult subject (6), and, as with the c-index, censoring adds further problems.

The key properties of proper scoring rules are two: (1) Dishonesty does not pay, only honest assessment of all available data does. (2) Contestants cannot beat their rivals by exploiting knowledge of the machinery by which their performance is going to be scored. The IDI, on the other hand, is improper in this technical sense. Even without a new marker a cardiologist can cash a maximally positive IDI reward by the following simple stratagem. It only assumes that he knows the approximate marginal frequency, α, of the event concerned. Suppose the established, well-calibrated system declares a given patient’s risk to be p. With p available to him but no new data, the cardiologist cleverly chooses his q in an extreme manner: q = 1 if p > α, q = 0 if p < α. (If he isn’t sure which of p and α is greater, he may play safe by setting q = p.) In other words, it pays to show maximal overconfidence.

By the way, Pencina et al. (2) proposed a related statistic called the NRI (net reclassification improvement). It appears to have been given up by the Statement panel. Here the same stratagem is applicable. Admittedly, the stratagem capitalizes on the IDI and sacrifices calibration.*
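The stratagem can be tried out in a small simulation. This is only a sketch under assumed conditions: the well-calibrated risks p are drawn from a Beta(2, 8) distribution chosen arbitrarily, outcomes are generated from p, α is taken as the observed event frequency, and the IDI is computed as the difference in discrimination slopes (mean predicted risk among events minus mean among non-events). The Brier score is used as one example of a proper scoring rule, in its loss form where lower is better.

```python
import random

random.seed(2009)          # fixed seed so the run is reproducible
N = 20000

# Well-calibrated old risks p and outcomes generated from them (simulated).
p = [random.betavariate(2, 8) for _ in range(N)]
y = [1 if random.random() < pi else 0 for pi in p]
alpha = sum(y) / N         # marginal event frequency, assumed known

def overconfident(pi, alpha):
    """The stratagem: push every risk to 0 or 1, using no new data."""
    if pi > alpha:
        return 1.0
    if pi < alpha:
        return 0.0
    return pi              # tie: play safe and keep p

q = [overconfident(pi, alpha) for pi in p]

def discrimination_slope(pred, outcome):
    """Mean prediction among events minus mean prediction among non-events."""
    ev = [r for r, o in zip(pred, outcome) if o == 1]
    ne = [r for r, o in zip(pred, outcome) if o == 0]
    return sum(ev) / len(ev) - sum(ne) / len(ne)

def brier(pred, outcome):
    """Brier score: mean squared error of the predicted risk (lower is better)."""
    return sum((r - o) ** 2 for r, o in zip(pred, outcome)) / len(outcome)

idi = discrimination_slope(q, y) - discrimination_slope(p, y)
brier_p = brier(p, y)
brier_q = brier(q, y)

print(f"IDI of q over p: {idi:+.3f}")
print(f"Brier score: p = {brier_p:.3f}, q = {brier_q:.3f}")
```

In this setup the IDI rewards the overconfident q although it contains no new information, whereas the Brier score penalizes it, which is the behaviour a proper scoring rule is meant to guarantee.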

Pedagogical matters

[K] On the orderly progression of scientific questions. Instead of criticizing scattered sentences that fail to be clear, at least to me, let me say that I feel that the logical progression of questions underlying the section on “Measures of Risk Prediction” does not become clear. After dealing with the first step (significance testing), the text proceeds to effect estimation, etc., and ultimately arrives at measures of clinical benefit and cost-efficiency. I would have liked the text to highlight the compelling logic of the underlying sequence of questions: Can we convincingly show that the marker’s prognostic content isn’t zero? How great is it then? Is it large enough to be useful in practice? Does it provide value for money? Instead, transitions use phrases like “another common measure,” as if significance testing, effect estimation, etc., were just a fan of tools offered on an equal footing. Fortunately, Table 1 and later sections help the novice perceive this structure.

At each step of the progression, the text tends to direct some critical remarks at specific statistics or phrases, such as “the likelihood ratio partial χ2,” when in fact any similar statistics, regardless of how the software-makers have baptized them, must of necessity share the shortcomings concerned because they all address the same question and not a different question. The reader should understand that each statistic sees only its own side of marker performance, but there are families of statistics that see essentially the same side.

[L] The percentages given in Table 2 were borrowed from Nancy Cook’s comment on the Pencina paper (2). They look reasonable to me, but strictly speaking the corresponding estimated expectations should have been given and any discrepancies commented on. Otherwise, novice readers will hardly be able to appreciate the figures or the overall conclusion that the risk levels obtained by introducing high-density lipoproteins into the model were more accurate. The 2 times 3 marginal percentages would also be natural to make available. In fact, better use could have been made of the wealth of ideas brought to the fore in the comments on (2). Not even the penetrating commentary by Sander Greenland (1) is cited.

A philosophical point

[M] The risk concept. If you examine the wording of the Statement closely or, indeed, the preceding text, you may discover that in many places we vacillate between two views: Is risk located in the doctor’s head or in the patient’s heart? Does the probability we call risk exist in the patient’s body and daily life regardless of whether any doctor is assessing it? Or is it created in the doctor’s head in the process of assessing it? In particular, what is it that changes with a lab report: the objective in-heart or the observer’s in-head risk?

Notably the phrase “Risk Prediction,” which the Statement uses in several places, suggests that the risk exists prior to its being assessed (predicted). According to the alternative view, ‘risk prediction’ does not mean ‘prediction of a risk’ but simply ‘prediction phrased in risk terms.’

Once this philosophical issue catches one’s eye, it crops up time and time again. My speculations have convinced me that the in-head conception is the less contradictory choice. However, we have to live with both views, on guard against the ambiguities that that entails.

References

(1) Greenland S., The need for reorientation toward cost-effective prediction: Comments on [ref. (2) below]. Statistics in Medicine 2008; 27: 199-206.

(2) Pencina M.J., D’Agostino R.B. sr., D’Agostino R.B. jr., Vasan R.S., Evaluation of the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine 2008; 27: 157-172; comments pp. 173-206.

(3) Hilden J, Habbema JDF, Prognosis in medicine – an analysis of its meaning and roles. Theoretical Medicine … …

(4) Heagerty P.J., Lumley T., Pepe M.S., Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 2000; 56: 337-344.

(5) Hilden J., The area under the ROC and its competitors. Medical Decision Making 1991; 11: 95-105.

(6) Hilden J., Scoring rules for evaluation of prognosticians and prognostic rules (workshop Odense University, 1999, revised 2009):

*passage corrected 22MAR2010