Supplemental document to “Why your new cancer biomarker may never work”:

Flaws to anticipate and recognize in biomarker development

Scott E Kern

Introduction to supplemental document.

The parent document lists nine succinct categories of flaws observed in biomarker studies and the major features of each category. How these flaws emerge and why they are significant may not be intuitively obvious. The statistical nature of biomarker discovery can be somewhat obscure, or even seem counter-intuitive, to investigators trained in conventional hypothesis-testing and in the descriptive exploration of low-dimensional data. This supplemental document fleshes out the reasoning needed to understand common patterns of biomarker failure. Illustrative examples are presented from the biomarker literature. The purpose is not to compare technical approaches so as to establish which may be better or worse, powerful or underpowered, flexible or inflexible. The purpose instead is to convey an intuitive understanding of the major failure categories, an understanding that is largely agnostic as to the technical alternatives.

To acknowledge fully the high likelihood that any given marker will fail, the reader may benefit from realizing that there is, a priori, more than a 99% chance that his/her own favorite new marker will eventually fail.

1. Markers providing a valid disease classification may fail to be implemented due to lack of clinical utility.

Your marker will not change the process of decision-making in the clinic

When a biomarker is used to classify a patient, the effort is an attempt to predict another feature, such as presence of disease for diagnostic markers, progress of disease for prognostic markers, and response of the disease for treatment markers. Yet, even when a marker provides a valid disease classification suggesting potential improvements in diagnosis, prognostication, or advising on optimal treatments, it may still lack clinical utility (1).

It is possible that no realistic alternatives exist among the list of choices that the clinicians have available, even if the patient’s diagnostic subtype or prognosis were reclassified by the biomarker. In this situation, the results of the biomarker assay cannot cause a change in clinical practices even were its performance characteristics superb.

On the other hand, the marker may make distinctions that, in clinical terms, are too small in magnitude to convey any benefit, even were the patient care amenable to change. There is a common misconception that identifying a statistically significant difference is tantamount to identifying a meaningful change. This idea is incorrect. Perhaps this misconception arises from a comparison to research into therapeutic improvements (1), where even a small change is valued, provided that the change is real.

The most familiar statistical tests examine whether an association of a marker and a clinical characteristic is present. Specifically, the p value of common tests indicates whether a similar difference could have emerged readily among randomly generated data patterns; it does not judge whether the association has adequate robustness or magnitude to be valuable. In other words, even nonrandom patterns can be clinically useless.

A common example of the “magnitude problem” is seen when investigators try to interpret a Student’s t-test that has produced a statistically significant p value. The t-test examines whether the mean assay values of two populations differ enough that random patterns are an unlikely explanation. Seldom, however, is the mean value for a population of any importance for using a biomarker in making choices for the appropriate care of an individual patient. This is because the assay values for the individuals within the two populations may overlap greatly. If a dot plot were used to display the data descriptively, instead of using an analytic technique such as a statistical test, the error would become immediately obvious, as the overlap in the ranges of assay values would be clearly visible. The error is not merely that the Student’s t-test (or any broad means to compare two populations) is the wrong choice of test, but that it is excessively sensitive to low-magnitude differences.
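The magnitude problem can be made concrete with a small simulation. All numbers below are hypothetical, chosen only to illustrate how a large sample can render a clinically trivial difference “statistically significant”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical populations whose mean assay values differ by only
# one quarter of a standard deviation, sampled with a large n per arm.
controls = rng.normal(loc=10.0, scale=2.0, size=1000)
cases = rng.normal(loc=10.5, scale=2.0, size=1000)

t, p = stats.ttest_ind(cases, controls)
print(f"t-test p value: {p:.2g}")  # comfortably below 0.05

# Yet the individual values overlap greatly: a large fraction of cases
# fall below the control-group median, so the assay value cannot
# classify an individual patient.
overlap = np.mean(cases < np.median(controls))
print(f"fraction of cases below the control median: {overlap:.2f}")
```

A dot plot of these two simulated samples would reveal the overlap at a glance, which is precisely the descriptive check recommended above.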

A similar problem emerges when the study is designed to be “highly powered”. Power refers to the study’s ability to detect differences of small magnitude. When discovering biomarkers, therefore, the use of large populations, and the common emphasis on “powering” the study, deserves some circumspection. The converse could be argued, that if the marker were truly useful in making clinical distinctions, one should only need a small number of cases to demonstrate its worth.

Other broad statistical comparisons of populations have received similar criticism. The odds ratio or the somewhat analogous hazard ratio (such as the beta value provided by a Cox proportional hazards analysis) compares the frequencies of a characteristic or an outcome between populations. When the populations truly differ in this characteristic, the ratio should differ from “1”. To determine whether the difference could readily be produced by random data patterns, one examines a p value (such as the p value for each beta value of a Cox analysis). When the p value is low, investigators may reasonably conclude that the odds of having the characteristic or outcome will truly differ between the populations. Intuitively, the higher the odds ratio indicated for the population at higher risk, the more clinically important should be the biomarker. Here, intuition can fail the investigator. Even among published biochemical assays, odds ratios of at least 3 or 4 are reasonably difficult to find and thus are perceived to be of some clinical importance. Certain flaws were explained by Kattan, including that the sophisticated calculations of the beta value and the p value from Cox analysis are not very reliable (2). The use of simple odds ratios is also deeply flawed, as explained by Pepe et al (3). Odds ratios near 3 do not indicate that randomly chosen individuals differing in the biomarker status will reliably also differ in their likelihood of having the characteristic or outcome of interest. 
Pepe et al illustrated the argument by noting that the attractive odds ratios cited for a prominent breast cancer biomarker (4) obscured the fact that the test’s false-positive fraction (defined as 1 – specificity) was unattractively high, at 42%. Odds ratios often need to be in the hundreds, two logs higher than is typically reported, in order to convey clinical robustness, i.e., for the biomarker’s values to reliably construct two newly defined groups having distinct clinical characteristics. Conventional clinical tests, such as a very low heart rate, a very low blood pressure, absence of brain electrical activity, or the pathologist’s diagnosis of metastatic cancer on a biopsy, convey odds and hazard ratios firmly positioned in the thousands. These tests define clinical groups that essentially do not overlap. In contrast, because the typical molecular biomarker-defined groups overlap greatly in their clinical features (i.e., the populations really do not differ that much), one cannot use the biomarker results to provide different care plans to the two groups.
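To see why an odds ratio near 3 cannot define non-overlapping groups, one can compute the ratio implied by a binary marker’s true-positive and false-positive fractions. The specific fractions below are hypothetical, chosen only to echo the 42% false-positive fraction cited above:

```python
def odds_ratio(tpf, fpf):
    """Odds ratio implied by a binary marker's true-positive
    fraction (sensitivity) and false-positive fraction (1 - specificity)."""
    return (tpf / (1 - tpf)) / (fpf / (1 - fpf))

# A marker with an apparently respectable odds ratio near 3 can still
# carry a 42% false-positive fraction (hypothetical values):
tpf, fpf = 0.70, 0.42
print(f"OR = {odds_ratio(tpf, fpf):.1f} at FPF = {fpf:.0%}")  # OR = 3.2

# For groups that "essentially do not overlap", the ratio becomes huge:
print(f"OR = {odds_ratio(0.95, 0.01):.0f}")  # OR in the thousands
```

The second calculation shows why conventional clinical tests, whose groups barely overlap, carry ratios orders of magnitude larger than most molecular biomarkers.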

Thus, plentiful evidence exists that many investigators focus on the least-valuable performance criteria when evaluating a predictive biomarker. As explained by Pepe et al and Kattan et al (2, 3), the more useful criteria include a test’s positive predictive value, which is superior to using the marker’s specificity, and the negative predictive value, which is a better metric than its sensitivity. Also, a biomarker should be able to improve the predictive accuracy when it is added to the other predictive considerations (such as age, diagnosis, type and stage of disease, etc.). The marker’s ability to improve upon predictive accuracy can be measured by the incremental model predictive accuracy. Calculating the concordance index provides a value for incremental comparison, specifically the probability that, given two random patients and a given model, the patient with the worse outcome is, in fact, predicted to have the worse outcome.
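A minimal sketch of the concordance index for binary outcomes may make the definition concrete. The patient scores and outcomes below are invented for illustration; real implementations must also handle censored survival times:

```python
from itertools import combinations

def concordance_index(predicted_risk, outcome):
    """Fraction of usable patient pairs in which the patient with the
    worse outcome was also assigned the higher predicted risk.
    Toy version: binary outcomes only; prediction ties count one half."""
    concordant = usable = 0.0
    for i, j in combinations(range(len(outcome)), 2):
        if outcome[i] == outcome[j]:
            continue  # a pair with equal outcomes carries no information
        usable += 1
        worse, better = (i, j) if outcome[i] > outcome[j] else (j, i)
        if predicted_risk[worse] > predicted_risk[better]:
            concordant += 1
        elif predicted_risk[worse] == predicted_risk[better]:
            concordant += 0.5
    return concordant / usable

# Hypothetical model scores and outcomes (1 = event occurred):
risk = [0.9, 0.8, 0.3, 0.2, 0.6]
event = [1, 0, 1, 0, 0]
print(concordance_index(risk, event))  # two thirds of usable pairs concordant
```

A value of 0.5 would indicate a model no better than coin-flipping; 1.0 would indicate perfect rank ordering of outcomes.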

Among the less-valuable criteria would be the ROC (receiver operating characteristic) curve, which commonly is used to judge the predictive and diagnostic accuracies of a marker whose values are quantitatively measured. The ROC curve plots the relationship of sensitivity and specificity; the AUC (area under the curve, or c-statistic) is a summary statistic from the ROC plot. Yet, neither the ROC curve nor the AUC conveys the clinical utility of the marker. In other words, one may become impressed by features of the ROC or the AUC despite the marker having a low magnitude of clinical utility. To give a better indication of the magnitude of its utility, one should instead determine whether the test causes patients to be reclassified from one risk category or diagnostic category to another (5, 6) and whether a test accomplishes this reclassification better than do other competing tests.

Some biomarkers may not reflect the practical considerations of clinical settings. Due to a logical fallacy or due to a focus on other issues, this shortcoming in practicality may not be noted early on during biomarker development. For example, for decades there have been efforts to study “field defects” of organs that cause regions of tissue to have a higher risk than other regions. At first consideration, a biomarker for such “field defects” could offer a useful diagnostic clue. One would then want to identify the major hurdles in developing this approach. In Shen et al (7), the authors suggested that identifying field defects in colon carcinogenesis is difficult because “identification of genes that clearly identify individuals at (high) risk for colon cancer has been lacking.” This line of reasoning embodies a logical flaw. A colon hypothetically harboring a field defect, i.e., a patch of mucosa carrying a higher risk than the surrounding regions, need not confer an overall higher risk for the individual if the patch were relatively small. In such an instance, the background level of risk from the rest of the colon and rectum may still provide the dominant determinant of overall individual risk for cancer. The field defect could not necessarily be expected to detectably affect the epidemiologic risk. This same biomarker report also illustrated another practical hurdle. Shen et al evaluated a marker that was associated in diseased persons with the presence of adjacent cancer. Although at a lower rate, 12% of normal persons were positive for the marker as well. If the biopsy were to have an overall false-positive rate of 12%, and if two dozen regions of a colon and rectum would need to be assayed in order to identify a given localized “field” having a defect, then one expects that nearly all of the screened normal individuals would be marker-positive in at least one region. These two hurdles are not merely a problem of low-magnitude risk or of false positives. They arise from the particular nature of the clinical setting, in which multiple regions of an organ contribute epidemiologic risk and would require screening in order for the biopsy-based marker to be useful.
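The arithmetic behind this expectation is simple. Assuming independent assays of roughly two dozen regions, each with a 12% false-positive rate:

```python
# Back-of-envelope check of the "field defect" screening hurdle:
# with a 12% per-biopsy false-positive rate and ~24 regions assayed,
# what fraction of normal individuals test positive somewhere?
fpr_per_region = 0.12
regions = 24
p_any_positive = 1 - (1 - fpr_per_region) ** regions
print(f"{p_any_positive:.1%}")  # 95.3%
```

Under these assumptions, roughly 95% of entirely normal individuals would carry at least one marker-positive region, confirming the expectation stated above.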

The biomarker need not strictly underperform in order to fail in its clinical application. Indeed, the marker may result in overdiagnosis. Overdiagnosis is the phenomenon by which cancers “are diagnosed that otherwise would not go on to cause symptoms or death” (8). Improved levels of sensitivity in imaging, biomarker, and cytologic technologies (such as mammography, chest x-ray, sputum analysis, and blood PSA measurement) already result in the diagnosis of cancer far in excess of what is observed by longitudinal observations of similar, but unscreened, populations. This excess is a measure of the degree of overdiagnosis (8). Overdiagnosis causes harm due to undeserved treatment, additional patient anxiety, a weakening of trust in the practice of cancer diagnosis, additional medical cost, and the dilution of cancer-treatment programs and clinical trials by cancer patients who paradoxically lack the expected increased risk of cancer morbidity and mortality carried by cancer patients diagnosed using older practices.

A biomarker may also be inappropriate for clinical application due to having arisen from a training population that was inappropriate, were the intended clinical use truly to be considered. This can occur when the tested population represents the wrong stage or carries a diagnostic bias. Biases and inappropriate populations are discussed in separate sections below as a cause of invalid markers, but here we are interested in ostensibly valid biomarkers that are by design divorced from the intended, quite attractive, clinical application. For example, the FDA in official letters criticized a commercially available ovarian cancer screening test (9, 10). The test had been evaluated in symptomatic patients but was proposed for use as a screen of asymptomatic persons, the latter representing an earlier stage of the disease, a stage that would be more difficult to detect, and a stage not yet evaluated for marker performance. The lack of a pre-symptomatic population is a common flaw in diagnostic biomarker studies. This lack, and the substitute testing of a later stage of the disease, leads to systematic over-performance of the evaluated markers. The markers, if tested in the pre-symptomatic setting, would perhaps be found to offer no clinically useful information. In the absence of samples from pre-symptomatic patients, the development process for diagnostic screening markers can be an empty and misleading exercise.

As was just introduced, the tested population may have a diagnostic bias, in which the population is either enriched in or stripped of features, rendering it inappropriate for marker development. For example, a control population lacking the disease of interest is likely to have already been significantly depleted of relevant features owing to these persons’ prior medical care. Specifically, a control group for a study of a prostate cancer marker can be expected to be depleted of persons having any of the following features: elevations of the biomarker PSA, incidents of cancer, or symptoms related to cancer. Subsequently, to compare these “matched controls” to prostate cancer patients is to compare an irrelevant, featureless, medically sterile population stripped of natural variation. Quite counter-intuitively, such comparisons may generate more false positives, the greater the effort expended to ensure a “high-quality” control group. A related bias was found to arise in case-control studies when the controls were restricted to persons who did not develop the disease or who did not have disease progression by the end of the followup period. Such a restriction is termed “pure control sampling” (11). Instead, one could use an unrestricted “incidence density sampling” of the controls. The latter, while obviously less “pure”, produces the desired clinically relevant result when estimating relative risk.

Additional explanation may be useful, for this is likely to be an unfamiliar area for many biomarker investigators. Diagnostic biases in patient populations are the natural analog to the “standardization fallacy” seen in cellular and animal experimental models (12). The process of standardizing a model or assay acts to reduce individual differences, the better to permit the study of group differences. Because not all experimental differences seen between groups are valid outside the test situation (i.e., they have low external validity), the findings can surreptitiously become enriched in such non-robust features. Indeed, even artifactual results can become impossible to detect once they are built into the standardized conditions (12). This over-defining of the control group is the clinical analog of the experimental standardization fallacy. The extreme diagnostic bias is presumably seen when using a model instead of real subjects. Models inherently, through their uniformity and not simply through their artificiality, make the results less relevant to the natural world. “This tradeoff between a tightly controlled subject population and generality in the real world is a common tension in most biomedical and clinical research” (13).

The standard of care for the cancer is too ingrained to permit testing of alternatives

A marker may indeed be valid, identifying useful subsets of patients that should be managed differently, and yet no change in clinical care may be permitted. Breast cancer, for example, has an extensive clinical literature and legal case law. These enforce a rigid “standard of care” that is resistant even to seemingly justified modifications. Financial incentives may impede changes in clinical practice. Clinical protocols may be deeply ingrained in insurance codes or other professional documents.

Your marker identifies a condition too rare to seriously consider further

Rare diseases permit intriguing opportunities for novel discoveries, and they involve few ingrained standards of care. Yet, the rarity may prevent extended study of the new marker. The condition itself may be rare, or the special decision, upon which the marker status bears, may be too infrequent.

Infrequency may also characterize some conditions among the common diseases. For example, even among the frequent cancer types, the dominant variables driving decisions regarding clinical care (such as stage, histologic type, prior treatment, etc.) may divide patients into increasingly smaller subgroups that reasonably cannot be expected to be further subdivided in practice by the new marker.