Merits and Limitations of Meta-analyses

Daren K. Heyland, M.D.

Kingston General Hospital, Angada 3, Kingston Ontario, K7L 2FF, Canada

phone +1-613-549-6666, fax +1-613-548-2577, e-mail:

Learning Objectives

·  To describe the methods used by systematic reviews to limit bias

·  To list the assumptions inherent in comparing randomized trials and meta-analyses

·  To list the criteria used to evaluate the validity of meta-analysis

·  To list the various ways of handling heterogeneity in a meta-analysis

Systematic Reviews

All reviews, narrative and systematic, are retrospective, observational research studies and are therefore subject to bias. Systematic reviews differ from narrative reviews in that they employ various methods to limit or minimize bias in the review process. These methods include a comprehensive search for all potentially relevant articles, the use of explicit, reproducible criteria to select articles for inclusion in the review, quality assessment of the data incorporated into the review, and a transparent, reproducible process for abstracting data. The key features of both narrative and systematic reviews are outlined in Table 1.

Meta-analyses are a subset of systematic reviews that employ statistical strategies to aggregate the results of individual studies into a numerical estimate of overall treatment effect. Systematic reviews are less likely to be biased, more likely to be up-to-date and more likely to detect small but clinically meaningful treatment effects than narrative reviews [1,2]. For example, it has been shown that narrative reviews may lag behind by more than a decade in endorsing a treatment of proven effectiveness, or they may continue to advocate a therapy long after it is considered harmful or useless [1].

Controversy Between Randomized Controlled Trials and Meta-analyses

Recently, meta-analyses have been coming under increasing scrutiny [3]. There has been increasing uncertainty regarding the validity of meta-analyses stemming from comparison of the results of large randomized trials and meta-analyses on the same topic. Such uncertainty likely deters both clinicians and decision makers from incorporating the results of meta-analyses into their decision making. While cautious interpretation of meta-analyses (and randomized trials!) is a good ‘rule of thumb’, discriminating against meta-analyses is not supported by current evidence, as I will explain below.

LeLorier and colleagues compared the results of 12 large randomized trials (>1000 patients) published in four leading medical journals to 19 previously published meta-analyses [4]. They found significant differences in the results of these comparisons and concluded, “...if there had been no subsequent randomized trial, the meta-analysis would have led to the adoption of an ineffective treatment in 32% of cases and rejection of useful treatment in 33 percent of cases.” The author of the accompanying editorial in the New England Journal of Medicine wrote, “.. I still prefer conventional narrative reviews” [5]. It is unclear to me why this author would have concluded that narrative reviews are superior to systematic reviews. The argument here is with meta-analysis, the practice of reducing the measure of effect to one number, not with systematic reviews.

Feature    / Narrative Review                              / Systematic Review
Question   / No specific question, usually broad in scope  / Focused clinical question
Search     / Not usually specified, potentially biased     / Comprehensive, explicit strategy
Selection  / Not usually specified, potentially biased     / Criterion-based selection
Appraisal  / Variable                                      / Rigorous critical appraisal
Synthesis  / Qualitative                                   / Qualitative + quantitative
Inferences / Sometimes evidence-based                      / Evidence-based

Table 1: Systematic and Narrative Reviews.

From: Cook DJ, Mulrow C, Haynes RB. Systematic reviews: synthesis of best evidence for clinical decisions. Ann Intern Med 1997; 126: 376-80

Subsequent letters to the editor have also refuted the findings of LeLorier and colleagues, arguing that their methods may have inflated the measure of disagreement [6]. By selecting 12 trials from four leading medical journals, the authors clearly had a non-representative sample of clinical trials; such journals tend to publish trials whose results disagree with prior evidence. In addition, the authors based their agreement statistics on the presence or absence of a statistically significant p value and ignored the fact that the point estimates may have been similar (although the confidence intervals may have differed). The credibility of this work is further undermined by the authors' apparently selective choice of comparators. For example, they cite discordance between the 1993 EMERAS trial and a 1985 meta-analysis of thrombolysis for AMI. As suggested by David Naylor, perhaps a more valid comparison would be ISIS-2 (a more definitive test of the hypothesis generated from the 1985 meta-analysis) or a 1994 meta-analysis that used individual patient data from all trials of thrombolysis with more than 1000 patients [7]. Others have challenged the notion that large randomized trials are the gold standard to which the results of meta-analyses should be compared [5]. Potential biases occur in both randomized trials and meta-analyses, and neither should be considered the gold standard in the absence of rigorous assessment of methodologic quality. Finally, it is not surprising that the results of meta-analyses differ from the results of randomized trials; they may be measuring different things. Given the variable characteristics of patients, interventions, outcomes and methods across studies included in a meta-analysis, discrepancies with large trials should be expected [5].

By carefully exploring the discordant results of meta-analyses and randomized trials, one can gain important insights into the treatment effect of the intervention under investigation. For example, a recent meta-analysis suggested that calcium supplementation may prevent preeclampsia [8]. A subsequent large, NIH-sponsored clinical trial concluded that there was no treatment effect in healthy nulliparous women. DerSimonian and colleagues explored the heterogeneity across studies in the meta-analysis and the inconsistent results with the large trial. They stratified studies in the meta-analysis according to the baseline risk of preeclampsia (the event rate in the placebo group). When they divided the studies into those of low- and high-risk patients, it was apparent that there was no treatment effect in low-risk patients, consistent with the large randomized trial of low-risk women, but there was still a significant treatment effect in high-risk patients.
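The stratification step described above can be sketched in a few lines of Python. The study counts and the 10% risk cutoff below are illustrative assumptions, not data from the calcium trials; the point is only to show how studies are split by control-group event rate and summarized separately.

```python
# Hypothetical per-study 2x2 counts: (events_trt, n_trt, events_ctl, n_ctl).
# These numbers are made up for illustration.
studies = [
    (3, 200, 3, 200),    # low baseline risk
    (5, 150, 5, 150),    # low baseline risk
    (6, 60, 18, 60),     # high baseline risk
    (5, 50, 15, 50),     # high baseline risk
]

def control_risk(study):
    """Baseline risk = event rate in the control (placebo) arm."""
    _, _, c, n2 = study
    return c / n2

# Stratify by an assumed 10% baseline-risk cutoff.
low = [s for s in studies if control_risk(s) < 0.10]
high = [s for s in studies if control_risk(s) >= 0.10]

def crude_rr(group):
    """Crude pooled risk ratio: total events over total patients per arm."""
    a = sum(s[0] for s in group); n1 = sum(s[1] for s in group)
    c = sum(s[2] for s in group); n2 = sum(s[3] for s in group)
    return (a / n1) / (c / n2)

print(f"low-risk stratum RR  = {crude_rr(low):.2f}")   # near 1: no effect
print(f"high-risk stratum RR = {crude_rr(high):.2f}")  # well below 1: effect
```

With these invented numbers, the low-risk stratum shows no effect while the high-risk stratum shows a large one, mirroring the pattern DerSimonian and colleagues reported. (A formal analysis would use weighted pooling within each stratum rather than crude totals.)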

In summary, bias can exist in both randomized trials and meta-analyses. Previous comparisons of the results of meta-analyses and randomized trials on the same topic have demonstrated concordance in 80-90% of cases [9,10]. Not surprisingly, on some occasions, discordance is present. By exploring the discrepancy between meta-analyses and randomized trials, one can gain important insights into the effect of interventions.

Validity and Generalizability of Systematic Reviews

While systematic reviews are useful tools to discern whether interventions are efficacious or not, they still need to be evaluated for their methodological quality. Not all systematic reviews (or randomized trials) are created (or completed) equally. What distinguishes a high quality systematic review from a low quality review? A recent publication in Critical Care Medicine outlined key criteria that need to be considered when evaluating the strength of review articles [11]. To assess the validity of systematic reviews (or meta-analyses if a quantitative summary is provided), one needs to consider whether a comprehensive search strategy was employed to find all relevant articles, whether the validity of the original articles was appraised, whether study selection, validity assessments and data abstraction were done in a reproducible fashion and whether the results were similar from study to study (statistical homogeneity). Obviously, we can make stronger inferences from studies that employ more rigorous methods (see Figure).

One of the weaknesses of randomized trials is their limited generalizability. Because meta-analyses combine many studies with subtle differences in patients, settings, interventions, etc., provided it is clinically reasonable to combine these studies and no statistical heterogeneity is present, the results of meta-analyses have greater generalizability than a single randomized trial. For example, a randomized trial of parenteral nutrition in surgical patients at Veterans Affairs hospitals in the USA (predominantly white males) would have limited generalizability. A meta-analysis of total parenteral nutrition (TPN) in the critically ill patient combines the results of 26 studies of different patients in different settings and therefore offers the best estimate of treatment effect that is generalizable to a broader range of patients. This is consistent with the perspectives of decision makers, who are more concerned with the effects of health care on groups of patients than on the individual.


Strategies to Handle Heterogeneity in a Meta-analysis

One of the main objectives of meta-analysis is to combine “similar” studies to determine the best estimate of overall treatment effect. The question, “Are these studies similar?”, needs to be asked from both a clinical and a statistical perspective. It has to make clinical sense to combine several studies and to exclude others. For example, it may be quite inappropriate to combine studies of nutritional interventions in patient populations as diverse as neonates, obese adults and patients undergoing bone marrow transplantation. In contrast, it may make sense to combine studies of adult patients undergoing major elective surgery with studies of critically ill adult patients, in that they share similarities in their metabolic response to injury.

Statistical tests of homogeneity (or heterogeneity) ask the question, “Are the results of the various included studies similar (or different) to each other?” If either clinical or statistical heterogeneity is present, it weakens, if not invalidates, any inferences from the overall estimate of treatment effect. The goal of further statistical testing is then to try to explain why such differences occur. Strategies to deal with heterogeneity include: 1) excluding studies from the meta-analysis that, on the basis of clinical judgement, appear to be different, 2) meta-regression techniques, and, most commonly, 3) subgroup analyses.
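A common statistical test of the kind mentioned above is Cochran's Q, often reported alongside Higgins' I² statistic. The sketch below computes both on the log risk-ratio scale with inverse-variance weights; the 2x2 counts are hypothetical and serve only to illustrate the mechanics.

```python
import math

# Hypothetical per-study 2x2 counts: (events_trt, n_trt, events_ctl, n_ctl).
studies = [
    (12, 100, 18, 100),
    (30, 250, 35, 250),
    (8, 80, 20, 80),
    (40, 300, 42, 300),
]

def log_rr_and_var(a, n1, c, n2):
    """Log risk ratio and its approximate variance for one study."""
    log_rr = math.log((a / n1) / (c / n2))
    var = 1 / a - 1 / n1 + 1 / c - 1 / n2
    return log_rr, var

logs, variances = zip(*(log_rr_and_var(*s) for s in studies))
weights = [1 / v for v in variances]

# Fixed-effect (inverse-variance) pooled log risk ratio.
pooled = sum(w * y for w, y in zip(weights, logs)) / sum(weights)

# Cochran's Q: weighted squared deviations of each study from the pool.
q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, logs))
df = len(studies) - 1

# Higgins' I^2: percent of total variation attributable to heterogeneity.
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"Q = {q:.2f} on {df} df, I^2 = {i_squared:.0f}%")
```

If Q is large relative to its degrees of freedom (equivalently, I² is high), the studies' results differ more than chance alone would predict, and one of the three strategies above should be pursued before trusting the pooled estimate.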

Subgroup analyses should be specified a priori and should be based on differences in baseline patient or treatment characteristics, not on events that occur post-randomization [12]. There is considerable debate over whether the results of subgroup analyses confirm hypotheses or should be viewed as hypothesis-generating exercises. There are several criteria one can assess to establish the strength of inference from a subgroup analysis [13]. Because of the heterogeneity across studies and their typically larger sample size, systematic reviews can provide insights into important subgroup effects. For example, in a recent meta-analysis of TPN in the critically ill patient, there were 26 randomized trials of 2,211 patients that compared the use of TPN to standard care (usual oral diet plus intravenous dextrose) in surgical and critically ill patients [14]. Overall, when the results of these trials were aggregated, there was no difference between the two treatments with respect to mortality (risk ratio = 1.03; 95% confidence interval, 0.81-1.31). There was a trend toward a lower total complication rate in patients who received TPN, although this result was not statistically significant (risk ratio = 0.84; 95% confidence interval, 0.64-1.09). However, heterogeneity across studies precluded strong inferences based on the aggregated estimate, and therefore several a priori hypotheses were examined.
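Summary figures of the form quoted above (e.g. risk ratio 1.03, 95% CI 0.81-1.31) can be produced by inverse-variance (fixed-effect) pooling on the log scale. The sketch below shows the calculation; the 2x2 counts are invented for illustration and are not the TPN trial data.

```python
import math

# Hypothetical per-study 2x2 counts: (events_trt, n_trt, events_ctl, n_ctl).
studies = [
    (10, 100, 12, 100),
    (22, 200, 18, 200),
    (7, 90, 9, 95),
]

def study_effect(a, n1, c, n2):
    """Log risk ratio and its approximate variance for one study."""
    log_rr = math.log((a / n1) / (c / n2))
    var = 1 / a - 1 / n1 + 1 / c - 1 / n2
    return log_rr, var

effects = [study_effect(*s) for s in studies]
weights = [1 / v for _, v in effects]

# Weighted average of log risk ratios; SE from the summed weights.
pooled_log = sum(w * y for (y, _), w in zip(effects, weights)) / sum(weights)
se = math.sqrt(1 / sum(weights))

# Back-transform to the risk-ratio scale with a 95% confidence interval.
rr = math.exp(pooled_log)
lo = math.exp(pooled_log - 1.96 * se)
hi = math.exp(pooled_log + 1.96 * se)
print(f"pooled RR = {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

A confidence interval that straddles 1.0, as in the mortality result above, indicates no statistically significant difference between the treatments. (A random-effects model would widen the interval further when heterogeneity is present.)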

Studies including only malnourished patients were associated with lower complication rates but no difference in mortality when compared to studies of non-malnourished patients. Studies published since 1989 and studies with a higher methods score showed no treatment effect while studies published in 1988 or before and studies with a lower methods score demonstrated a significant treatment effect. Complication rates were lower in studies that did not use lipids; however, there was no difference in mortality rates between studies that did use lipids and those that did not. Studies limited to critically ill patients demonstrated a significant increase in complication and mortality rates compared to studies of surgical patients.

While we had set out to summarize the effect of TPN in critically ill patients, only six studies included patients who would routinely be admitted to the ICU as part of their management. Inasmuch as surgical patients and ICU patients have a similar stress response to illness, it was considered reasonable to aggregate studies of surgical and critically ill patients. However, the results of our subgroup analysis suggest that both mortality and complication rates may be increased in critically ill patients receiving TPN, and these treatment effects may differ from the results in surgical patients. The results of studies evaluating the effect of TPN in surgical patients therefore may not be generalizable to all types of critically ill patients. This leaves a very limited data set on which to base the practice of providing TPN to critically ill patients, and the best estimate to date is that TPN may be doing more harm than good. These subgroup findings have important implications for designing future studies. Obviously, the studies would need to be well designed. It would also appear that elective surgical patients should not be combined with critically ill patients and that, for short-term administration, lipids could be omitted.

Conclusions

In the last decade, we have seen a proliferation in the number of published systematic reviews. Systematic reviews are an efficient way to synthesize current knowledge on topics of relevance. They provide the best estimate of overall treatment effect that is generalizable to the broadest range of individuals. Differences between the results of randomized trials and meta-analyses are to be expected, and by exploring these differences, further insights into the effectiveness of interventions can be gained. To assess the validity of systematic reviews (or meta-analyses, if a quantitative summary is provided), one needs to consider whether a comprehensive search strategy was employed to find all relevant articles, whether the validity of the original articles was appraised, whether study selection, validity assessments and data abstraction were done in a reproducible fashion, and whether the results were similar from study to study. Obviously, one can make stronger inferences from meta-analyses that employ more rigorous methods. Moreover, because of the heterogeneity present in meta-analyses and their large sample sizes, they offer informative subgroup analyses. Systematic reviews and meta-analyses are important research tools for illuminating the effectiveness of our therapeutic interventions.

References

1.  Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC. A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. J Am Med Assoc 1992; 268: 240-8