Sample Size Re-estimation

1. Introduction

Despite best efforts, some of the crucial information needed to design a confirmatory trial may be unavailable, or available only with a high degree of uncertainty, at the design stage. This could involve the initial estimates of within- or between-patient variation, a control group event rate for a binary outcome, the treatment effect desired to be detected, the recruiting pattern, or patients’ compliance, all of which impact the ability of the clinical trial to address its primary objective (Shih, 2001). There are many reasons why reliable information about these parameters may be scarce or unavailable. An obvious reason is that the new product has not been studied in the manner intended in the target patient population. This situation is not uncommon with HIV treatment, for example, when a new product is to be studied in a confirmatory trial as part of a drug cocktail, and the cocktail contains some newly approved anti-retroviral medications. Another reason for lack of information could be that changing medical practice might significantly affect an event rate. Thus, estimated event rates based on historical data might no longer reflect rates in the current medical environment. A third reason could be a conscious decision by the sponsor to minimize the phase II program, resulting in less than ideal information at the time when the confirmatory trial is being planned.

Mehta and Patel (2005) discussed a situation where a sponsor prefers starting out with a small initial commitment of patients but is willing to factor in the possibility that the sample size might need to be increased during the course of the trial. The latter could be accomplished by taking an interim assessment of the treatment effect and re-assessing whether the initial plan of a smaller sample size remains a viable option.

There are also situations where the primary study objective is to collect enough patient exposure data to assess the safety of a pharmaceutical product. Because of patient dropout, the total exposure time, or the number of subjects with the minimum required exposure duration, might be less than planned at the end of the trial if the current dropout pattern continues. As a result, there is a need to increase the sample size based on the dropout pattern observed while the trial is ongoing.

Offen et al (2006) discussed yet another situation where assumptions are made about the correlations among multiple co-primary endpoints when determining the sample size. Ideally, assumptions about the correlations should be based on data from similar trials conducted previously. However, trials sufficiently similar to support a dependable sample size determination might not exist.

When there is uncertainty about the assumptions made at the design stage, it may be prudent to check the validity of those assumptions using interim data from the study. If the assumptions appear erroneous, one might be able to make mid-course adjustments to improve the chance that the trial will reach a definitive conclusion. One such mid-course adjustment is to modify the sample size. In other words, the original sample size could be revised based on estimates derived from interim data. We will focus on this aspect of mid-course adjustment in this paper.

In this paper we will focus on sample size re-estimation (SSR) for phase III and phase IV studies. The discussion is relevant to both continuous and binary endpoints even though the basis for SSR might differ for those two cases. A condensed summary of the highlights from this paper was published in an executive summary of the PhRMA Adaptive Working Group (Gallo et al, 2006). A general discussion of the methodology is given in the next section. Recommendations related to operational aspects of SSR are given in Section 3. In Section 4, we briefly discuss two examples in which sample size adjustment was proposed as part of the original designs. In Section 5, we will discuss some evolving issues related to SSR that could benefit from further research.

2. Methodology

There are many sample size modification rules, each with a companion decision rule (test). The sample size modification rule interacts with the decision rule to determine the operating characteristics of the resulting design. Only calculations specific to each particular design can ensure that the design controls the type I error rate at the desired level. As theoretical development has progressed, simulation has played an increasingly large role in verifying that the desired operating characteristics are maintained.

2.1 Fully sequential and group sequential methods

For monitoring of serious safety or efficacy events, there may be value in evaluating whether or not to continue a trial each time an event is observed. Wald (1947) devised the sequential probability ratio test (SPRT), which can be applied in such a case. Siegmund (1985) gives a more recent treatment of fully sequential methods. As an example of continuous monitoring, consider the REST trial (see Section 4, Example 2), which evaluated the safety of a vaccine for rotavirus. A previous rotavirus vaccine had been withdrawn from the market due to an increased incidence of a serious event, intussusception. Normally, intussusception occurs in about one in 500 infants, so even a moderate increase in incidence requires a very large study to detect. The REST trial allowed evaluation of a stopping boundary following every case of intussusception to determine if there was sufficient evidence of an increased risk to justify stopping the trial.
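To make the idea concrete, the following is a minimal sketch of Wald's SPRT applied to a rare binary safety event. The event rates, error levels, and counts are illustrative assumptions, not the actual REST design parameters.

```python
# Minimal sketch of Wald's SPRT for monitoring a rare binary safety event.
# Parameter values below are illustrative, not those of the REST trial.
import math

def sprt_decision(n_events, n_subjects, p0, p1, alpha=0.05, beta=0.10):
    """Evaluate the SPRT of H0: rate = p0 vs H1: rate = p1 (> p0) after
    n_subjects observations containing n_events events."""
    # Log-likelihood ratio of H1 to H0 for binomial data.
    llr = (n_events * math.log(p1 / p0)
           + (n_subjects - n_events) * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)  # crossed: evidence of increased risk
    lower = math.log(beta / (1 - alpha))  # crossed: risk consistent with p0
    if llr >= upper:
        return "stop: evidence of increased risk"
    if llr <= lower:
        return "stop: no evidence of increased risk"
    return "continue monitoring"

# e.g., a background rate of 1 in 500 versus a doubled rate, re-evaluated
# after every observed case:
print(sprt_decision(n_events=6, n_subjects=1000, p0=0.002, p1=0.004))
```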

For most trials, the continuous monitoring required by fully sequential methods is excessive. Thus, group sequential methods have been developed. Under a group sequential design, interim analyses are performed after pre-planned numbers of patients have been enrolled and followed for a specified period of time. At the time of an interim analysis, pre-specified rules are applied to decide whether to stop the trial for a definitive finding, stop the trial for futility (i.e., little prospect of ever reaching a definitive finding), or continue the trial until the next planned analysis. As a result, the method can affect the size of a trial if sufficient evidence is available at an interim analysis to come to a conclusion. An extensive review of group sequential methods is given by Jennison and Turnbull (1999).

Another way of using a group sequential design to adjust trial size is to perform interim analyses based upon the amount of statistical information that has accrued, for example, after a fixed number of events has been observed in a design that requires, at the final analysis, enough events to detect a clinically meaningful treatment effect. If the overall event incidence is low, one can enroll more patients or lengthen the follow-up period to obtain the needed number of events. On the other hand, if the low overall incidence is a result of a large treatment effect, the trial might be stopped at an interim analysis based on conclusive early results.
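As a concrete sketch of event-driven sizing, the code below uses Schoenfeld's approximation for the number of events a logrank test needs to detect a given hazard ratio; the hazard ratio, error rates, and allocation are illustrative assumptions.

```python
# Sketch of event-driven sizing via Schoenfeld's approximation: the design
# targets a fixed number of events, and enrollment or follow-up is extended
# if events accrue more slowly than projected. Values are illustrative.
import math
from scipy.stats import norm

def required_events(hazard_ratio, alpha=0.025, power=0.90, alloc=0.5):
    """Events needed for a one-sided level-alpha logrank test."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    return (z_a + z_b) ** 2 / (alloc * (1 - alloc) * math.log(hazard_ratio) ** 2)

d = math.ceil(required_events(hazard_ratio=0.75))  # about 508 events
# If events accrue too slowly relative to the planned calendar time,
# follow-up or enrollment is extended until the event target d is reached,
# so the statistical information at the final analysis stays fixed.
print(d)
```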

2.2 Blinded sample size re-estimation

Blinded sample size re-estimation uses interim data without unblinding treatment assignment to provide an updated estimate of a nuisance parameter in order to update the sample size for the trial based on the estimate. Nuisance parameters mentioned in this context are usually the variance for continuous outcomes or the underlying event rate for binary outcomes. Gould (2001) reviews methods of this type and comments that they are reasonably comparable in performance and also compare favorably with methods that utilize unblinded estimates. Kieser and Friede (2003) considered blinded sample size re-estimation using a blinded estimate of variance for a continuous endpoint. Blinded sample size re-estimation is generally well accepted by regulators (ICH E-9, 1999). Mehta and Tsiatis (2001) note that, when viewed in terms of the information on the primary endpoint, the recruitment objective remains fixed under an information-based design and the trial could be considered “non-adaptive” from the information perspective even though the sample size might differ from what was originally targeted in the protocol.
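A minimal sketch of the blinded approach for a continuous endpoint is shown below: the pooled (“lumped”) variance of the blinded interim data replaces the planning variance in the standard two-sample formula. The function name and numbers are illustrative, and refinements such as the adjustment for the treatment-effect contribution to the blinded variance discussed by Kieser and Friede are omitted.

```python
# Sketch of blinded sample size re-estimation for a continuous endpoint:
# re-estimate the variance from blinded interim data and plug it into the
# usual two-sample normal approximation. Illustrative, simplified version.
import math
from scipy.stats import norm

def reestimated_n_per_arm(blinded_values, delta, alpha=0.025, power=0.90):
    """Per-arm sample size using the one-sample ("lumped") variance of the
    blinded interim data; treatment assignments are never unblinded."""
    m = len(blinded_values)
    mean = sum(blinded_values) / m
    s2_blinded = sum((x - mean) ** 2 for x in blinded_values) / (m - 1)
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    return math.ceil(2 * s2_blinded * (z_a + z_b) ** 2 / delta ** 2)

# Planning assumed sd = 10 to detect delta = 5 (n = 85 per arm); if the
# blinded interim data suggest sd closer to 12, the revised per-arm size
# is 2 * 144 * (1.96 + 1.28)**2 / 25, i.e. about 122 after rounding up.
```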

2.3 Unblinded sample size re-estimation

We will discuss methods summarized by Posch, Bauer and Brannath (2003), who conducted an extensive review of adaptive trial design methods. Gould (2001) also reviewed blinded and unblinded sample size re-estimation methods and compared them as noted above. While these methods generalize in useful ways to other trial adaptations, we focus primarily on SSR here. The basic adaptation strategy is that one or more interim analyses may be planned, and at the time of any interim analysis the sample size may be changed based on unblinded interim results or other factors such as external information. The usual approach is to attempt to adjust the sample size to provide the desired power under a certain assumption about the treatment effect. Each approach needs to address two crucial questions: 1) how can the sample size be adjusted while retaining the desired type I error rate, and 2) what treatment effect should the trial be powered for at the time of the interim analysis?

2.3.1 Type I error control

Combination tests are commonly used to control the type I error rate in adaptive designs with SSR. The first type of combination test we discuss combines test statistics from a pre-defined, fixed number of stages of a trial in a pre-defined manner. However, the sample size used to generate the test statistic for each stage may be defined based on the results of the previous stages. Rules are put into place at the beginning of the trial for each stage based on the combined statistics through that stage; these include rules for stopping the trial for futility, stopping for superiority, or continuing to the next stage.

Group sequential designs are a simple case of combination tests where the sample size does not change as a function of interim results. This is most easily seen when comparing sample means for two samples. Differences in means from each stage of the trial are combined as independent increments from the results of each stage, with weights being proportional to the sample size for each stage.

Expanding on this example, some authors propose using these same pre-defined weights for combining mean differences from different stages while allowing the sample size for each stage to vary based on results from the previous stages. By using the pre-defined weights, the null distribution of the interim and final test statistics is preserved. Since the weighting of the normal statistics will not, in general, be proportional to the realized sample size for each stage, the method does not use the sufficient statistics (the unweighted mean difference and estimated standard deviation from the combined stages) for testing, and is therefore less efficient (Tsiatis and Mehta, 2003). Additional discussion on efficiency can be found in Burman and Sonesson (2006) and Jennison and Turnbull (2006a).
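A minimal sketch of this weighted approach (in the spirit of Cui, Hung and Wang, 1999) is given below for two stages. The weights come from the planned stage sizes and are held fixed, which is what preserves the null distribution when the actual second-stage size changes; the variable names are illustrative.

```python
# Two-stage weighted test with pre-fixed weights: because w1 and w2 are set
# from the *planned* stage sizes, w1*z1 + w2*z2 is standard normal under H0
# even if the realized stage-2 sample size is increased after the interim.
import math

def weighted_final_z(z1, z2, n1_planned, n2_planned):
    """Combine independent stage-wise z statistics with pre-fixed weights;
    z2 must be computed from stage-2 data only, whatever its actual size."""
    w1 = math.sqrt(n1_planned / (n1_planned + n2_planned))
    w2 = math.sqrt(n2_planned / (n1_planned + n2_planned))
    return w1 * z1 + w2 * z2  # compare to the usual critical value

# When the realized stage-2 size differs from n2_planned, this statistic no
# longer equals the pooled (sufficient-statistic) z, which is the source of
# the efficiency loss noted by Tsiatis and Mehta (2003).
```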

Usually, combination tests are stated in terms of methods for combining p-values from different stages of a trial. A common method is Fisher’s combination test. Another method for combining p-values was proposed by Lehmacher and Wassmer (1999), who apply the inverse standard normal distribution to each stage p-value to get a standardized normal statistic. These normal statistics are then combined through a weighted sum to obtain a combined standardized normal statistic. The combination test p-value is the p-value for this combined statistic.
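The sketch below implements both combinations for a two-stage design: Fisher's product test and the inverse normal method. The equal weights are an illustrative choice; any pre-specified weights whose squares sum to one work for the inverse normal method.

```python
# Two standard p-value combination tests for a two-stage design.
import math
from scipy.stats import norm, chi2

def fisher_combination_p(p1, p2):
    """Fisher: under H0, -2*log(p1*p2) is chi-square with 4 df."""
    return chi2.sf(-2.0 * math.log(p1 * p2), df=4)

def inverse_normal_p(p1, p2, w1=math.sqrt(0.5), w2=math.sqrt(0.5)):
    """Lehmacher-Wassmer: transform stage p-values to normal scores and
    combine with pre-fixed weights satisfying w1**2 + w2**2 = 1."""
    z = w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2)
    return norm.sf(z)  # one-sided combined p-value

# Stage-wise one-sided p-values of 0.04 and 0.03 combine to roughly 0.009
# (Fisher) and 0.005 (inverse normal):
print(fisher_combination_p(0.04, 0.03), inverse_normal_p(0.04, 0.03))
```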

Although we discussed combination tests with normal random variables and a known variance, the results can generalize to other cases by using asymptotics. While combination tests may combine test statistics using methods other than the above, the comment on lack of efficiency is applicable to the general case.

An alternative method of designing trials with combination tests is to pre-specify the sample size, the futility rule, and the superiority rule only for the first interim analysis, and then to determine recursively at the time of each interim analysis what the sample size, futility rule and superiority rule are for the next stage. This is equivalent to assigning a weight for the first two stages of the trial at the beginning, and at the time of each interim analysis dividing the weight for the subsequent stage into two parts for the following two stages – until the point where it is decided to assign all of the remaining weight to a stage and stop the trial at the subsequent analysis.
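A numeric sketch of this weight-splitting idea is shown below, using inverse normal weights (for which the squared stage weights must sum to one, so splitting a stage's weight means splitting its squared weight). This is a simplified illustration in the spirit of recursive combination tests, not a complete implementation.

```python
# Recursive weight splitting with inverse normal weights: the squared stage
# weights always sum to one, so the null distribution of the combined
# statistic is preserved at every step. Numbers are illustrative.
import math

weights_sq = [0.5, 0.5]        # planned two-stage design, equal weighting
# First interim: continue, and split the remaining squared weight 0.5
# between two further stages (here 0.3 and 0.2, chosen at the interim):
weights_sq = [0.5, 0.3, 0.2]
assert abs(sum(weights_sq) - 1.0) < 1e-12
weights = [math.sqrt(w) for w in weights_sq]
# Second interim: assigning all remaining squared weight (0.2) to a single
# final stage fixes the design, and the trial stops after that analysis.
```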

2.3.2 Sample size adaptation methods

Consider the weighted combination of normal mean differences described above, with a pre-planned sequence of analyses. At a given stage, we make several assumptions:

  1. We assume some fixed parameter value for the underlying mean difference.
  2. We assume a known common variance for observations within each group.
  3. We assume the cutoffs for decision making at future analyses for the trial.
  4. We assume sample sizes for remaining stages in the trial.
  5. We specify a desired power for the remainder of the trial, conditional on the current results.

Given assumptions 1-4 and a test statistic for data through the current stage, one can compute the conditional power for a positive finding at each of the future planned analyses. By further assuming that the proportions of the remaining sample size allocated to each future interim analysis are fixed, we can attain the desired overall conditional power by varying the final sample size. Different choices have been considered for the underlying mean difference to use in computing the conditional power and determining the future sample size. Proschan and Hunsberger (1995) and Cui, Hung and Wang (1999) suggested using the observed estimate of the parameter at the time of the interim analysis. This has been criticized as inefficient by, for example, Jennison and Turnbull (2003). Liu and Chi (2001) and Posch, Bauer and Brannath (2003) suggest a fixed and predetermined value that does not depend on observed results. The latter authors demonstrated that for a two-stage design this approach (along with a fixed maximum sample size) could improve the expected sample size over a group sequential procedure. Bauer and Konig (2006) investigated SSR based on the conditional power approach, and demonstrated that mid-trial sample size recalculation based on an interim estimate might exact an overly large price in average sample size relative to the gain in overall power. As a result, they concluded that using the estimated effect size for sample size reassessment is not a recommendable option; using the original effect size from the planning phase would usually be preferable.
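To make the two-stage case concrete, the sketch below computes the second-stage sample size that gives a target conditional power for the weighted combination test described in Section 2.3.1, under assumed values of the effect (delta) and a known standard deviation (sigma). The function and its inputs are illustrative; delta may be the interim estimate (Proschan and Hunsberger; Cui, Hung and Wang) or the originally planned effect (as Bauer and Konig recommend).

```python
# Conditional-power SSR for a two-stage design using pre-fixed combination
# weights, so the type I error rate of the weighted test is preserved.
import math
from scipy.stats import norm

def stage2_n_per_arm(z1, w1, w2, delta, sigma, alpha=0.025, power=0.90):
    """Smallest per-arm stage-2 size so that the weighted final test
    w1*z1 + w2*z2 > z_crit has the target conditional power, assuming a
    true mean difference delta and known common sd sigma."""
    z_crit = norm.ppf(1 - alpha)
    hurdle = (z_crit - w1 * z1) / w2      # value z2 must exceed to win
    drift = hurdle + norm.ppf(power)      # required mean of z2
    if drift <= 0:
        return 0                          # already conditionally powered
    # E[z2] = delta * sqrt(n2 / 2) / sigma for a two-sample comparison
    return math.ceil(2 * (sigma * drift / delta) ** 2)

# Interim z1 = 1.0 with equal planned stages (w1 = w2 = sqrt(0.5)),
# delta = 5, sigma = 10: about 75 patients per arm are needed in stage 2.
w = math.sqrt(0.5)
print(stage2_n_per_arm(z1=1.0, w1=w, w2=w, delta=5.0, sigma=10.0))
```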

2.4 Optimized designs

Jennison and Turnbull (2006a, 2006b) consider expected sample size and power for group sequential and adaptive designs as a function of the underlying treatment difference. In the first article they suggested that using an interim observed treatment effect to adjust the sample size and achieve a desired conditional power could be very inefficient when compared to a group sequential design. In the second they compared pre-planned group sequential versus pre-planned adaptive trials using the expected sample size to achieve a given unconditional power for a fixed treatment difference of interest.

The work of Jennison and Turnbull, as well as research by Lokhnygina (2004) and Banerjee and Tsiatis (2005), has shown that it is possible to develop adaptive designs which are optimal in terms of minimizing a mixture of expected sample sizes over a specified range of treatment effect sizes. Jennison and Turnbull (2006a, 2006b) also found that optimal adaptive designs offered at most a minimal improvement in expected sample size compared to optimal group sequential designs. Thus, both flexibility and efficiency should be weighed when considering a design with SSR.

3. Recommendations

SSR techniques offer potential for improving program efficiency by allowing mid-course sample size adjustment when assumptions made at the planning stage may have been unreliable. During protocol planning, the potential use of these techniques should be considered on a routine basis. Potential advantages of SSR techniques do need to be balanced against possible procedural or logistic concerns, such as those which might affect trial integrity (e.g., do data need to be unblinded for the re-estimation, and if so, who will be involved in review and decision making? Might observers who see how the sample size is changed be able to infer information about treatment effects, with some potential to compromise trial integrity?). If an SSR approach is utilized, it should be implemented in a manner that minimizes such concerns to the extent feasible. Particular decisions regarding how to consider and implement SSR techniques will of course depend on the details of particular situations; nevertheless, some general recommendations and desirable characteristics are described below.

3.1 Planning

The need for sample size re-estimation should be anticipated as much as possible during trial planning. This applies not only to assumptions used in the sample size calculation which might turn out to have been erroneous, but also to the possibility that changes relevant to sample size might occur in the external environment while the trial proceeds. For example, during a long-term trial, advances in background therapy might lead to a lower event rate than had previously been observed. Allowing for potential SSR should by no means be viewed as a substitute for good up-front decision making; rather, it should supplement proper initial planning by being realistic about possible limitations of assumptions and by anticipating background environmental changes.