7 Truncated studies: Was the trial stopped early for “overwhelming” evidence of benefit or futility?

Studies may be stopped early for efficacy as part of an ethical obligation to not expose participants to less effective treatment (or placebo) any longer than necessary. In other words, once it is sufficiently clear that an intervention is efficacious, there is reason to end the trial.

However, stopping early runs the risk of overestimating the effect size of the intervention. The estimate of effect varies randomly around the true effect over time (with greater fluctuation early in the trial, when there are fewer events), so interim looks may lead to a premature stop due to an exaggerated estimate of the true effect size.

Consider the following simulated trial where there is no true difference between the groups (i.e. RR = 1.0):

Graph 2. Relative risk vs. number of events in a simulated trial. Created via Microsoft Excel using the RAND function to generate randomized event-data for two groups.

As depicted in Graph 2 above, there is random deviation from the true effect as events accumulate. If the trial had interim analyses for benefit every 100 events, and the threshold for statistical significance was kept at the standard p<0.05 without accounting for interim looks, then the trial may have stopped at 100 events when the RR was 1.3, which we know to be an exaggeration of the true effect (RR = 1.0, i.e. no effect).
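A similar simulation can be sketched in a few lines of code (an illustrative sketch, not the Excel simulation behind Graph 2): two equal-sized groups share an identical true event risk, so each new event is equally likely to come from either group, and the observed RR is recorded at each 100-event look.

```python
import random

random.seed(1)

def simulate_trial(n_events=1000, look_every=100):
    """Accumulate events in two equal-sized groups with identical true
    event risk (true RR = 1.0) and record the observed RR at each look.
    With equal group sizes, the observed RR is simply the ratio of
    event counts between groups."""
    a = b = 0  # events in group A / group B
    looks = []
    for i in range(1, n_events + 1):
        # no true difference: each event equally likely from either group
        if random.random() < 0.5:
            a += 1
        else:
            b += 1
        if i % look_every == 0 and b > 0:
            looks.append((i, a / b))
    return looks

for n, rr in simulate_trial():
    print(f"{n:4d} events: observed RR = {rr:.2f}")
```

Early looks drift furthest from the true RR of 1.0; by 1,000 events the estimate settles close to it, which is why a naive early look risks catching an exaggerated estimate.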

As a simplified example, imagine studying a chess player and trying to assess if they are an above-average player (and by what margin) by judging their win percentage. One approach is to wait 50 matches, then assess their win percentage and judge accordingly. However, this could waste time as it might be unnecessary to wait that long if they are quite skilled (e.g. winning 90% of their first 10 games). So instead there could be an assessment of skill every 5 matches (up to a maximum of 50 matches). If they seem sufficiently impressive at one of these midpoint assessments, then the observation could be stopped. While this might save time, it also has a risk: if by pure chance the player goes on a win streak, then the observation is likely to end early. Even if our player is truly above-average in skill, an early stop is most likely to occur when they are on such a hot streak, consequently introducing bias into our assessment (e.g. assessing their win probability to be 80% due to the win streak, when in fact it is only 60%).
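The chess scenario above is straightforward to simulate. The sketch below is purely illustrative: the 60% true win probability and the 80% "sufficiently impressive" checkpoint threshold are arbitrary assumptions. It compares the average skill estimate when observation can stop early at a hot streak versus always playing out all 50 matches.

```python
import random

random.seed(42)

TRUE_WIN_PROB = 0.6   # hypothetical "true" skill (assumption)
N_TRIALS = 20000      # number of simulated 50-match observation periods

def observed_win_rate(stop_early=True):
    """Play up to 50 matches; with early stopping, stop at any 5-match
    checkpoint where the running win rate looks 'sufficiently
    impressive' (>= 80% here, an arbitrary illustrative threshold)."""
    wins = 0
    for match in range(1, 51):
        wins += random.random() < TRUE_WIN_PROB
        if stop_early and match % 5 == 0 and wins / match >= 0.8:
            return wins / match
    return wins / 50

naive = sum(observed_win_rate(True) for _ in range(N_TRIALS)) / N_TRIALS
fixed = sum(observed_win_rate(False) for _ in range(N_TRIALS)) / N_TRIALS
print(f"true win probability:       {TRUE_WIN_PROB:.2f}")
print(f"mean estimate, early stops: {naive:.2f}")  # biased upward
print(f"mean estimate, full 50:     {fixed:.2f}")  # ~unbiased
```

Averaged over many observation periods, the early-stopping estimate sits well above the true 60%, while the fixed 50-match estimate does not: the stopping rule preferentially captures win streaks.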

This is the major concern with stopping rules: there is a systematic tendency for an early stop to capture an overestimate of the true effect. Safeguards in the design of stopping rules cannot prevent this bias towards overestimation entirely, but they can reduce its extent, as discussed below.

Checklist Questions

Was there a predefined interim analysis plan with a stopping rule?
Did the stopping rule involve few interim looks and a stringent p-value (e.g. <0.001)?
Did enough endpoint events occur?

Was there a predefined interim analysis plan with a stopping rule?

If there is no pre-planned stopping rule then there is no assurance that sufficient safeguards were in place to minimize bias from early stops.
E.g. In JUPITER (Ridker PM et al.), an RCT of rosuvastatin vs. placebo in a highly-selected primary cardiovascular prevention population, the pre-planned stopping rule was mentioned, though poorly described, in an early report: “Frequency of interim efficacy analyses and rules for early trial termination have been prespecified and approved by all members of this board.”

Did the stopping rule involve few interim looks and a stringent p-value (e.g. <0.001)? 

As the number of interim looks increases, the probability of finding a false positive or an overestimation also increases. This can be mitigated by (1) minimizing the number of interim looks and (2) having a stricter threshold for statistical significance that accounts for these multiple interim analyses.

Some common interim analysis strategies used (Schulz KF et al.) are:
Pocock: To keep the overall trial p-value threshold (alpha) = 0.05, the number of interim analyses is pre-defined & all analyses use the same adjusted statistical significance threshold (i.e. p<0.029 for 2 planned analyses, p<0.016 for 5 planned analyses, and so forth).
Peto: Assign the final analysis a p-value threshold of 0.05 (as in a conventional trial), but apply a more stringent threshold (i.e. p<0.001) to the interim analyses.
O’Brien-Fleming: Apply very stringent thresholds to the earliest interim analyses, then successively relax them as the trial approaches the final analysis (e.g. for 3 interim analyses & a final analysis, a sequence of p-value thresholds of 0.0001, 0.004, 0.019, 0.043).
Lan-DeMets: An adaptable (“alpha-spending”) approach in which the significance threshold and the timing of analyses can change in accordance with the information observed so far.
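The effect of multiplicity, and of a Pocock-style adjustment, can be demonstrated with a small simulation (an illustrative sketch, not taken from the cited sources). With no true effect, a z-statistic is checked at 5 equally spaced looks: the naive threshold |z| > 1.96 (p<0.05) "finds" a benefit far more often than the nominal 5%, while the Pocock-adjusted threshold of roughly |z| > 2.41 (p<0.016 per look) restores the overall false-positive rate to about 5%.

```python
import math
import random

random.seed(7)

def false_positive_rate(z_crit, looks=5, n_sims=50000):
    """Fraction of null trials 'stopped for benefit' at any of `looks`
    equally spaced interim analyses, testing |z| > z_crit each time.
    Each look adds one standard-normal increment of information, so the
    z-statistic at look k is the running sum divided by sqrt(k)."""
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for k in range(1, looks + 1):
            total += random.gauss(0.0, 1.0)  # no true effect
            if abs(total) / math.sqrt(k) > z_crit:
                hits += 1
                break
    return hits / n_sims

# 1.96 <-> naive p<0.05 at every look; 2.41 <-> Pocock p<0.016 per look
print(f"naive  p<0.05 at each look:  {false_positive_rate(1.96):.3f}")
print(f"Pocock p<0.016 at each look: {false_positive_rate(2.41):.3f}")
```

The naive strategy yields an overall false-positive rate around 14% rather than 5%, which is exactly the inflation the adjusted boundaries above are designed to prevent.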

E.g. JUPITER (Ridker PM et al.) was stopped after the first of two interim analyses using “O’Brien-Fleming stopping boundaries determined by means of the Lan-DeMets approach,” (which requires a p-value <0.005). The actual p-value for the primary endpoint was <0.00001. 

Did enough endpoint events occur? 

Trials stopped early for benefit exaggerate the relative effect of an intervention by an average of 29% compared with trials that conclude as planned (Bassler D et al.). As events accumulate, the fluctuations in effect size measures become smaller and there is less risk of bias (see graph above). Optimally, ≥500 events (Bassler D et al.) should occur before stopping, beyond which the exaggeration decreases to an average of 12%.

For these reasons, skepticism is warranted for any relative risk reduction (RRR) ≥50% generated in truncated trials with <100 events (Pocock SJ et al., Montori VM et al.). The larger the number of events and the more plausible the RRR (e.g. ~20-30% is typical for the impact of cardiovascular pharmacotherapy on cardiovascular events), the more believable the results.
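For reference, the relative risk reduction is simply one minus the relative risk. The counts below are arbitrary illustrations, not data from any trial:

```python
def relative_risk(events_tx, n_tx, events_ctrl, n_ctrl):
    """RR = risk in the treatment group / risk in the control group."""
    return (events_tx / n_tx) / (events_ctrl / n_ctrl)

# illustrative counts only: 60/1000 events vs. 100/1000 events
rr = relative_risk(60, 1000, 100, 1000)
rrr = 1 - rr
print(f"RR = {rr:.2f}, RRR = {rrr:.0%}")  # RR = 0.60, RRR = 40%
```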

E.g. In JUPITER (Ridker PM et al.), 393 primary (composite) endpoint events occurred between the two groups by the interim analysis. The RRR for the primary endpoint was 44%, and the RRRs for individual components ranged from 18-54%. 

License


NERDCAT Copyright © 2022 by Ricky Turgeon is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
