2 Risk of bias: Are the results internally valid?

Randomization is the core of the RCT and ensures that the play of chance dictates whether any given participant is assigned the intervention or comparator(s). Because of this, baseline characteristics tend to be similar between randomized groups, though imbalances can still occur by chance. So, at the start of the trial, each group should tend to have a similar probability of experiencing any outcome. If this similarity is properly maintained (i.e. neither bias nor confounding introduces differences between groups), then any differences in outcomes will be due either to treatment allocation or to chance.

Towards this objective of maintaining similar groups, other strategies (such as blinding of participants, clinicians, and investigators) are often implemented to minimize differences in care between groups over the course of the trial. If these strategies are not successful, then any differences in outcomes between groups could also be attributable to these differences in care, thus introducing bias.

The internal validity sections of this chapter will focus on describing key sources of bias in RCTs, how to identify them, and how to evaluate their impact on observed study results.

Checklist Questions

Was the sequence generation random?
Was the allocation sequence concealed until participants were enrolled and assigned to interventions?
Were participants, clinicians, outcome assessors, and investigators blinded?
Were there deviations from the intended intervention that arose because of a lack of blinding?
Could measurement or ascertainment of the outcome have differed between intervention groups?
Could assessment of the outcome have been influenced by knowledge of intervention received?
Did participants from the comparator group receive the intervention from the intervention group (or vice versa)?
Were data for key outcomes available for all, or nearly all, participants randomized?
Were patients analyzed in the groups to which they were randomized (ITT), or did researchers only count participants who were adherent to their study treatment (per protocol) or completed the full trial duration (completer analysis)?
Are the ITT methods appropriate?
Are any important outcomes included in the study protocol but absent from the publication? Is this justified?

Allocation Bias: Were patients appropriately randomized with allocation concealment? 

Sequence generation (i.e. randomization)

Unclear or inadequate sequence generation exaggerates relative benefits of an intervention on average by ~11% (Savović J et al.).

Table 2. Adequate and inadequate randomization methods.
| Adequate randomization | Inadequate randomization |
| --- | --- |
| Computer-generated random sequence (preferred) | Quasi-randomization (e.g. alternation by case number or date of birth) |
| Random numbers table | Treatment assignment left to the discretion of the clinician |
| Coin toss | |
| Drawing cards | |
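
To make the contrast in Table 2 concrete, the following is a minimal Python sketch (not taken from any trial) comparing a computer-generated random sequence with alternation. The function names, arm labels, and seed are hypothetical, and simple (unrestricted) randomization is assumed; real trials often use blocked or stratified schemes.

```python
import random


def simple_randomization(n_participants, arms=("intervention", "comparator"), seed=None):
    """Computer-generated sequence: each assignment is drawn independently at random."""
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible for auditing
    return [rng.choice(arms) for _ in range(n_participants)]


def alternation(n_participants, arms=("intervention", "comparator")):
    """Quasi-randomization: arms alternate by enrolment order (hypothetical example)."""
    return [arms[i % len(arms)] for i in range(n_participants)]


random_sequence = simple_randomization(10, seed=2022)
alternating_sequence = alternation(10)

# With alternation, knowing one assignment reveals the next one with certainty.
predictable = all(
    alternating_sequence[i] != alternating_sequence[i + 1]
    for i in range(len(alternating_sequence) - 1)
)

print(random_sequence)
print(alternating_sequence)
print("Alternation fully predictable:", predictable)
```

The point of the sketch is predictability: an enrolling clinician who can anticipate the next assignment can choose which participant to enrol next, which is why alternation is considered inadequate.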

Allocation concealment

Unclear or inadequate allocation concealment exaggerates relative benefits of an intervention on average by ~7% (Savović J et al.).

Table 3. Adequate and inadequate allocation concealment methods.
| Adequate allocation concealment | Inadequate allocation concealment |
| --- | --- |
| Central randomization (look for “interactive web-response system” or “interactive voice-response system” within a study manuscript) (preferred) | Allocation scheme posted on a bulletin board |
| Coded identical drug boxes/vials | Non-opaque, non-tamper proof envelopes |
| Sequentially-numbered, tamper-proof, sealed opaque envelopes (preferably lined with cardboard or foil) | |
| On-site locked computer system | |

Blinding: Were participants, treating clinicians, outcome assessors, or investigators aware of treatment assignment during the trial?

Lack of or unclear blinding is associated with an average ~13% exaggeration of the relative benefits of an intervention for dichotomous outcomes (Savović J et al.), and a 68% exaggeration of relative benefits for subjective continuous outcomes (Hróbjartsson A et al.).

Note that double-blinding does not have a standardized definition, so readers need to check the trial report to ascertain exactly who was blinded (Lang TA et al.).

Blinding of participants and clinicians

Table 4. Adequate and inadequate blinding methods for participants and clinicians.
| Adequate blinding of participants and clinicians | Inadequate blinding of participants and clinicians |
| --- | --- |
| Used an identical placebo/control product without indication that treatments were distinguishable | PROBE: Prospective randomized open-label, blinded endpoint trial (open-label refers to a trial that is unblinded by design, and does not refer to cases where blinding is simply inadequate) |

Blinding of outcome assessors 

Awareness of treatment allocation by participants and clinicians may introduce performance bias, whereas awareness of allocation by outcome assessors may introduce detection bias. This is compounded when participants or their clinicians are also the outcome assessors (e.g. a patient aware of treatment allocation asked to rate their pain or fill out a quality-of-life questionnaire). Lack of blinding is a particularly important source of bias with the use of subjective outcomes – one review (Wood L et al.) found that lack of blinding exaggerated the OR of subjective outcomes by ~30%. Conversely, the same review found that no statistically significant bias was introduced by lack of blinding for objective outcomes. This review provided evidence that all-cause mortality is particularly resistant to detection bias even when trials are not blinded.

Table 5. Adequate blinding methods for outcome assessors and difficult situations to blind.
| Adequate blinding of outcome assessors | Difficult situations to blind |
| --- | --- |
| Outcomes adjudicated by an independent central adjudication committee | The intervention has an effect on a readily-measurable biomarker or the drug has an easily observable adverse effect profile (e.g. iron causing darkened stools) |

E.g. #1 Among several concerns raised by the FDA regarding the PLATO trial (DiNicolantonio JJ et al.), a RCT comparing ticagrelor vs. clopidogrel for patients with acute coronary syndrome, it was noted that blinding was not sufficiently protected. This is because the “dummy capsules” (identical in appearance to the ticagrelor-containing capsule) could be opened, revealing a clopidogrel tablet cut in half. This could unblind both patients and sponsor site monitors (who were given unused capsules). There were also concerns that too many groups involved in the trial had access to treatment assignments (and could subsequently become unblinded). These concerns cast doubts on the internal validity of both the efficacy and safety outcomes, especially when combined with additional concerns by the FDA regarding the inaccuracy of reported events.

Some situations initially thought to be impossible to blind can be successfully blinded with some ingenuity.

E.g. #2 In ROCKET-AF (Patel MR et al.), INR was measured centrally and clinicians taking care of patients on rivaroxaban were given dummy INR values for which to adjust the warfarin-placebo dose.

Were there differences between groups in the receipt of co-interventions?

Co-interventions may introduce bias if they affect the outcomes of interest and are distributed differently between groups.

E.g. #3 CONTACT (Roddy E et al.), an unblinded non-inferiority trial comparing naproxen and colchicine for acute gout attacks, found no difference in pain control between groups. However, co-intervention analgesic use (e.g. acetaminophen, ibuprofen) was 42% in the colchicine group and only 25% in the naproxen group. This raises the possibility that pain control would have been inferior in the colchicine group had it not been for the additional analgesic use.

Were outcomes monitored consistently between groups? If not, was this likely to bias the results?

Outcome measures ought to be consistent between groups with regard to:

  • Which outcomes were examined
  • How they were examined
  • How frequently they were examined
E.g. #4 RATE-AF (Kotecha D et al.) was a RCT comparing the impact of digoxin vs. bisoprolol on quality of life in participants with heart failure with preserved ejection fraction and atrial fibrillation. Participants were prompted to report adverse effects using adverse effects listed in the medication product monograph. It is unclear if an aggregate list was used for all participants or if a drug-specific list was used for each group. Given the extensive list of adverse effects listed on the bisoprolol monograph and lay perceptions of beta-blocker-related adverse effects, differential lists would be expected to bias the adverse effect outcomes in favor of digoxin. This would be an example of differential outcome monitoring between study groups.

Crossover bias: Did participants from the comparator group receive the intervention from the intervention group (or vice versa)?

Crossover bias attenuates differences in outcomes between groups as a group accrues more participants that are taking the treatment intended for the other arm (e.g. patients in the placebo group receiving active treatment). This makes superiority harder to demonstrate and makes non-inferiority easier to demonstrate. The extent of bias introduced will depend on the extent of crossover/contamination between groups.

E.g. HPS (Heart Protection Study Collaborative Group) was a RCT evaluating the effect of simvastatin 40 mg daily vs. placebo on mortality and cardiovascular events. By the 5th year of follow-up, 32% of patients in the placebo group were receiving a statin other than simvastatin, likely initiated due to higher LDL levels. If we assume these other statins were effective in reducing mortality and cardiovascular events, they may have attenuated the difference seen between the simvastatin and placebo groups for these outcomes.
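
The attenuation described above can be illustrated with simple arithmetic. The Python sketch below uses hypothetical event risks (10% untreated, 8% treated), not HPS results; only the 32% crossover proportion is borrowed from the example. It also assumes, as a deliberate simplification, that participants who cross over receive the full effect of the other arm's treatment for the entire trial.

```python
def observed_risks(risk_untreated, risk_treated,
                   crossover_to_active=0.0, crossover_to_control=0.0):
    """Return (active_arm_risk, control_arm_risk) after mixing in crossovers.

    crossover_to_active: fraction of the control arm that receives active treatment.
    crossover_to_control: fraction of the active arm that stops active treatment.
    """
    control = (1 - crossover_to_active) * risk_untreated + crossover_to_active * risk_treated
    active = (1 - crossover_to_control) * risk_treated + crossover_to_control * risk_untreated
    return active, control


# Hypothetical risks chosen for illustration only.
active_clean, control_clean = observed_risks(0.10, 0.08)
active_mixed, control_mixed = observed_risks(0.10, 0.08, crossover_to_active=0.32)

arr_without_crossover = control_clean - active_clean  # absolute risk reduction, no crossover
arr_with_crossover = control_mixed - active_mixed     # attenuated by crossover

print(f"ARR without crossover: {arr_without_crossover:.4f}")
print(f"ARR with 32% crossover: {arr_with_crossover:.4f}")
```

Under these assumptions, the observed absolute risk reduction shrinks from 2.0% to about 1.4%, even though the true effect of treatment is unchanged.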

Missing data and loss to follow-up (LTFU): Was follow-up complete (i.e. were all patients accounted for at the end of the trial)?

Rules of thumb (e.g. LTFU is only a problem if ≥20%) are misleading; LTFU is important when its frequency is similar to or greater than the occurrence of the outcome of interest, or when the frequency or timing of LTFU differs between groups. An ITT analysis (see below) cannot correct the bias introduced by differences in LTFU between groups. In addition to LTFU, there may also be missing data due to factors such as participants missing scheduled visits, variables not being measured during a visit, or data entry errors.

If there is LTFU, consider doing your own rudimentary “worst-case scenario” analysis: Would the results remain similar if all participants lost in one treatment group had suffered the bad outcome whilst all those lost in the other group had had a good outcome, and vice versa?

E.g. #1 In a trial (El-Khalili N et al.) comparing 2 doses of quetiapine vs. placebo for adjunctive treatment of depression, completion of trial follow-up to week 6 was 77% with quetiapine 150 mg/day, 70% with quetiapine 300 mg/day, and 85% with placebo. Differences between groups were driven by a dose-dependent increase in the risk of discontinuation due to adverse events with quetiapine (1% with placebo, 11% with quetiapine 150 mg/day, and 18% with quetiapine 300 mg/day).
E.g. #2 In PARAMEDIC2, a RCT comparing epinephrine vs. placebo for out-of-hospital cardiac arrest, the primary outcome was survival at 30 days.
Table 6. Epinephrine vs. placebo in patients experiencing out-of-hospital cardiac arrest on the outcome of survival at 30 days.
| | Epinephrine | Placebo | OR (95% CI) |
| --- | --- | --- | --- |
| Actual analysis | 130/4012 (3.2%) | 94/3995 (2.4%) | 1.39 (1.06 to 1.82) |
| LTFU | 3 (<0.1%) | 4 (<0.1%) | |
| Worst-case analysis | 130/4015 (3.2%) | 98/3999 (2.5%) | 1.33 (1.02 to 1.74) |

This worst-case scenario does not change the statistical or clinical significance of the result, so the LTFU is not a concern for this outcome. 

E.g. #3 In PARAMEDIC2 a secondary outcome was a favorable neurological outcome at three months. 
Table 7. Epinephrine vs. placebo in patients experiencing out-of-hospital cardiac arrest on favorable neurological outcome at three months.
| | Epinephrine | Placebo | OR (95% CI) |
| --- | --- | --- | --- |
| Actual analysis | 82/3986 (2.1%) | 63/3979 (1.6%) | 1.31 (0.94 to 1.82) |
| LTFU | 29 | 20 | |
| Worst-case analysis | 82/4015 (2.0%) | 83/3999 (2.1%) | 0.98 (0.72 to 1.34) |

While the results are not statistically significant in either the actual or the worst-case analysis, the worst-case analysis shifts the CI to be notably more pessimistic regarding the effects of epinephrine on this outcome. The absolute difference between the actual and worst-case analyses is only 0.6%. However, in the context of the trial, where absolute survival was only 0.8% greater with epinephrine, this relatively small difference is nonetheless still important when considering the net benefit of epinephrine.
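
A rudimentary worst-case analysis like the one in Table 7 can be sketched in a few lines of Python. The function names are hypothetical, and the confidence interval uses the standard Wald approximation on the log-odds scale, which is an assumption about how the intervals were calculated (it does reproduce the Table 7 values to two decimal places).

```python
import math


def odds_ratio_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Odds ratio (group A vs. group B) with a Wald 95% CI on the log-odds scale."""
    a, b = events_a, n_a - events_a  # group A: with / without the outcome
    c, d = events_b, n_b - events_b  # group B: with / without the outcome
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


def worst_case_for_a(events_a, n_a, lost_a, events_b, n_b, lost_b):
    """Worst case for group A when the outcome counted is a GOOD outcome.

    Everyone lost from group A is assumed not to have had the good outcome
    (denominator grows, events unchanged); everyone lost from group B is
    assumed to have had it (both events and denominator grow).
    """
    return odds_ratio_ci(events_a, n_a + lost_a, events_b + lost_b, n_b + lost_b)


# Table 7: favorable neurological outcome at three months (epinephrine vs. placebo).
actual = odds_ratio_ci(82, 3986, 63, 3979)
worst = worst_case_for_a(82, 3986, 29, 63, 3979, 20)

print("Actual:     OR %.2f (%.2f to %.2f)" % actual)
print("Worst case: OR %.2f (%.2f to %.2f)" % worst)
```

The opposite extreme (all participants lost from the epinephrine group had a good outcome and all those lost from the placebo group did not) can be checked by swapping the roles of the two groups and inverting the resulting odds ratio.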

Intention-to-Treat (ITT): Were patients analyzed in the groups to which they were randomized?

There are numerous methods to carry out an ITT analysis (e.g. last observation carried forward (LOCF), mixed model for repeated measurements, sensitivity analyses). All of them rely on assumptions and no single method works in every situation.

E.g. In dementia trials evaluating the efficacy of cholinesterase inhibitors, LOCF is the most common approach to ITT. This occurs even though dementia violates the LOCF assumption that, if left untreated, disease severity will remain stable. Patients given cholinesterase inhibitors tend to discontinue earlier in the trial (earlier in the decline) due to intolerable side-effects, giving the appearance that the patient’s cognition has ceased to decline (Molnar FJ et al. 2008 and 2009).
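
The LOCF problem described above can be shown with a small Python sketch. The scores are hypothetical (a cognitive scale where higher is better, declining by 2 points per visit) and are not taken from any trial; the `locf` helper is an illustrative implementation, not a specific statistical package's function.

```python
def locf(observations):
    """Replace missing values (None) with the last observed value."""
    filled, last = [], None
    for value in observations:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical scores at five visits; both patients decline at the same rate.
completer = [30, 28, 26, 24, 22]
early_dropout = [30, 28, None, None, None]  # discontinued after the second visit

imputed = locf(early_dropout)

change_completer = completer[-1] - completer[0]  # decline observed over the full trial
change_imputed = imputed[-1] - imputed[0]        # apparent decline after LOCF

print("Imputed scores:", imputed)
print("Change (completer):", change_completer)
print("Change (early dropout, LOCF):", change_imputed)
```

In this sketch the early dropout appears to have declined by only 2 points rather than 8, purely because the last pre-dropout score was carried forward. If dropouts are more common or earlier in one arm, that arm looks better than it should.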

Reporting Bias: Are any important outcomes noted in the study protocol absent on publication?

If a trial does not report a clinically important outcome despite it being in the protocol, this warrants suspicion that the intervention did not provide benefit (or was possibly harmful) with respect to that outcome.

E.g. EPHESUS (Pitt B et al.) was a RCT comparing eplerenone vs. placebo in patients with left ventricular dysfunction after myocardial infarction. None of the published reports of EPHESUS have reported on quality of life despite this being a pre-specified outcome of the trial (Spertus JA et al.). As such, it is not possible to determine the impact (beneficial, harmful or neutral) of eplerenone on quality of life in these patients.

See “What proportion of the included studies report on this outcome?” for a further discussion on outcome reporting bias.

License

NERDCAT Copyright © 2022 by Ricky Turgeon is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
