2 Risk of bias: Are the results internally valid?
Randomization is the core of the RCT: it ensures that the play of chance alone dictates whether any given participant is assigned the intervention or the comparator(s). Because of this, baseline characteristics tend to be similar between randomized groups, though imbalances can still occur by chance. At the start of the trial, each group should therefore have a similar probability of experiencing any given outcome. If this similarity is properly maintained (i.e. neither bias nor confounding introduces differences between groups), then any differences in outcomes will be due either to treatment allocation or to chance.
Towards this objective of maintaining similar groups, other strategies (such as blinding of participants, clinicians, and investigators) are often implemented to minimize differences in care between groups over the course of the trial. If these strategies are not successful, then any differences in outcomes between groups could also be attributable to these differences in care, thus introducing bias.
The internal validity sections of this chapter will focus on describing key sources of bias in RCTs, how to identify them, and how to evaluate their impact on observed study results.
Checklist Questions
- Was the sequence generation random?
- Was the allocation sequence concealed until participants were enrolled and assigned to interventions?
- Were participants, clinicians, outcome assessors, and investigators blinded?
- Were there deviations from the intended intervention that arose because of the above?
- Could measurement or ascertainment of the outcome have differed between intervention groups?
- Could assessment of the outcome have been influenced by knowledge of intervention received?
- Did participants from the comparator group receive the intervention from the intervention group (or vice versa)?
- Were data for key outcomes available for all, or nearly all, participants randomized?
- Were patients analyzed in the groups to which they were randomized (ITT), or did researchers only count participants who were adherent to their study treatment (per protocol) or completed the full trial duration (completer analysis)?
- Are the ITT methods appropriate?
- Are any important outcomes included in the study protocol but absent from the publication? Is this justified?
Allocation Bias: Were patients appropriately randomized with allocation concealment?
Sequence generation (i.e. randomization)
Unclear or inadequate sequence generation exaggerates the relative benefits of an intervention by ~11% on average (Savović J et al.).
| Adequate Randomization | Inadequate Randomization |
| --- | --- |
| Computer-generated random sequence generation (preferred) | Quasi-randomization (e.g. alternation by case number or date of birth) |
| Random numbers table | Treatment assignment left to the discretion of the clinician |
| Coin toss | |
| Drawing cards | |
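As a minimal sketch of what "computer-generated random sequence generation" can look like in practice, the snippet below produces a permuted-block allocation list. The block size, arm labels, and fixed seed are arbitrary choices for illustration, not a recommendation of any particular scheme.

```python
# Sketch: computer-generated random allocation using permuted blocks.
# Block size, arm labels, and random seed are illustrative assumptions.
import random

def permuted_block_sequence(n_participants, block_size=4,
                            arms=("Intervention", "Comparator"), seed=2024):
    rng = random.Random(seed)                 # fixed seed so the list is reproducible/auditable
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_participants:
        block = list(arms) * per_arm          # equal allocation within each block
        rng.shuffle(block)                    # random order within the block
        sequence.extend(block)
    return sequence[:n_participants]

print(permuted_block_sequence(12))            # a 12-entry allocation list
```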
Allocation concealment
Unclear or inadequate allocation concealment exaggerates the relative benefits of an intervention by ~7% on average (Savović J et al.).
| Adequate allocation concealment | Inadequate allocation concealment |
| --- | --- |
| Central randomization (look for “interactive web-response system” or “interactive voice-response system” within a study manuscript) (preferred) | Allocation scheme posted on a bulletin board |
| Coded identical drug boxes/vials | Non-opaque, non-tamper-proof envelopes |
| Sequentially-numbered, tamper-proof, sealed opaque envelopes (preferably lined with cardboard or foil) | |
| On-site locked computer system | |
Blinding: Were participants, treating clinicians, outcome assessors, or investigators aware of treatment assignment during the trial?
Lack of or unclear blinding is associated with an average ~13% exaggeration of the relative benefits of an intervention for dichotomous outcomes (Savović J et al.), and a 68% exaggeration of relative benefits for subjective continuous outcomes (Hróbjartsson A et al.).
Note that double-blinding does not have a standardized definition and, consequently, further examination is needed to ascertain exactly who was blinded (Lang TA et al.).
Blinding of participants and clinicians
| Adequate blinding of participants and clinicians | Inadequate blinding of participants and clinicians |
| --- | --- |
| Used an identical placebo/control product without indication that treatments were distinguishable | PROBE: prospective, randomized, open-label, blinded-endpoint trial (“open-label” refers to a trial in which non-blinding is part of the design, not to cases where blinding is simply inadequate) |
Blinding of outcome assessors
Awareness of treatment allocation by participants and clinicians may introduce performance bias, whereas awareness of allocation by outcome assessors may introduce detection bias. This is compounded when participants or their clinicians are also the outcome assessors (e.g. a patient aware of treatment allocation asked to rate their pain or fill out a quality-of-life questionnaire). Lack of blinding is a particularly important source of bias with the use of subjective outcomes – one review (Wood L et al.) found that lack of blinding exaggerated the OR of subjective outcomes by ~30%. Conversely, the same review found no statistically significant bias was introduced by lack of blinding for objective outcomes. This review provided evidence that all-cause mortality is particularly resistant to detection bias even when trials are not blinded.
| Adequate blinding of outcome assessors | Difficult situations to blind |
| --- | --- |
| Outcomes adjudicated by an independent central adjudication committee | The intervention has an effect on a readily measurable biomarker, or the drug has an easily observable adverse effect profile (e.g. iron causing darkened stools) |
Some situations initially thought to be impossible to blind can be successfully blinded with some ingenuity.
Were there differences between groups in the receipt of co-interventions?
Co-interventions may introduce bias if they affect the outcomes of interest and are distributed differently between groups.
Was outcome monitoring conducted consistently between groups? If not, was this likely to bias the results?
Outcome measures ought to be consistent between groups with regards to:
- Which outcomes were examined
- How they were examined
- How frequently they were examined
Crossover bias: Did participants from the comparator group receive the intervention from the intervention group (or vice versa)?
Crossover bias attenuates differences in outcomes between groups as a group accrues more participants that are taking the treatment intended for the other arm (e.g. patients in the placebo group receiving active treatment). This makes superiority harder to demonstrate and makes non-inferiority easier to demonstrate. The extent of bias introduced will depend on the extent of crossover/contamination between groups.
Missing data and loss to follow-up (LTFU): Was follow-up complete (i.e. were all patients accounted for at the end of the trial)?
Rules of thumb (e.g. LTFU is only a problem if ≥20%) are misleading; LTFU is important when it is similar to or greater than the occurrence of the outcome of interest, or when the frequency or timing of LTFU differs between groups. An ITT analysis (see below) cannot correct the bias introduced by differences in LTFU between groups. In addition to LTFU, there may also be missing data due to factors such as participants missing scheduled visits, variables not being measured during a visit, or data entry errors.
If there is LTFU, consider doing your own rudimentary “worst-case scenario” analysis: Would the results remain similar if all participants lost in one treatment group had suffered the bad outcome whilst all those lost in the other group had had a good outcome, and vice versa?
| | Epinephrine | Placebo | OR (95% CI) |
| --- | --- | --- | --- |
| Actual Analysis | 130/4012 (3.2%) | 94/3995 (2.4%) | 1.39 (1.06 to 1.82) |
| LTFU | 3 (<0.1%) | 4 (<0.1%) | |
| Worst-Case Analysis | 130/4015 (3.2%) | 97/3999 (2.4%) | 1.35 (1.03 to 1.76) |
This worst-case scenario does not change the statistical or clinical significance of the result, so the LTFU is not a concern for this outcome.
| | Epinephrine | Placebo | OR (95% CI) |
| --- | --- | --- | --- |
| Actual Analysis | 82/3986 (2.1%) | 63/3979 (1.6%) | 1.31 (0.94 to 1.82) |
| LTFU | 29 | 20 | |
| Worst-Case Analysis | 82/4015 (2.0%) | 83/3999 (2.1%) | 0.98 (0.72 to 1.34) |
While the results are not statistically significant in either the actual or the worst-case analysis, the worst-case analysis shifts the CI to be notably more pessimistic regarding the effects of epinephrine on this outcome. The absolute difference between the actual and worst-case analyses is only 0.6%. However, in the context of the trial, where absolute survival was only 0.8% greater with epinephrine, this relatively small difference is nonetheless still important when considering the net benefit of epinephrine.
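For readers who want to reproduce this kind of rudimentary check, the sketch below recomputes an odds ratio and a normal-approximation (Wald) 95% confidence interval from the event counts in the first table above, for both the actual and worst-case counts. The function name and the Wald interval are assumptions of this sketch, not the published trial's statistical methods.

```python
# Sketch: odds ratio with a normal-approximation (Wald) 95% CI, computed
# from event counts, then repeated with the worst-case counts from the
# first table above.
import math

def odds_ratio_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Odds ratio (group A vs group B) with an approximate 95% CI."""
    odds_a = events_a / (n_a - events_a)
    odds_b = events_b / (n_b - events_b)
    or_ = odds_a / odds_b
    se_log_or = math.sqrt(1 / events_a + 1 / (n_a - events_a)
                          + 1 / events_b + 1 / (n_b - events_b))
    lo = math.exp(math.log(or_) - z * se_log_or)
    hi = math.exp(math.log(or_) + z * se_log_or)
    return round(or_, 2), round(lo, 2), round(hi, 2)

# Actual analysis: 130/4012 (epinephrine) vs 94/3995 (placebo)
print(odds_ratio_ci(130, 4012, 94, 3995))   # (1.39, 1.06, 1.82)

# Worst-case counts from the table above
print(odds_ratio_ci(130, 4015, 97, 3999))   # (1.35, 1.03, 1.76)
```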
Intention-to-Treat (ITT): Were patients analyzed in the groups to which they were randomized?
There are numerous methods to carry out an ITT analysis (e.g. last observation carried forward (LOCF), mixed model for repeated measurements, sensitivity analyses). All of them rely on assumptions and no single method works in every situation.
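As a minimal sketch of one such method, the snippet below applies last observation carried forward (LOCF) within each participant using pandas. The column names and toy values are assumptions for illustration only.

```python
# Sketch: last observation carried forward (LOCF) within each participant.
# The column names and toy values are illustrative assumptions; missing
# visits are represented as NaN.
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2],
    "week":        [1, 2, 3, 1, 2, 3],
    "pain_score":  [7.0, 6.0, np.nan, 5.0, np.nan, np.nan],
})

# Carry each participant's last observed score forward into later visits.
visits["pain_score_locf"] = visits.groupby("participant")["pain_score"].ffill()
print(visits)
```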
Reporting Bias: Are any important outcomes noted in the study protocol absent from the publication?
If a trial does not report a clinically important outcome despite it being in the protocol, this warrants suspicion that the intervention did not provide benefit (or was possibly harmful) with respect to that outcome.
See “What proportion of the included studies report on this outcome?” here for a further discussion on outcome reporting bias.
Randomized controlled trials are those in which participants are randomly allocated to two or more groups which are given different treatments.
Systematic deviation of an estimate from the truth (either an overestimation or underestimation) caused by a study design or conduct feature. See the Catalog of Bias for specific biases, explanations, and examples.
The extent to which the study results are attributable to the intervention and not to bias. If internal validity is high, there is high confidence that the results are due to the effects of treatment (with low internal validity entailing low confidence).
The process by which allocation of participants to groups is conducted. Computer generation and coin tosses are examples of methods of random sequence generation.
Participant outcomes are analyzed according to their assigned treatment group, irrespective of treatment received. A common "modified ITT" approach used in pharmacotherapy trials considers only participants who received at least one dose of the study drug (thereby excluding participants who were randomized but did not receive any study intervention).
This type of analysis includes only patients who sufficiently adhered to the treatment of the group to which they were assigned.
Calculates the effect of an intervention via a fractional comparison with the comparator group (i.e. intervention group measure ÷ comparator group measure). Used for binary outcomes. Relative risk, odds ratio, and hazard ratio are all expressions of relative effect. For example, if the risk of developing neuropathy was 1% in the treatment group and 2% in the comparator group, then the relative risk is 0.5 (1 ÷ 2). See the Absolute Risk Differences and Relative Measures of Effect discussion here for more information.
Refers to the process that prevents patients, clinicians, and researchers from predicting the intervention group to which the next patient will be assigned. This is different from blinding: allocation concealment refers to patients/clinicians/outcome assessors/etc. being unaware of group allocation prior to randomization, whereas blinding refers to remaining unaware of group allocation after randomization. Allocation concealment is a necessary condition for blinding, and it is always feasible to implement.
Double-blinding does not have a standardized definition and, consequently, further examination is needed to ascertain exactly who was blinded (Lang TA et al.).
Odds ratios are the ratio of the odds (events divided by non-events) in the intervention group to the odds in the comparator group. For example, if the odds of an event in the treatment group are 0.2 and the odds in the comparator group are 0.1, then the OR is 2 (0.2/0.1). See here for a more detailed discussion.
Occurs when participants receive treatment intended for the other study group (a phenomenon known as contamination). For example, a participant assigned to the placebo group may end up taking active treatment. This bias results in underestimating the difference between groups.
A superiority trial tests whether an intervention has a greater effect than a comparator with respect to the primary outcome. This contrasts with non-inferiority trials.
Loss to follow-up may occur when participants stop coming to study follow-up visits, do not answer follow-up phone calls, and cannot otherwise be assessed for study outcomes. This leads to missing data from the time they became "lost". Underlying reasons may include leaving the trial without informing investigators, moving to a new location, debilitation due to illness, or death.
A primary outcome is an outcome on which trial design choices are based (e.g. sample size calculations). Primary outcomes are not necessarily the most important outcomes.
A secondary outcome is any outcome that is not a primary outcome (i.e. secondary outcomes are not the focal point of design choices like sample size). Secondary outcomes may be more clinically important than the primary outcome.
Absolute risk difference is the risk in one group compared to (minus) the risk in another group over a specified period of time. For example, if the absolute risk of myocardial infarction over 5 years was 15% for the comparator and 10% for the intervention, then the absolute risk difference was 5% (15% - 10%) over 5 years. See here for further discussion.
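As a minimal sketch tying these arithmetic definitions together, the snippet below recomputes the worked numbers from the relative-effect, odds-ratio, and absolute-risk-difference entries above; the function names are assumptions for illustration.

```python
# Sketch: the relative and absolute effect measures defined above, recomputed
# from the worked numbers used in those definitions. Function names are
# illustrative assumptions.
def relative_risk(risk_intervention, risk_comparator):
    return risk_intervention / risk_comparator

def odds_ratio(odds_intervention, odds_comparator):
    return odds_intervention / odds_comparator

def absolute_risk_difference(risk_comparator, risk_intervention):
    return risk_comparator - risk_intervention

print(relative_risk(0.01, 0.02))             # 0.5  (neuropathy example)
print(odds_ratio(0.2, 0.1))                  # 2.0  (odds ratio example)
print(absolute_risk_difference(0.15, 0.10))  # ≈ 0.05, i.e. 5% over 5 years (MI example)
```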
A method of evaluating patients who have dropped out partway through a trial when performing an intention-to-treat analysis. It treats the patients as if they were still in the trial and their outcome status remained the same as when they were last observed. For example, a patient who reported a pain score of 7/10 at day 3 and dropped out prior to the 1-week follow-up would be analyzed as having 7/10 pain at the end of 1 week (despite no outcome data being recorded past day 3).