8 Hypothesis Testing
8.5 Errors of Inference
Making decisions about hypotheses is inference based on evidence and logic. Inference, however, comes with no guarantee of being right; in fact, being right all the time is guaranteed to be impossible. All the evidence and logic in the world cannot ensure 100 percent certainty of making the right decision, simply because of the probabilistic nature of statistical inference. As long as we work with samples to estimate populations, some amount of uncertainty will be unavoidable, or it wouldn't be called inference but knowing.
Logically speaking, since we have two options given a null hypothesis (to reject or not to reject), we can make two types of mistakes. One is to be wrong about rejecting the null hypothesis, the other to be wrong about not rejecting it.
You might be rolling your eyes at this — well duh! — but bear with me: these really are the two types of statistical error, imaginatively called Type I and Type II.
If we reject a true null hypothesis, we commit a Type I error. If we fail to reject a false null hypothesis, we commit a Type II error. Before I even explain these further, make a mental note that since we either reject or fail to reject a null hypothesis (one or the other), at any given time we can make only one of the two types of errors. If you rejected your null hypothesis, the only error you could have committed is Type I; if you did not reject your null hypothesis, the only error you could have made is Type II.
The trick is that we never know if we have made an error or not. (If we knew, we wouldn’t be making it in the first place, right?) We only know that the possibility that we have made an error exists. However, as with everything about inference we have discussed so far, what we can do is to quantify the uncertainty as best as we can.
Table 8.1 summarizes the errors of inference based on the (unknown) real situation and the (uncertain) decision we have made about it, through an analogy of a criminal trial. The null hypothesis then stands for “innocent” (no effect/difference/association, etc.) while the alternative hypothesis stands for “guilty” (there is an effect/difference/association, etc.).
Table 8.1 Errors of Statistical Inference
| | Reality: Guilty (H0 is false) | Reality: Innocent (H0 is true) |
|---|---|---|
| Reject H0: Innocent ⇒ Guilty Verdict | Correct Decision (1-β) | Type I Error (α) |
| Fail to Reject H0: Innocent ⇒ Innocent Verdict | Type II Error (β) | Correct Decision |
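If it helps to see the logic of Table 8.1 spelled out step by step, here is a tiny Python sketch of my own (an illustration, not something from the book) that maps each decision/reality combination to its cell of the table:

```python
def classify_outcome(rejected_h0: bool, h0_is_true: bool) -> str:
    """Map a testing decision and the (unknown) truth to a cell of Table 8.1."""
    if rejected_h0:
        # We "convicted": wrong only if H0 was actually true (Type I error).
        return "Type I error (alpha)" if h0_is_true else "correct decision (power, 1 - beta)"
    # We "acquitted": wrong only if H0 was actually false (Type II error).
    return "correct decision" if h0_is_true else "Type II error (beta)"
```

Note that the function needs `h0_is_true` as an input, which is exactly what we never observe in practice; that is why the errors can be quantified but never detected in any single test.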
Recall that to reject the null hypothesis, we had to have a test with a p-value no greater than the pre-selected level of significance α, i.e., p ≤ α. The level of significance essentially amounted to how much probability of being wrong we were willing to tolerate (so as long as the probability of obtaining the observations we did, given a true null hypothesis, i.e., the p-value, was below that threshold, we would be fine).
Now consider that I just defined the Type I error as being wrong about rejecting a true null hypothesis, and ta-dam! the probability of a Type I error is exactly equal to α, the significance level! The great thing about it is that it is not only precise, it is also utterly under our control, as we are the ones who decide how much error (regarding "convicting an innocent") we want to tolerate. If we want a smaller such chance, we can just raise the bar, so that only the smallest p-values can pass under the lowest possible α. (Make sure you do not confuse p and α; in particular, p does not show the probability of being wrong. Even the significance level is not the true error rate (Sellke, Bayarri & Berger, 2001), as you can see here if you're curious: https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-correctly-interpret-p-values.)
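You can check this claim with a quick simulation. Here is a minimal Python sketch (my illustration, assuming NumPy and SciPy's one-sample t-test, with made-up population values): when the null hypothesis is true by construction, the decision rule p ≤ α rejects in roughly a fraction α of repeated samples, and every one of those rejections is a Type I error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_sims, n = 10_000, 30

# Simulate many studies in which the null hypothesis is TRUE:
# the population mean really is 100, and we test H0: mu = 100.
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=100, scale=15, size=n)
    _, p_value = stats.ttest_1samp(sample, popmean=100)
    if p_value <= alpha:  # our decision rule: reject when p <= alpha
        rejections += 1

# H0 is true by construction, so every rejection here is a Type I error.
print(f"Estimated Type I error rate: {rejections / n_sims:.3f} (should be near {alpha})")
```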
On the other hand, when we fail to reject a false null hypothesis (i.e., when we "let a guilty person go free as if innocent"), we make a Type II error, whose probability is denoted β. At the same time, as you can see in Table 8.1, the probability of correctly rejecting a false null hypothesis is a neat 1-β (after all, the decision has only two options), known as the power of the test.
Unfortunately, there is no way for us to directly control β; your best bet is to have a large sample size, which increases the test's power to detect an effect/difference/"guilt" where it truly exists, and thus indirectly decreases β.
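The sample-size effect is also easy to simulate. In the following Python sketch (again my own illustration, with an assumed true mean of 105 tested against H0: μ = 100), the null hypothesis is false in every simulated study, so each failure to reject is a Type II error; watch power rise and β fall as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_sims = 0.05, 5_000

# H0: mu = 100 is FALSE here (the true mean is 105),
# so every failure to reject is a Type II error.
for n in (10, 30, 100):
    rejections = sum(
        stats.ttest_1samp(rng.normal(105, 15, size=n), popmean=100)[1] <= alpha
        for _ in range(n_sims)
    )
    power = rejections / n_sims  # estimated 1 - beta
    print(f"n = {n:3d}: power = {power:.2f}, beta = {1 - power:.2f}")
```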
Well then, you might logically ask, why don't we just decrease both Type I and Type II errors? I am afraid you cannot do that: Type I and Type II errors pull in opposite directions, so there is a trade-off between them. Think about it: if you hate the thought of convicting an innocent person, and say you would never do it, you will end up deciding "innocent" all the time, thus inevitably at some point letting a criminal go. If you decide that you hate letting criminals go, you can convict everyone, but then of course, eventually you will inevitably end up convicting an innocent person.
In other words, the harder you make it to reject a null hypothesis/"to convict" (by making α the lowest possible), the higher the chances you will commit a Type II error, failing to reject a false null hypothesis (and you will let a criminal slip free). The easier you make it to reject a null hypothesis/"to convict" (by making α as high as you want), of course, the higher the odds of committing a Type I error, rejecting a true null hypothesis (and convicting an innocent person).
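The trade-off, too, shows up in simulation. A final Python sketch (my illustration, under the same assumed true mean of 105): holding the sample size fixed while the null hypothesis is false, tightening α makes β climb.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, n_sims = 20, 5_000

# H0: mu = 100 is FALSE (the true mean is 105). Watch beta grow as we
# make alpha stricter, i.e., as we make it harder "to convict".
samples = rng.normal(105, 15, size=(n_sims, n))
p_values = stats.ttest_1samp(samples, popmean=100, axis=1).pvalue

for alpha in (0.10, 0.05, 0.01, 0.001):
    beta = np.mean(p_values > alpha)  # failures to reject a false H0
    print(f"alpha = {alpha:<5}: beta = {beta:.2f}")
```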
In summary, the errors of inference are unavoidable: every time we make a decision about the null hypothesis one way or the other, we run the risk of making one of the statistical errors. With a careful selection of α and a comfortably large sample size, making an error should not worry you too much, but do not forget that it remains a distinct possibility.
I end this chapter with a warning.
Watch out!! #16 . . . for Mixing Up Your Error Concepts
The statistical errors presented in this chapter aside, you might recall that we discussed two other error concepts, the random error and the standard error. Make a note of all three: 1) the random error, 2) the standard error, and 3) the Type I and Type II errors of statistical inference. They are all different concepts.
As a brief reminder, the random error is an inevitable corollary of sampling and reflects the fact that a sample is different from the population from which it was taken; the standard error is simply a formula for the standard deviation of the sampling distribution; and finally, the Type I and Type II statistical errors apply to decisions about the null hypothesis during testing.
Now that you know how hypothesis testing works in principle, let's get some associations between variables tested with their appropriate tests, in Chapter 9 and Chapter 10.