8 Hypotheses Testing

8.4 Level of Significance and the p-Value

The level of significance is the concept we use to adjudicate whether the probability (of our results if the null hypothesis is true) is too high to dismiss the null hypothesis or low enough to allow us to reject it. In other words, the level of significance is what we use to proclaim results as statistically significant (when we reject the null hypothesis) or not statistically significant (when we fail to reject the null hypothesis).

 

Think about it this way: recall that with confidence intervals we had selected 95% certainty and 99% certainty as meaningful levels of confidence. What is left is 5% and 1% “uncertainty”, as it were, which we agree to tolerate. These 5% or 1% are distributed equally between the two tails of the normal distribution (2.5% on each side or 0.5% on each side, respectively). They also correspond to z=1.96 and z=2.58. Following the logic of Example 8.2 (A) from the previous section, in order to reject a null hypothesis, we want the probability to be lower than these 5% or 1% (so that we can “feel confident enough”).

 

And this is exactly it: when we put it that way, saying that we want the probability (of our results if the null hypothesis is true) — called a p-value — to be less than 5%, we have essentially set the level of significance at 0.05. If we want the probability to be less than 1%, we have set the level of significance at 0.01. We can go even further: we might want to be extra cautious and to want a “confidence” of 99.9%, so that we want the probability to be less than 0.1% — then we have set the level of significance at 0.001.

 

These three numbers — 0.05, 0.01, and 0.001 — are the most commonly used levels of significance. The level of significance is denoted by the lower-case Greek letter alpha, i.e., α, thus we usually choose one of the following:

 

\alpha=0.05

\alpha=0.01

\alpha=0.001

 

You can think of the significance level as the acceptable probability of being wrong — and what is acceptable is left to the discretion of the researcher, subject to the purposes of the particular study.

 

Following the logic presented in Example 8.2(A) then, if the probability of the result under the null hypothesis — the p-value — is smaller than a pre-selected significance level α, the null hypothesis is rejected and the result is considered statistically significant[1]. This is denoted in one of the following ways:

 

p ≤ 0.05

p ≤ 0.01

p ≤ 0.001[2]

 

To summarize, when a hypothesis is tested, we end up with an associated p-value (again, the probability of the observed sample statistics if the null hypothesis is true). We compare the p-value to the pre-selected significance level α: if p ≤ α, the results are statistically significant and therefore generalizable to the population.
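If it helps to see the decision rule spelled out mechanically, here is a minimal sketch in Python; the numbers are hypothetical, purely for illustration:

alpha = 0.05       # pre-selected level of significance
p_value = 0.03     # hypothetical p-value obtained from a test

# The decision rule described above: reject H0 only if p <= alpha.
if p_value <= alpha:
    print("Reject H0: the result is statistically significant.")
else:
    print("Fail to reject H0: the result is not statistically significant.")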

 

So far so good? Good. However, unfortunately this isn’t all (sorry!). What I have presented above is the most conventional treatment of how to use and interpret p-values. It is attractively straightforward — but it’s also arbitrary, and its true interpretation is the subject of an ongoing debate. As an introduction to the topic, I will leave it at that, but you should be aware that there’s more to the p-value, and that its usage has been (rightfully) questioned and challenged in recent years.[3]

 

Going back to our example from the previous section, let’s see how p-values can change due to particular features of the study, like the sample size. Example 8.2(B) illustrates.

 

Example 8.2(B) Employee Productivity (Finding Statistically Non-significant Results, N=25)

 

Imagine that we had the same information as in Example 8.2(A); however, 25 employees took the training course instead of 100, and their average score was 620. Then we have:

 

\mu=600

\sigma=100

\overline{x}=620

N=25

 

We still want to know the probability of a score of 620 if the training course didn’t contribute to the gain, i.e., the probability of a score of 620 under the condition of the null hypothesis.

  • H0: The training course did not affect productivity (the 620 score was due to random chance); \mu_\overline{x}=\mu.
  • Ha: The training course affected productivity (the 620 score was a true gain); \mu_\overline{x}\neq\mu.

The new standard error is:

 

\sigma_\overline{x} =\frac{\sigma}{\sqrt{N}}=\frac{100}{\sqrt{25}}=\frac{100}{5}=20

 

Then the z-value of 620 is:

 

z=\frac{\overline{x}-\mu}{\sigma_\overline{x}} =\frac{620-600}{20}=\frac{20}{20}=1

 

Given the properties of the normal curve, we know that 68% of all means in infinite sampling will fall between ±1 standard error (i.e., between 580 and 620), 95% will fall between ±1.96 standard errors (i.e., approximately between 560 and 640), and 99% will fall between ±2.58 standard errors (i.e., approximately between 548 and 652). The score of 620 has z=1 — it falls quite close to the not-trained group’s mean of 600.

 

In terms of probabilities, consider the following: z=1 has a p>0.30. Assuming the null hypothesis is true, our calculations show that the 620 score will appear more than 30% of the time due to random chance, which is a lot more than the 5% (at α=0.05) that we are willing to tolerate. As such, we cannot reject the null hypothesis: we do not have enough evidence to conclude that the gain in productivity of 20 points which the 25 employees demonstrated is statistically significant. In other words, we don’t have enough evidence that the training course was effective. (This doesn’t mean that it wasn’t effective, just that at this point, in this particular study, we don’t have enough evidence to say it was.)
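For readers who like to verify the arithmetic, here is a short Python sketch of Example 8.2(B)’s calculations, assuming the scipy library is available; the exact two-tailed p-value it produces, about 0.317, is where the p>0.30 above comes from:

from math import sqrt
from scipy.stats import norm

mu = 600        # population mean (not-trained employees)
sigma = 100     # population standard deviation
x_bar = 620     # sample mean of the trained employees
n = 25          # sample size

se = sigma / sqrt(n)             # standard error: 100 / 5 = 20
z = (x_bar - mu) / se            # z-value: (620 - 600) / 20 = 1.0

# Two-tailed p-value: the probability of a sample mean at least this far
# from mu, in either direction, if the null hypothesis is true.
p_value = 2 * norm.sf(abs(z))    # about 0.317

alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.3f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")   # Fail to reject H0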

 

We can also see the correspondence with confidence intervals:

  • 95% CI: \overline{x}\pm1.96\times\sigma_\overline{x} = 620\pm1.96\times20=620\pm39.2=(580.8; 659.2)

That is, we can be 95% certain that the average score for the population of employees who take the training course would be between roughly 581 points and 659 points. The average general score of 600 points is a plausible value for \mu_\overline{x}, which is consistent with our decision to not reject the null hypothesis.
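The same confidence interval can be reproduced in a couple of lines (again a sketch, assuming scipy is available):

from scipy.stats import norm

x_bar, se = 620, 20
z_crit = norm.ppf(0.975)          # about 1.96 for a 95% CI (2.5% in each tail)
lower = x_bar - z_crit * se       # about 580.8
upper = x_bar + z_crit * se       # about 659.2
print(f"95% CI: ({lower:.1f}, {upper:.1f})")   # 600 falls inside, so H0 is not rejected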

 

 

Again, Example 8.2 is a heuristic device, used only to explain the logic of hypotheses testing. Of course, normally we wouldn’t have information about population parameters, and we would be using sample statistics (i.e., we would use not only the sample mean \overline{x} but also the sample standard deviation s to calculate the estimated standard error s_\overline{x}). (Not to mention that we would have two different standard deviations, one for the trained group and one for the not-trained group of employees.) As you learned in the previous chapter, this moves us from using the z-distribution to the t-distribution with given degrees of freedom. Recall that with a sample size of about 100 — i.e., with df of about 100 — the two distributions effectively converge.

 

Here then is a quick-and-dirty method you can use as a preliminary indication of whether something will be statistically significant. Since z=1.96 corresponds to 5% probability (2.5% in each tail), and z=2.58 corresponds to 1% probability (0.5% in each tail), even without knowing the exact p-value associated with a given z-value, you can guess that a z<1.96 (in absolute value) will be non-significant while a z>1.96 will be significant at α=0.05; similarly, a z>2.58 will be statistically significant at α=0.01[4]. As samples used in sociological research are commonly of N>100, the same insight applies to the corresponding t-values with df≥100. Understand, however, that this is not an official way to test hypotheses or report findings: to do that, you always need to report the exact p-value associated with a z-value or a t-value with given df[5].
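In practice the exact p-value is something software reports for you. As an illustration, here is a minimal Python sketch (assuming scipy; the z- and t-values are hypothetical) of how a z-value, or a t-value with given df, translates into a two-tailed p-value:

from scipy.stats import norm, t

z = 2.10                                 # hypothetical z-value
p_from_z = 2 * norm.sf(abs(z))           # about 0.036 -> significant at alpha = 0.05

t_value, df = 2.10, 120                  # hypothetical t-value and degrees of freedom
p_from_t = 2 * t.sf(abs(t_value), df)    # about 0.038 -> very close to the z-based p-value,
                                         # since with df >= 100 the two distributions converge
print(round(p_from_z, 3), round(p_from_t, 3))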

 

One-tailed tests. Finally, a note on one-tailed tests. While I advise you against using them yourself at the beginner researcher level, it is not a bad idea to know that they exist and what they are. Briefly, the idea is that if we have a good reason to suspect not only a difference/effect but a difference/effect with a specific direction (i.e., positive or negative), we can specify the hypotheses accordingly. To use Example 8.2(A) again, say we think there is no possibility that the training course decreased productivity scores. Then we can state the hypotheses as:

  • H0: The training course either did not affect productivity or decreased it; \mu_\overline{x}\leq\mu.
  • Ha: The training course increased productivity; \mu_\overline{x}>\mu.

 

This is a stronger claim (that’s why it needs to be well-justified) — we test not a difference (which can be either positive or negative) but specifically an increase. Thus, we move the significance level to only one of the tails, as it were — the positive (right) tail — so instead of 2.5% being there, the full 5% is.

 

This change in probability essentially “moves” the z-value corresponding to significance closer to the mean; now a smaller z-value will have the p-value necessary to achieve statistical significance. To be precise, 5% (2.5% in each tail) corresponded to z=1.96; all 5% in the right tail corresponds to z=1.65[6]. This obviously “lowers the bar” of achieving statistical significance without changing the level of significance α itself, and makes rejecting the null hypothesis easier, hence my description of the two-tailed test as more conservative (and my insistence on using it instead of a one-tailed test).
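If you want to see where the 1.96 and 1.65 cut-offs come from, a short sketch (again assuming scipy) makes the point: with a one-tailed test the same 5% sits entirely in one tail, so the critical z-value drops.

from scipy.stats import norm

alpha = 0.05
two_tailed_crit = norm.ppf(1 - alpha / 2)   # about 1.96 (2.5% in each tail)
one_tailed_crit = norm.ppf(1 - alpha)       # about 1.645 (all 5% in the right tail)
print(round(two_tailed_crit, 2), round(one_tailed_crit, 2))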

 

Before we move on to the last section of this theoretical chapter, the promised warning about the meanings of the term significance.

 

Watch out!! #15 … for Mistaking Statistical Significance for Magnitude or Importance

 

If you have been paying attention, you have learned by now that statistical significance has a very narrow meaning. To have a statistically significant result simply means that the probability of observing our sample statistics (or difference, or effect, etc.) as they are, given that the null hypothesis is true, is small enough to be (highly) unusual — so relatively rare as to indicate that what we have is not a result of random sampling variation but of an untrue null hypothesis.

 

None of this says anything about how big a difference/effect is — in fact, it can be quite small and still be statistically significant, given a large enough sample size and other study specifications[7].

 

Similarly, many people unfamiliar with statistics take statistical significance to mean that the findings are of great importance. Again, nothing about statistical significance confers great meaning to, or implies the importance of, statistically significant findings. One can study an objectively trivial/unimportant issue and have statistically significant findings of no relevance to anyone whatsoever.

 

To conclude, keep these distinctions in mind — between the conventional usage of the word significant (meaning either important or big) and statistical significance — both when interpreting and reporting results and when reading and evaluating existing research.

 

 

When testing hypotheses, I defined the significance level as a sort of probability of being wrong that we are willing to tolerate. This implies that there is a likelihood of making an erroneous decision about the null hypothesis (whether to reject it or not). The next and final section deals with just that.

 


  1. Note the difference between α and the p-value. While α indicates what probability of being wrong we are willing to tolerate, the actual p-value we obtain is not the probability of being wrong. The p-value, again, is the probability of our result if the null hypothesis were true; in other words, if the null hypothesis is in fact true, and our p-value is, say, 0.03, we'd obtain our results 3% of the time simply due to random sampling error.
  2. In published research you will find results marked by one asterisk, two asterisks, or three asterisks. These correspond to significance at the level used: α=0.05, α=0.01, and α=0.001, respectively. The smaller the level of significance, the more strongly statistically significant the result is (i.e., most consider α=0.001 to indicate "highly statistically significant" results). (If you happen upon a dagger (†), it indicates significance at the α=0.1 level, or 10% probability of being wrong, which most researchers consider too high, but some still use.)
  3. You can find plenty of information on the topic online; from journals banning the use of p-values and hypothesis testing in favour of effect size (Basic and Applied Social Psychology, see Trafimow & Marks, 2015 https://www.tandfonline.com/doi/full/10.1080/01973533.2015.1012991), to calls to abandon statistical significance (e.g., McShane, Gal, Gelman, Robert & Tackett, 2019 https://www.tandfonline.com/doi/abs/10.1080/00031305.2018.1527253), to others defending statistical significance and p-values (e.g., Kuffner & Walker, 2016 https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1277161?src=recsys; Greenland, 2019 https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529625?src=recsys). One thing is clear: p-values and levels of significance have become increasingly controversial. Still, the American Statistical Association's position is that although caution against over-reliance on a single indicator is necessary, p-values can still be used, alongside other appropriate methods: "Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible" (Wasserstein & Lazar, 2016 https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108?src=recsys). Finally, if you really do not want to overstate what the p-value actually shows, see Greenland et al. (2016) for a list of common misinterpretations and over-interpretations of p-values, confidence intervals, and tests of significance (here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/). Because of its enormity (it goes way beyond the scope of this book), the topic is still conventionally taught as I presented it above, at least at the introductory level.
  4. Obviously, for negative z-values we'll have all these in reverse: a z between −1.96 and 0 will be non-significant, while a z<−1.96 will be significant, etc.
  5. You can find a handy online p-value calculator for t-values here: https://goodcalculators.com/student-t-value-calculator/.
  6. You can check it here by selecting "up to Z": https://www.mathsisfun.com/data/standard-normal-distribution-table.html.
  7. This is actually one of the reasons some have called for abandoning p-values, statistical significance, and hypothesis testing altogether: statistical significance is not indicative of effect size and is frequently over-stated to mean more than it does; at the same time, over-reliance on p-values decreases attention to effect size, careful study design, context, etc.
