{"id":2083,"date":"2019-10-25T19:18:49","date_gmt":"2019-10-25T23:18:49","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/simplestats\/?post_type=chapter&#038;p=2083"},"modified":"2019-11-06T17:04:57","modified_gmt":"2019-11-06T22:04:57","slug":"8-4-level-of-significance-and-the-p-value","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/simplestats\/chapter\/8-4-level-of-significance-and-the-p-value\/","title":{"raw":"8.4 Level of Significance and the p-Value","rendered":"8.4 Level of Significance and the p-Value"},"content":{"raw":"[latexpage]\r\n\r\nThe concept\u00a0<em>level of significance<\/em> is used to adjudicate whether the probability (of our results if the null hypothesis is true) is too high to dismiss the null hypothesis or low enough to allow us to reject the null hypothesis. In other words, the level of significance is what we use to proclaim results as statistically significant (when we reject the null hypothesis) or not statistically significant (when we fail to reject the null hypothesis).\r\n\r\n&nbsp;\r\n\r\nThink about it this way: recall that with confidence intervals we had selected 95% certainty and 99% certainty as meaningful levels of confidence. What is left is 5% and 1% \"uncertainty\", as it were, which we agree to tolerate. These 5% or 1% are distributed equally between the two tails of the normal distribution (2.5% on each side or 0.5% on each side, respectively). They also correspond to <em>z<\/em>=1.96 and <em>z<\/em>=2.58. Following the logic of Example 8.2 (A) from the previous section, in order to reject a null hypothesis, we want the probability to be lower than these 5% or 1% (so that we can \"feel confident enough\").\r\n\r\n&nbsp;\r\n\r\nAnd this is exactly it: When we put it that way, saying that we want the probability (of our results if the null hypothesis is true) -- called a <em>p-value<\/em> -- to be less than 5%, we have essentially set the level of significance at 0.05. 
If we want the probability to be less than 1%, we have set the level of significance at 0.01. We can go even further: we might want to be extra cautious and want\u00a0a \"confidence\" of 99.9%, so that we want the probability to be less than 0.1% -- then we have set the level of significance at 0.001.\r\n\r\n&nbsp;\r\n\r\nThese three numbers -- 0.05, 0.01, and 0.001 -- are the most commonly used levels of significance. The level of significance is denoted by the lowercase Greek letter alpha,\u00a0<em>\u03b1<\/em>; thus we usually choose one of the following:\r\n\r\n&nbsp;\r\n\r\n$\\alpha=0.05$\r\n\r\n$\\alpha=0.01$\r\n\r\n$\\alpha=0.001$\r\n\r\n&nbsp;\r\n\r\n<strong>You can think of the significance level as the acceptable probability of being wrong<\/strong> -- and what is acceptable is left to the discretion of the researcher, subject to the purposes of the particular study.\r\n\r\n&nbsp;\r\n\r\nFollowing the logic presented in Example 8.2(A) then, <strong>if the probability of the result under the null hypothesis -- the <em>p<\/em>-value -- is smaller than a pre-selected significance level\u00a0<em>\u03b1<\/em>, the null hypothesis is rejected and the result is considered statistically significant<\/strong>[footnote]Note the difference between <em>\u03b1<\/em> and the <em>p<\/em>-value. While <em>\u03b1<\/em> indicates what probability of being wrong we are willing to tolerate, the actual <em>p<\/em>-value we obtain is <em>not<\/em> the probability of being wrong. The <em>p<\/em>-value, again, is the probability of our result if the null hypothesis were true; in other words, if the null hypothesis is in fact true, and our <em>p<\/em>-value is, say, 0.03, we'd obtain our results 3% of the time simply due to random sampling error.[\/footnote]. 
This is denoted in one of the following ways:\r\n\r\n&nbsp;\r\n\r\np \u2264 0.05\r\n\r\np \u2264 0.01\r\n\r\np \u2264 0.001[footnote]In published research you will find results marked by one asterisk, two asterisks, and three asterisks. These correspond to significance at the level used: <em>\u03b1<\/em>=0.05, <em>\u03b1<\/em>=0.01, and <em>\u03b1<\/em>=0.001, respectively. The smaller the level of significance, the more strongly statistically significant the result is (i.e., most consider\u00a0<em>\u03b1<\/em>=0.001 to indicate \"highly statistically significant\" results). (If you happen upon a dagger (\u2020), it indicates significance at the <em>\u03b1<\/em>=0.1 level, or 10% probability of being wrong, which most researchers consider too high, but some still use.)[\/footnote]\r\n\r\n&nbsp;\r\n\r\n<strong>To summarize, when a hypothesis is tested, we end up with an associated <em>p<\/em>-value (again, the probability of the observed sample statistics if the null hypothesis is true). We compare the <em>p<\/em>-value to the pre-selected significance level\u00a0<em>\u03b1<\/em>: if <em>p<\/em>\u00a0\u2264 <em>\u03b1<\/em>, the results are statistically significant and therefore generalizable to the population.<\/strong>\r\n\r\n&nbsp;\r\n\r\nSo far so good? Good. However, unfortunately this isn't all (sorry!). What I have presented above is the most conventional treatment of how to use and interpret\u00a0<em>p<\/em>-values. It is attractively straightforward -- but it's also arbitrary, and its <em>true<\/em> interpretation is the subject of an ongoing debate. 
As an introduction to the topic, I will leave it at that, but you should be aware that there's more to the <em>p<\/em>-value, and that its usage has been (rightfully) questioned and\/or challenged in recent years.[footnote]You can find plenty of information on the topic online; from journals banning the use of <em>p<\/em>-values and hypothesis testing in favour of effect size (the <em>Journal of Applied and Social Psychology<\/em>, see Trafimow &amp; Marks, 2015 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/01973533.2015.1012991), to calls to abandon statistical significance (e.g., McShane, Gal, Gelman, Robert &amp; Tackett, 2019\u00a0https:\/\/www.tandfonline.com\/doi\/abs\/10.1080\/00031305.2018.1527253), to others calling for the defense of significance testing and <em>p<\/em>-values (e.g., Kuffner &amp; Walker, 2016\u00a0https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2016.1277161?src=recsys; Greenland, 2019 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2018.1529625?src=recsys). One thing is clear: <em>p<\/em>-values and levels of significance have become increasingly controversial. Still, the American Statistical Association's position is that although caution against over-reliance on a single indicator is necessary, <em>p<\/em>-values can still be used, <em>alongside other appropriate methods<\/em>: \"Researchers should recognize that a\u00a0<i>p<\/i>-value without context or other evidence provides limited information. For example, a\u00a0<i>p<\/i>-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large\u00a0<i>p<\/i>-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a\u00a0<i>p<\/i>-value when other approaches are appropriate and feasible\" (Wasserstein &amp; Lazar, 2016, https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2016.1154108?src=recsys). Finally, if you really want to avoid overstating what the <em>p<\/em>-value actually shows, see Greenland et al. (2016) for a list of common misinterpretations and over-interpretations of the <em>p<\/em>-value, confidence intervals, and tests of significance (here: <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4877414\/\">https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4877414\/<\/a>). Because of its enormity, the topic goes well beyond the scope of this book, and it is still conventionally taught as I presented it above, at least at the introductory level.[\/footnote].\r\n\r\n&nbsp;\r\n\r\nGoing back to our example from the previous section, let's see how the <em>p<\/em>-values can change due to particular features of the study, like the sample size. Example 8.2(B) illustrates.\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em>Example 8.2(B) Employee Productivity (Finding Statistically Non-significant Results, N=25)<\/em><\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n&nbsp;\r\n\r\nImagine that we had the same information as in Example 8.2(A), except that 25 employees took the training course instead of 100 and their average score was 620. 
Then we have:\r\n\r\n&nbsp;\r\n\r\n$\\mu=600$\r\n\r\n$\\sigma=100$\r\n\r\n$\\overline{x}=620$\r\n\r\n$N=25$\r\n\r\n&nbsp;\r\n\r\nWe still want to know the probability of a score of 620 if the training course didn't contribute to the gain, i.e., <strong>the probability of a score of 620 <em>under the condition of the null hypothesis<\/em>.<\/strong>\r\n<ul>\r\n \t<li>H<sub>0<\/sub>: The training course did not affect productivity (the 620 score was due to random chance); $\\mu_\\overline{x}$ $=\\mu$.<\/li>\r\n \t<li>H<sub>a<\/sub>: The training course affected productivity (the 620 score was a true gain);\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0$\\mu_\\overline{x}$ $\\neq\\mu$.<\/li>\r\n<\/ul>\r\nThe new standard error is:\r\n\r\n&nbsp;\r\n\r\n$\\sigma_\\overline{x}$ $=\\frac{\\sigma}{\\sqrt{N}}=\\frac{100}{\\sqrt{25}}=\\frac{100}{5}=20$\r\n\r\n&nbsp;\r\n\r\nThen the <em>z<\/em>-value of 620 is:\r\n\r\n&nbsp;\r\n\r\n$z=\\frac{\\overline{x}-\\mu}{\\sigma_\\overline{x}}$ $=\\frac{620-600}{20}=\\frac{20}{20}=1$\r\n\r\n&nbsp;\r\n<p style=\"padding-left: 30px\">Given the properties of the normal curve, we know that 68% of all means in infinite sampling will fall between\u00a0\u00b11 standard error (i.e., between 580 and 620), 95% will fall between\u00a0\u00b11.96 standard errors (i.e., approximately between 560 and 640), and 99% will fall between\u00a0\u00b12.58 standard errors (i.e., approximately between 540 and 660). The score of 620 has\u00a0$z=1$ -- it falls quite close to the not-trained group's mean of 600.<\/p>\r\n&nbsp;\r\n\r\nIn terms of probabilities, consider the following: <em>z<\/em>=1 has <em>p<\/em>&gt;0.30.\u00a0<strong>Assuming the null hypothesis is true, our calculations show that the 620 score will appear more than 30% of the time due to random chance, which is a lot more than the 5% (at<em>\u00a0\u03b1<\/em>=0.05) that we are willing to tolerate. 
As such, we cannot reject the null hypothesis: we do not have enough evidence to conclude that the gain in productivity of 20 points which the 25 employees demonstrated is statistically significant. In other words, we don't have enough evidence that the training course was effective.<\/strong> (This doesn't mean that the course didn't work beyond a shadow of a doubt, just that<em> at this point in this particular study we don't have enough evidence to say it did<\/em>.)\r\n\r\n&nbsp;\r\n\r\nWe can also see the correspondence with confidence intervals:\r\n<ul>\r\n \t<li>95% CI:\u00a0$\\overline{x}\\pm1.96\\times\\sigma_\\overline{x}$ $= 620\\pm1.96\\times20=620\\pm39.2=(580.8; 659.2)$<\/li>\r\n<\/ul>\r\nThat is, we can\u00a0be 95% certain that the average score for the population of employees who take the training course would be between roughly 581 points and 659 points. <strong>The average general score of 600 points is a plausible value for\u00a0$\\mu_\\overline{x}$, which is consistent with our decision to not reject the null hypothesis.<\/strong>\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\n<\/div>\r\n&nbsp;\r\n\r\nAgain, Example 8.2 is a heuristic device, used only to explain the logic of hypothesis testing. Of course, normally we wouldn't have information about population parameters and we would be using sample statistics (i.e., we would use not only the sample mean $\\overline{x}$ but also the sample standard deviation <em>s<\/em>, to calculate the estimated standard error $s_\\overline{x}$). (Not to mention that we would have two different standard deviations, one for the trained group and one for the not-trained group of employees.) As you learned in the previous chapter, this moves us from using the <em>z<\/em>-distribution to the <em>t<\/em>-distribution with given degrees of freedom. 
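The arithmetic behind Example 8.2(B) -- the standard error, the z-value, and the corresponding two-tailed p-value -- can be sketched in a few lines. This is an illustration added here, not part of the original example; it uses only Python's standard library, with the normal CDF built from the error function:

```python
from math import sqrt, erf

def norm_cdf(z):
    # Standard normal cumulative distribution function, via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, x_bar, n = 600, 100, 620, 25       # values from Example 8.2(B)

se = sigma / sqrt(n)                          # standard error: 100 / 5 = 20
z = (x_bar - mu) / se                         # z = 20 / 20 = 1
p = 2 * (1 - norm_cdf(abs(z)))                # two-tailed p-value, about 0.317

print(f"SE = {se}, z = {z}, p = {p:.3f}")
```

The resulting p of roughly 0.317 matches the text's "p > 0.30": far above α = 0.05, so the null hypothesis is not rejected.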
Recall that with a sample size of about 100 -- i.e., with <em>df<\/em> around 100 -- the two distributions converge.\r\n\r\n&nbsp;\r\n\r\nHere then is a quick-and-dirty method you can use as a preliminary indication of whether something will be statistically significant. Since <em>z<\/em>=1.96 corresponds to 5% probability (2.5% in each tail), and <em>z<\/em>=2.58 corresponds to 1% probability (0.5% in each tail), even without knowing the exact <em>p<\/em>-value associated with a given <em>z<\/em>-value, you can guess that getting a <em>z<\/em>&lt;1.96 will be non-significant while a <em>z<\/em>&gt;1.96 will be significant at\u00a0<em>\u03b1<\/em>=0.05; similarly, getting a <em>z<\/em>&gt;2.58 will be statistically significant at\u00a0<em>\u03b1<\/em>=0.01[footnote]Obviously, for negative <em>z<\/em>-values we'll have all these in reverse: a <em>z<\/em>&gt;-1.96 will be non-significant and a <em>z<\/em>&lt;-1.96 will be significant, etc.[\/footnote]. As samples used in sociological research are commonly of <em>N<\/em>&gt;100, the same insight applies to the corresponding <em>t<\/em>-values with <em>df<\/em>\u2265100. Understand, however, that this is not an official way to test hypotheses or report findings: to do that, <strong>you always need to report the <em>exact p<\/em>-value associated with a <em>z<\/em>-value or a <em>t<\/em>-value with given <em>df<\/em><\/strong>[footnote]You can find a handy online <em>p<\/em>-value calculator for <em>t<\/em>-values here:<a href=\"https:\/\/goodcalculators.com\/student-t-value-calculator\/\"> https:\/\/goodcalculators.com\/student-t-value-calculator\/<\/a>.[\/footnote].\r\n\r\n&nbsp;\r\n\r\n<strong>One-tailed tests.<\/strong> Finally, a note on <em>one-tailed tests<\/em>. While I advise you, at the beginner researcher level, against using them yourself, it is not a bad idea to know that they exist and what they are. 
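Short of using an online calculator, the exact two-tailed p-value for a given z can also be computed directly. Here is a minimal sketch (my illustration, using only Python's standard library) that reproduces the quick-and-dirty thresholds above:

```python
from math import sqrt, erf

def p_two_tailed(z):
    """Exact two-tailed p-value for a z-score under the standard normal curve."""
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # P(Z <= |z|)
    return 2 * (1 - phi)

print(round(p_two_tailed(1.96), 3))   # 0.05  -> significant at alpha = 0.05
print(round(p_two_tailed(2.58), 3))   # 0.01  -> significant at alpha = 0.01
print(round(p_two_tailed(1.00), 3))   # 0.317 -> not significant
```

For t-values with moderate df a t-distribution would be needed instead, which is why the text points to a dedicated calculator; with df ≥ 100 the normal approximation above is close enough for a preliminary check.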
Briefly, the idea is that if we have a good reason to suspect not only a difference\/effect but a difference\/effect with a specific direction (i.e., positive or negative), we can specify the hypotheses accordingly. To use Example 8.2(A) again, say we think there is no possibility that the training course <em>decreased<\/em> productivity scores. Then we can state the hypotheses as:\r\n<ul>\r\n \t<li>H<sub>0<\/sub>: The training course either did not affect productivity or <em>decreased<\/em> it; $\\mu_\\overline{x}$ \u2264$\\mu$.<\/li>\r\n \t<li>H<sub>a<\/sub>: The training course <em>increased<\/em> productivity;\u00a0 $\\mu_\\overline{x}$ &gt;$\\mu$.<\/li>\r\n<\/ul>\r\n&nbsp;\r\n\r\nThis is a stronger claim (that's why it needs to be well-justified) -- we test not a difference (which can be either positive or negative) but an <em>increase<\/em>. Thus, we move the significance level to only <em>one<\/em> of the tails, as it were, the positive (right) tail, so that instead of 2.5%, the full 5% is there.\r\n\r\n&nbsp;\r\n\r\nThis change in probability essentially \"moves\" the <em>z<\/em>-value corresponding to significance closer to the mean; now a smaller <em>z<\/em>-value will have the <em>p<\/em>-value necessary to achieve statistical significance. To be precise, 5% (2.5% in each tail) corresponded to <em>z<\/em>=1.96; all 5% in the <em>right<\/em> tail corresponds to <em>z<\/em>=1.65[footnote]You can check it here by selecting \"up to Z\": <a href=\"https:\/\/www.mathsisfun.com\/data\/standard-normal-distribution-table.html\">https:\/\/www.mathsisfun.com\/data\/standard-normal-distribution-table.html<\/a>.[\/footnote]. 
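The difference between the two tests can be seen numerically: at z = 1.65 the one-tailed p-value just clears α = 0.05, while the two-tailed p-value does not. A sketch, added here for illustration using Python's standard library:

```python
from math import sqrt, erf

def norm_cdf(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 1.65
p_one = 1 - norm_cdf(z)             # all of the rejection region in the right tail
p_two = 2 * (1 - norm_cdf(abs(z)))  # rejection region split across both tails

print(f"one-tailed p = {p_one:.3f}")   # about 0.049 -> significant at alpha = 0.05
print(f"two-tailed p = {p_two:.3f}")   # about 0.099 -> not significant
```

The same z-value is significant one-tailed but not two-tailed, which is exactly the "lowered bar" discussed next.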
This obviously \"lowers the bar\" for achieving statistical significance <em>without changing the level of significance\u00a0\u03b1\u00a0itself<\/em>, and makes rejecting the null hypothesis easier, hence my description of the two-tailed test as more conservative (and my insistence on using it instead of a one-tailed test).\r\n\r\n&nbsp;\r\n\r\nBefore we move on to the last section of this theoretical chapter, here is the promised warning about the meanings of the term <em>significance<\/em>.\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--learning-objectives\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch out!! #15\u00a0<\/strong><\/span>... for Mistaking Statistical Significance for Magnitude or Importance<\/em><\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n&nbsp;\r\n\r\nIf you have been paying attention, you have learned by now that statistical significance has a very narrow meaning. To have a statistically significant result simply means that the probability of observing our sample statistics (or difference, or effect, etc.) 
as they are, given that the null hypothesis is true, is small enough to be (highly) unusual, to be so relatively rare as to indicate that what we have is not a result of random sampling variation but of an untrue null hypothesis.\r\n\r\n&nbsp;\r\n\r\nNone of this says <em>anything<\/em> about how <em>big<\/em> a difference\/effect is -- in fact it can be quite small, and still <em>statistically<\/em> significant, given a large enough sample size and other study specifications[footnote]This is actually one of the reasons some have called for abandoning <em>p<\/em>-values, statistical significance, and hypothesis testing altogether: statistical significance is not indicative of effect size and is frequently over-stated to mean more than it does; at the same time, over-reliance on <em>p<\/em>-values decreases attention to effect size, careful study design, context, etc.[\/footnote].\r\n\r\n&nbsp;\r\n\r\nSimilarly, many people unfamiliar with statistics take statistical significance to mean that the findings are of significant <em>importance<\/em>. Again, nothing about statistical significance confers great meaning on, or implies the importance of, statistically significant findings. One can study an objectively trivial\/unimportant issue and have statistically significant findings of no relevance to anyone whatsoever.\r\n\r\n&nbsp;\r\n\r\nTo conclude, keep these distinctions in mind -- between the conventional usage of the word <em>significant<\/em> (meaning either important or big) and <em>statistical<\/em> significance -- both when interpreting and reporting results and when reading and evaluating existing research.\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\n<\/div>\r\n&nbsp;\r\n\r\nWhen testing hypotheses, I defined the significance level as a sort of probability of being wrong that we are willing to tolerate. 
This implies that there is a chance of making an\u00a0<i>erroneous\u00a0<\/i>decision about the null hypothesis (whether to reject it or not). The next and final section deals with just that.\r\n\r\n&nbsp;","rendered":"<p>The concept\u00a0<em>level of significance<\/em> is used to adjudicate whether the probability (of our results if the null hypothesis is true) is too high to dismiss the null hypothesis or low enough to allow us to reject the null hypothesis. In other words, the level of significance is what we use to proclaim results as statistically significant (when we reject the null hypothesis) or not statistically significant (when we fail to reject the null hypothesis).<\/p>\n<p>&nbsp;<\/p>\n<p>Think about it this way: recall that with confidence intervals we had selected 95% certainty and 99% certainty as meaningful levels of confidence. What is left is 5% and 1% &#8220;uncertainty&#8221;, as it were, which we agree to tolerate. These 5% or 1% are distributed equally between the two tails of the normal distribution (2.5% on each side or 0.5% on each side, respectively). They also correspond to <em>z<\/em>=1.96 and <em>z<\/em>=2.58. Following the logic of Example 8.2 (A) from the previous section, in order to reject a null hypothesis, we want the probability to be lower than these 5% or 1% (so that we can &#8220;feel confident enough&#8221;).<\/p>\n<p>&nbsp;<\/p>\n<p>And this is exactly it: When we put it that way, saying that we want the probability (of our results if the null hypothesis is true) &#8212; called a <em>p-value<\/em> &#8212; to be less than 5%, we have essentially set the level of significance at 0.05. If we want the probability to be less than 1%, we have set the level of significance at 0.01. 
We can go even further: we might want to be extra cautious and want\u00a0a &#8220;confidence&#8221; of 99.9%, so that we want the probability to be less than 0.1% &#8212; then we have set the level of significance at 0.001.<\/p>\n<p>&nbsp;<\/p>\n<p>These three numbers &#8212; 0.05, 0.01, and 0.001 &#8212; are the most commonly used levels of significance. The level of significance is denoted by the lowercase Greek letter alpha,\u00a0<em>\u03b1<\/em>; thus we usually choose one of the following:<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-1e7a70605f68e89ed47b3cb961fcd722_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#48;&#53;\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"66\" style=\"vertical-align: 0px;\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-37cbad7dd6cbfce0029e7398590b30b4_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#48;&#49;\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"66\" style=\"vertical-align: -1px;\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-d150d697c2ad6e17f2f441211262cb10_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#48;&#48;&#49;\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"75\" style=\"vertical-align: -1px;\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>You can think of the significance level as the acceptable probability of being 
wrong<\/strong> &#8212; and what is acceptable is left to the discretion of the researcher, subject to the purposes of the particular study.<\/p>\n<p>&nbsp;<\/p>\n<p>Following the logic presented in Example 8.2(A) then, <strong>if the probability of the result under the null hypothesis &#8212; the p-value &#8212; is smaller than a pre-selected significance level\u00a0<em>\u03b1<\/em>, the null hypothesis is rejected and the result is considered statistically significant<\/strong><a class=\"footnote\" title=\"Note the difference between \u03b1 and the p-value. While \u03b1 indicates what probability of being wrong we are willing to tolerate, the actual p-value we obtain is not the probability of being wrong. The p-value, again, is the probability of our result if the null hypothesis were true; in other words, if the null hypothesis is in fact true, and our p-value is, say, 0.03, we'd obtain our results 3% of the time simply due to random sampling error.\" id=\"return-footnote-2083-1\" href=\"#footnote-2083-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a>. This is denoted in one of the following ways:<\/p>\n<p>&nbsp;<\/p>\n<p>p \u2264 0.05<\/p>\n<p>p \u2264 0.01<\/p>\n<p>p \u2264 0.001<a class=\"footnote\" title=\"In published research you will find results marked by one asterisk, two asterisks, and three asterisks. These correspond to their significance based on the level used: \u03b1=0.05, \u03b1=0.01, and \u03b1=0.001, respectively. The smaller the level of significance, the more strongly statistically significant the result is (i.e., most consider\u00a0\u03b1=0.001 to indicate &quot;highly statistically significant&quot; results). 
(If you happen upon a dagger (\u2020), it indicates significance at the \u03b1=0.1 level, or 10% probability of being wrong, which most researchers consider too high, but some still use.)\" id=\"return-footnote-2083-2\" href=\"#footnote-2083-2\" aria-label=\"Footnote 2\"><sup class=\"footnote\">[2]<\/sup><\/a><\/p>\n<p>&nbsp;<\/p>\n<p><strong>To summarize, when a hypothesis is tested, we end up with an associated <em>p<\/em>-value (again, the probability of the observed sample statistics if the null hypothesis is true). We compare the <em>p<\/em>-value to the pre-selected significance level\u00a0<em>\u03b1<\/em>: if <em>p<\/em>\u00a0\u2264 <em>\u03b1<\/em>, the results are statistically significant and therefore generalizable to the population.<\/strong><\/p>\n<p>&nbsp;<\/p>\n<p>So far so good? Good. However, unfortunately this isn&#8217;t all (sorry!). What I have presented above is the most conventional treatment of how to use and interpret\u00a0<em>p<\/em>-values. It is attractively straightforward &#8212; but it&#8217;s also arbitrary, and its <em>true<\/em> interpretation is the subject of an ongoing debate. 
As an introduction to the topic, I will leave it at that, but you should be aware that there&#8217;s more to the <em>p<\/em>-value, and that its usage has been (rightfully) questioned and\/or challenged in recent years.<a class=\"footnote\" title=\"You can find plenty of information on the topic online; from journals banning the use of p-values and hypothesis testing in favour of effect size (the Journal of Applied and Social Psychology, see Trafimow &amp; Marks, 2015 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/01973533.2015.1012991), to calls to abandon statistical significance (e.g., McShane, Gal, Gelman, Robert &amp; Tackett, 2019\u00a0https:\/\/www.tandfonline.com\/doi\/abs\/10.1080\/00031305.2018.1527253), to others calling for the defense of significance testing and p-values (e.g., Kuffner &amp; Walker, 2016\u00a0https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2016.1277161?src=recsys; Greenland, 2019 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2018.1529625?src=recsys). One thing is clear: p-values and levels of significance have become increasingly controversial. Still, the American Statistical Association's position is that although caution against over-reliance on a single indicator is necessary, p-values can still be used, alongside other appropriate methods: &quot;Researchers should recognize that a\u00a0p-value without context or other evidence provides limited information. For example, a\u00a0p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large\u00a0p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a\u00a0p-value when other approaches are appropriate and feasible&quot; (Wasserstein &amp; Lazar, 2016, https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2016.1154108?src=recsys). 
Finally, if you really want to avoid overstating what the p-value actually shows, see Greenland et al. (2016) for a list of common misinterpretations and over-interpretations of the p-value, confidence intervals, and tests of significance (here: https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4877414\/). Because of its enormity, the topic goes well beyond the scope of this book, and it is still conventionally taught as I presented it above, at least at the introductory level.\" id=\"return-footnote-2083-3\" href=\"#footnote-2083-3\" aria-label=\"Footnote 3\"><sup class=\"footnote\">[3]<\/sup><\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>Going back to our example from the previous section, let&#8217;s see how the <em>p<\/em>-values can change due to particular features of the study, like the sample size. Example 8.2(B) illustrates.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\"><em>Example 8.2(B) Employee Productivity (Finding Statistically Non-significant Results, N=25)<\/em><\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>&nbsp;<\/p>\n<p>Imagine that we had the same information as in Example 8.2(A), except that 25 employees took the training course instead of 100 and their average score was 620. 
Then we have:<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-11ccc1d2c4d7f975394af3e644e85052_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#109;&#117;&#61;&#54;&#48;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"61\" style=\"vertical-align: -4px;\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-e2f06951af1f90d1edbde95de8b751ef_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#115;&#105;&#103;&#109;&#97;&#61;&#49;&#48;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"61\" style=\"vertical-align: -1px;\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-ab8396793a7ab1701b6a688108f6aa1e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;&#61;&#54;&#50;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"61\" style=\"vertical-align: 0px;\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-efc717d25315d8d4a38ea9ee05ae34a1_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#78;&#61;&#50;&#53;\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"57\" style=\"vertical-align: 0px;\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>We still want to know the probability of a score of 620 if the training course didn&#8217;t contribute to the gain, i.e., <strong>the probability of a score of 620 <em>under the condition of the null hypothesis<\/em>.<\/strong><\/p>\n<ul>\n<li>H<sub>0<\/sub>: The training course did not affect productivity (the 620 score was due to random chance); <img 
loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-b0f6659031b03d0225ccaadcac32d125_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#109;&#117;&#95;&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"20\" style=\"vertical-align: -4px;\" \/> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-1bc592cc22578fa3843a56b786c46152_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#61;&#92;&#109;&#117;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"30\" style=\"vertical-align: -4px;\" \/>.<\/li>\n<li>H<sub>a<\/sub>: The training course affected productivity (the 620 score was a true gain);\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-b0f6659031b03d0225ccaadcac32d125_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#109;&#117;&#95;&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"20\" style=\"vertical-align: -4px;\" \/> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-e6cd1cd37530fb5f35b6b8a05c8dd07d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#110;&#101;&#113;&#92;&#109;&#117;\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"30\" style=\"vertical-align: -4px;\" \/>.<\/li>\n<\/ul>\n<p>The new standard error is:<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-de5382c00a55332dd89774492d104d0c_l3.png\" 
class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#115;&#105;&#103;&#109;&#97;&#95;&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"19\" style=\"vertical-align: -3px;\" \/> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-0ab9e22061a344137920b3c4e17914c9_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#92;&#115;&#105;&#103;&#109;&#97;&#125;&#123;&#92;&#115;&#113;&#114;&#116;&#123;&#78;&#125;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#48;&#48;&#125;&#123;&#92;&#115;&#113;&#114;&#116;&#123;&#50;&#53;&#125;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#48;&#48;&#125;&#123;&#53;&#125;&#61;&#50;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"27\" width=\"189\" style=\"vertical-align: -11px;\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Then the <em>z<\/em>-value of 620 is:<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-d7b246c692b62bc4c366d4720b038763_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#122;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;&#45;&#92;&#109;&#117;&#125;&#123;&#92;&#115;&#105;&#103;&#109;&#97;&#95;&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"24\" width=\"62\" style=\"vertical-align: -8px;\" \/> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-3d063a288719d54c1d09823c9c8b6f50_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" 
alt=\"&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#54;&#50;&#48;&#45;&#54;&#48;&#48;&#125;&#123;&#50;&#48;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#50;&#48;&#125;&#123;&#50;&#48;&#125;&#61;&#49;\" title=\"Rendered by QuickLaTeX.com\" height=\"22\" width=\"148\" style=\"vertical-align: -6px;\" \/><\/p>\n<p>&nbsp;<\/p>\n<p style=\"padding-left: 30px\">Given the properties of the normal curve, we know that 68% of all means in infinite sampling will fall between\u00a0\u00b11 standard error (i.e, between 580 and 620), 95% will fall between\u00a0\u00b11.96 standard errors (i.e., approximately between 560 and 640), and 99% will fall between\u00a0\u00b12.58 standard errors (i.e., approximately between 540 and 660). The score of 620 has\u00a0<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-74790055f00b0c3de5373bd351d017fd_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#122;&#61;&#49;\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"41\" style=\"vertical-align: -1px;\" \/> &#8212; it falls quite close to the not-trained group&#8217;s mean of 600.<\/p>\n<p>&nbsp;<\/p>\n<p>In terms of probabilities, consider the following: z=1 has a <em>p&gt;<\/em>0.30.\u00a0<strong>Assuming the null hypothesis is true, our calculations show that the 620 score will appear more than 30% of the time due to random chance, which is a lot more than the 5% (at<em>\u00a0\u03b1<\/em>=0.05) that we are willing to tolerate. As such, we cannot reject the null hypothesis: we do not have enough evidence to conclude that the gain in productivity of 20 points which the 25 employees demonstrated is statistically significant. 
In other words, we don&#8217;t have enough evidence that the training course was effective.<\/strong> (This doesn&#8217;t mean the course was ineffective beyond a shadow of a doubt, just that <em>at this point, in this particular study, we don&#8217;t have enough evidence to say it was effective<\/em>.)<\/p>\n<p>&nbsp;<\/p>\n<p>We can also see the correspondence with confidence intervals:<\/p>\n<ul>\n<li>95% CI: $\overline{x}\pm1.96\times\sigma_\overline{x}=620\pm1.96\times20=620\pm39.2=(580.8; 659.2)$<\/li>\n<\/ul>\n<p>That is, we can be 95% certain that the average score for the population of employees who take the training course would be between roughly 581 points and 659 points. 
<strong>The average general score of 600 points is a plausible value for $\mu_\overline{x}$, which is consistent with our decision not to reject the null hypothesis.<\/strong><\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Again, Example 8.2 is a heuristic device, used only to explain the logic of hypothesis testing. Of course, normally we wouldn&#8217;t have information about the population parameters and would be using sample statistics (i.e., we would use not only the sample mean $\overline{x}$ but also the sample standard deviation <em>s<\/em> to calculate the estimated standard error of the sampling distribution, $s_\overline{x}$). (Not to mention that we would have two different standard deviations, one for the trained group and one for the not-trained group of employees.) 
As you learned in the previous chapter, this moves us from using the <em>z<\/em>-distribution to the <em>t<\/em>-distribution with the given degrees of freedom. Recall that with a sample size of about 100 (i.e., with <em>df<\/em> around 100), the two distributions converge.<\/p>\n<p>&nbsp;<\/p>\n<p>Here, then, is a quick-and-dirty method you can use as a preliminary indication of whether something will be statistically significant. Since <em>z<\/em>=1.96 corresponds to 5% probability (2.5% in each tail), and <em>z<\/em>=2.58 corresponds to 1% probability (0.5% in each tail), even without knowing the exact <em>p<\/em>-value associated with a given <em>z<\/em>-value, you can guess that a <em>z<\/em>&lt;1.96 will be non-significant while a <em>z<\/em>&gt;1.96 will be significant at <em>α<\/em>=0.05; similarly, a <em>z<\/em>&gt;2.58 will be statistically significant at <em>α<\/em>=0.01<a class=\"footnote\" title=\"Obviously, for negative z-values we'll have all these in reverse: -z&gt;-1.96 will be non-significant and -z&lt;-1.96 will be significant, etc.\" id=\"return-footnote-2083-4\" href=\"#footnote-2083-4\" aria-label=\"Footnote 4\"><sup class=\"footnote\">[4]<\/sup><\/a>. As samples used in sociological research commonly have <em>N<\/em>&gt;100, the same insight applies to the corresponding <em>t<\/em>-values with <em>df<\/em>≥100. 
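The quick-and-dirty check and the exact p-value can both be reproduced numerically. Below is a minimal sketch in Python (an illustration added for this discussion, not part of the textbook's own materials) using only the standard library; it recomputes Example 8.2(A): the standard error, the z-value, the 95% confidence interval, and the exact two-tailed p-value from the standard normal CDF.

```python
from math import sqrt, erf

# Figures from Example 8.2(A)
mu, sigma, x_bar, n = 600, 100, 620, 25

se = sigma / sqrt(n)                  # standard error: 100/5 = 20
z = (x_bar - mu) / se                 # z = (620 - 600)/20 = 1.0

# Quick-and-dirty check against the two-tailed critical values
sig_05 = abs(z) > 1.96                # False: not significant at alpha = 0.05
sig_01 = abs(z) > 2.58                # False: not significant at alpha = 0.01

# 95% confidence interval around the sample mean
ci_95 = (x_bar - 1.96 * se, x_bar + 1.96 * se)   # (580.8, 659.2)

# Exact two-tailed p-value from the standard normal CDF
phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))          # P(Z <= |z|)
p_value = 2 * (1 - phi)                          # about 0.317, i.e. p > 0.30

print(se, z, sig_05, sig_01, ci_95, round(p_value, 3))
```

Since the exact p-value (about 0.317) is far above α=0.05, the computation reaches the same conclusion as the hand calculation: we fail to reject the null hypothesis.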
Understand, however, that this is not an official way to test hypotheses or report findings: to do that, <strong>you always need to report the <em>exact p<\/em>-value associated with a <em>z<\/em>-value or a <em>t<\/em>-value with given <em>df<\/em><\/strong><a class=\"footnote\" title=\"You can find a handy online p-value calculator of t-values here: https:\/\/goodcalculators.com\/student-t-value-calculator\/.\" id=\"return-footnote-2083-5\" href=\"#footnote-2083-5\" aria-label=\"Footnote 5\"><sup class=\"footnote\">[5]<\/sup><\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>One-tailed tests.<\/strong> Finally, a note on <em>one-tailed tests<\/em>. While I advise you, as a beginning researcher, against using them yourself, it is not a bad idea to know that they exist and what they are. Briefly, the idea is that if we have a good reason to suspect not only a difference\/effect but a difference\/effect with a specific direction (i.e., positive or negative), we can specify the hypotheses accordingly. To use Example 8.2(A) again: say we think there is no possibility that the training course <em>decreased<\/em> productivity scores. 
Then we can state the hypotheses as:<\/p>\n<ul>\n<li>H<sub>0<\/sub>: The training course either did not affect productivity or <em>decreased<\/em> it; $\mu_\overline{x}\leq\mu$.<\/li>\n<li>H<sub>a<\/sub>: The training course <em>increased<\/em> productivity; $\mu_\overline{x}>\mu$.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p>This is a stronger claim (that&#8217;s why it needs to be well-justified) &#8212; we test not a difference (that can be either positive or 
negative) but an <em>increase<\/em>. Thus, we move the significance level to only <em>one<\/em> of the tails, the positive (right) tail, so that it contains the full 5% instead of 2.5%.<\/p>\n<p>&nbsp;<\/p>\n<p>This change in probability essentially &#8220;moves&#8221; the <em>z<\/em>-value corresponding to significance closer to the mean; now a smaller <em>z<\/em>-value will have the <em>p<\/em>-value necessary to achieve statistical significance. To be precise, 5% split between the two tails (2.5% in each) corresponded to <em>z<\/em>=1.96; all 5% in the <em>right<\/em> tail corresponds to <em>z<\/em>=1.65<a class=\"footnote\" title=\"You can check it here by selecting &quot;up to Z&quot;: https:\/\/www.mathsisfun.com\/data\/standard-normal-distribution-table.html.\" id=\"return-footnote-2083-6\" href=\"#footnote-2083-6\" aria-label=\"Footnote 6\"><sup class=\"footnote\">[6]<\/sup><\/a>. This obviously &#8220;lowers the bar&#8221; for achieving statistical significance <em>without changing the level of significance α itself<\/em>, and makes rejecting the null hypothesis easier, hence my description of the two-tailed test as more conservative (and my insistence on using it instead of a one-tailed test).<\/p>\n<p>&nbsp;<\/p>\n<p>Before we move on to the last section of this theoretical chapter, here is the promised warning about the meanings of the term <em>significance<\/em>.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--learning-objectives\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch out!! #15 <\/strong><\/span>&#8230; for Mistaking Statistical Significance for Magnitude or Importance<\/em><\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>&nbsp;<\/p>\n<p>If you have been paying attention, you have learned by now that statistical significance has a very narrow meaning. 
To have a statistically significant result simply means that the probability of observing our sample statistics (or difference, or effect, etc.) as they are, given that the null hypothesis is true, is small enough to be highly unusual, so relatively rare as to indicate that what we have is not a result of random sampling variation but of an untrue null hypothesis.<\/p>\n<p>&nbsp;<\/p>\n<p>None of this says <em>anything<\/em> about how <em>big<\/em> a difference\/effect is &#8212; in fact, it can be quite small and still be <em>statistically<\/em> significant, given a large enough sample size and other study specifications<a class=\"footnote\" title=\"This is actually one of the reasons some have called for abandoning p-values, statistical significance, and hypothesis testing whatsoever, because statistical significance is not indicative of effect size and is frequently over-stated to mean more than it does; at the same time over-reliance on p-values decreases attention to effect size, careful study design, context, etc.\" id=\"return-footnote-2083-7\" href=\"#footnote-2083-7\" aria-label=\"Footnote 7\"><sup class=\"footnote\">[7]<\/sup><\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>Similarly, many people unfamiliar with statistics take statistical significance to mean that the findings are of significant <em>importance<\/em>. Again, nothing about statistical significance confers great meaning to or implies importance of statistically significant findings. 
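The sample-size point can be made concrete with a short illustrative computation (the numbers below are hypothetical, chosen only to demonstrate the principle; they are not from Example 8.2): the very same trivial 2-point difference is nowhere near statistically significant with a small sample, yet becomes "highly significant" with a huge one.

```python
from math import sqrt, erf

def two_tailed_p(x_bar, mu, sigma, n):
    """Two-tailed p-value for a one-sample z-test."""
    z = (x_bar - mu) / (sigma / sqrt(n))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# A trivially small 2-point difference (hypothetical figures, sigma = 100):
p_small = two_tailed_p(602, 600, 100, 25)      # n = 25: p is about 0.92
p_huge = two_tailed_p(602, 600, 100, 40000)    # n = 40,000: p is about 0.00006
print(round(p_small, 2), p_huge < 0.001)
```

The effect is identical in both cases; only the sample size changed. This is exactly why a small p-value by itself says nothing about the magnitude or importance of a difference.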
One can study an objectively trivial\/unimportant issue and have statistically significant findings of no relevance to anyone whatsoever.<\/p>\n<p>&nbsp;<\/p>\n<p>To conclude, keep these distinctions in mind &#8212; between the conventional usage of the word <em>significant<\/em> (meaning either important or big) and <em>statistical<\/em> significance &#8212; both when interpreting and reporting results and when reading and evaluating existing research.<\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>When testing hypotheses, I defined the significance level as, in a sense, the probability of being wrong that we are willing to tolerate. This implies that there is some likelihood of making an <i>erroneous<\/i> decision about the null hypothesis (whether to reject it or not). The next and final section deals with just that.<\/p>\n<p>&nbsp;<\/p>\n<hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-2083-1\">Note the difference between <em>α<\/em> and the <em>p<\/em>-value. While <em>α<\/em> indicates what probability of being wrong we are willing to tolerate, the actual <em>p<\/em>-value we obtain is <em>not<\/em> the probability of being wrong. The <em>p<\/em>-value, again, is the probability of our result if the null hypothesis were true; in other words, if the null hypothesis is in fact true and our <em>p<\/em>-value is, say, 0.03, we'd obtain our results 3% of the time simply due to random sampling error. <a href=\"#return-footnote-2083-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><li id=\"footnote-2083-2\">In published research you will find results marked by one asterisk, two asterisks, or three asterisks. These correspond to their significance based on the level used: <em>α<\/em>=0.05, <em>α<\/em>=0.01, and <em>α<\/em>=0.001, respectively. 
The smaller the level of significance, the more strongly statistically significant the result is (i.e., most consider <em>α<\/em>=0.001 to indicate \"highly statistically significant\" results). (If you happen upon a dagger (†), it indicates significance at the <em>α<\/em>=0.1 level, or a 10% probability of being wrong, which most researchers consider too high but some still use.) <a href=\"#return-footnote-2083-2\" class=\"return-footnote\" aria-label=\"Return to footnote 2\">&crarr;<\/a><\/li><li id=\"footnote-2083-3\">You can find plenty of information on the topic online; from journals banning the use of <em>p<\/em>-values and hypothesis testing in favour of effect size (<em>Basic and Applied Social Psychology<\/em>, see Trafimow &amp; Marks, 2015 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/01973533.2015.1012991), to calls to abandon statistical significance (e.g., McShane, Gal, Gelman, Robert &amp; Tackett, 2019 https:\/\/www.tandfonline.com\/doi\/abs\/10.1080\/00031305.2018.1527253), to others defending it and <em>p<\/em>-values (e.g., Kuffner &amp; Walker, 2016 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2016.1277161?src=recsys; Greenland, 2019 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2018.1529625?src=recsys). One thing is clear: <em>p<\/em>-values and levels of significance have become increasingly controversial. Still, the American Statistical Association's position is that although caution against over-reliance on a single indicator is necessary, <em>p<\/em>-values can still be used, <em>alongside other appropriate methods<\/em>: \"Researchers should recognize that a <i>p<\/i>-value without context or other evidence provides limited information. For example, a <i>p<\/i>-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large <i>p<\/i>-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a <i>p<\/i>-value when other approaches are appropriate and feasible\" (Wasserstein &amp; Lazar, 2016 https:\/\/www.tandfonline.com\/doi\/full\/10.1080\/00031305.2016.1154108?src=recsys). Finally, if you really want not to overstate what the <em>p<\/em>-value actually shows, see Greenland et al. (2016) for a list of common misinterpretations and over-interpretations of the <em>p<\/em>-value, of confidence intervals, and of tests of significance (here: <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4877414\/\">https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4877414\/<\/a>). Because the topic goes way beyond the scope of this book, it is still conventionally taught as I presented it above, at least at the introductory level. <a href=\"#return-footnote-2083-3\" class=\"return-footnote\" aria-label=\"Return to footnote 3\">&crarr;<\/a><\/li><li id=\"footnote-2083-4\">Obviously, for negative <em>z<\/em>-values we'll have all these in reverse: -z&gt;-1.96 will be non-significant and -z&lt;-1.96 will be significant, etc. <a href=\"#return-footnote-2083-4\" class=\"return-footnote\" aria-label=\"Return to footnote 4\">&crarr;<\/a><\/li><li id=\"footnote-2083-5\">You can find a handy online <em>p<\/em>-value calculator for <em>t<\/em>-values here:<a href=\"https:\/\/goodcalculators.com\/student-t-value-calculator\/\"> https:\/\/goodcalculators.com\/student-t-value-calculator\/<\/a>. 
<a href=\"#return-footnote-2083-5\" class=\"return-footnote\" aria-label=\"Return to footnote 5\">&crarr;<\/a><\/li><li id=\"footnote-2083-6\">You can check it here by selecting \"up to Z\": <a href=\"https:\/\/www.mathsisfun.com\/data\/standard-normal-distribution-table.html\">https:\/\/www.mathsisfun.com\/data\/standard-normal-distribution-table.html<\/a>. <a href=\"#return-footnote-2083-6\" class=\"return-footnote\" aria-label=\"Return to footnote 6\">&crarr;<\/a><\/li><li id=\"footnote-2083-7\"><span style=\"font-size: 1rem\">This is actually one of the reasons some have called for abandoning <em>p<\/em>-values, statistical significance, and hypothesis testing whatsoever, because statistical significance is not indicative of effect size and is frequently over-stated to mean more than it does; at the same time over-reliance on <em>p<\/em>-values decreases attention to effect size, careful study design, context, etc.<\/span><span style=\"text-indent: 1em;font-size: 1rem\"> <a href=\"#return-footnote-2083-7\" class=\"return-footnote\" aria-label=\"Return to footnote 
7\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":533,"menu_order":4,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-2083","chapter","type-chapter","status-publish","hentry"],"part":1051,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/2083","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/users\/533"}],"version-history":[{"count":7,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/2083\/revisions"}],"predecessor-version":[{"id":2088,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/2083\/revisions\/2088"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/parts\/1051"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/2083\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/media?parent=2083"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapter-type?post=2083"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/contributor?post=2083"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/license?post=2083"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}