{"id":94,"date":"2018-10-31T17:37:55","date_gmt":"2018-10-31T21:37:55","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/simplestats\/?post_type=chapter&#038;p=94"},"modified":"2019-12-09T15:46:20","modified_gmt":"2019-12-09T20:46:20","slug":"6-5-the-sampling-distribution","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/simplestats\/chapter\/6-5-the-sampling-distribution\/","title":{"raw":"6.5 The Sampling Distribution","rendered":"6.5 The Sampling Distribution"},"content":{"raw":"[latexpage]\r\n\r\nWith this section we reach a point where you will have to make good use of your imagination and abstract thinking.\u00a0 Unlike our presentation and discussion of variables early on, giving real-life examples for this material becomes impossible as the sampling distribution lies firmly in the realm of abstract mathematical concepts. Yet we need it because it's the sampling distribution which makes inference possible and bridges the gap between a sample and the population from which it was taken.\r\n\r\n&nbsp;\r\n\r\nThus, keeping the promise from my introduction to hold everything to the minimum necessary for understanding, below I offer as non-technical and non-mathematical an explanation as possible of what the sampling distribution is and how we use it. However, this course of action has its obvious, inevitable downsides: since we are skipping the actual mathematical proofs and going directly for the results of these, you will have to accept the presentation on my word. This is a hard thing to ask of anyone (<em>\"it is what it is because I tell you so\"<\/em>). My justification for doing so is that the vast majority of my students so far seem to find the alternative (<em>\"it is what it is because of all this very long presentation of complex mathematical concepts and complicated procedures\"<\/em>) even more unpalatable, without any gains in comprehensibility -- and, as such, ultimately mostly useless. 
(Of course, if interested, you can always check other, more comprehensive books and online sources.)\r\n\r\n&nbsp;\r\n\r\nDespite the dire warning about upcoming doom in the form of abstract concepts, I still begin with an example.\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em>Example 6.2\u00a0Age of Classmates<\/em><\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n&nbsp;\r\n\r\nImagine you are enrolled in a class along with 49 other students, so the total class size is 50. Let's say as a class assignment (perhaps in a research methods class) you are tasked with taking a sample of your class and administering a survey to your sample. In this sense, your class is your population of interest. For simplicity's sake, we focus on one possible question, say,\u00a0<em>age of respondent<\/em>. You want to know the average age of the study population but, instead of asking all 50 of your classmates, you draw a random sample of them for the purposes of estimating the class's average age[footnote]Of course, with a population of only 50 in real life you can just collect the information from everyone. I'm using a small-size example for teaching purposes only and to make calculations manageable. The principle of sampling applies equally not only to a population of 50 but of any size -- and when your study population's size is in the millions, you wouldn't attempt to survey all of them (barring the already discussed case for censuses).[\/footnote].\r\n\r\n&nbsp;\r\n\r\nNow despite that I still haven't said anything about sample size (but we're getting there), I'll assume that a sample size of 10 (i.e., 20 percent of the population) would sound reasonable enough to you. 
The random draw (with replacement) yields the following ten classmates' ages:\r\n\r\n&nbsp;\r\n\r\n19, 19, 20, 20, 20, 21, 21, 22, 23, 28\r\n\r\n&nbsp;\r\n\r\nBased on these values, the average age of the sample, $\\overline{x}$, is\r\n\r\n&nbsp;\r\n\r\n$ \\overline{x}=\\frac{\\sum\\limits_{i=1}^{N}{x_i}}{N}=\\frac{(19)2+(20)3+(21)2+22+23+28}{10}= \\frac{213}{10}=21.3$\r\n\r\n&nbsp;\r\n\r\nI.e., your sample's average age is 21.3 years. Considering that these ten people were randomly drawn, and that they are, well, only <em>ten<\/em>, can we assume that the average age of your <em>entire<\/em> class of 50 is 21.3 years?\r\n\r\n&nbsp;\r\n\r\nWhile this is a good -- <em>educated<\/em>\u00a0even -- guess and a good starting point, <strong>it is unlikely that, had you polled everyone in the class, your calculation would have produced <em>exactly<\/em> 21.3<\/strong>. After all, polling 10 people is not the same as polling 50; in the latter case your calculation would include a lot more information than in the former. Thus, it's also reasonable to expect that there will be <em>some<\/em> difference between the mean based on the sample, $\\overline{x}$, and the <em>true<\/em> population mean,\u00a0<em>\u03bc<\/em>.\r\n\r\n&nbsp;\r\n\r\nThen how about if you decided to draw another random sample of ten people out of your class? 
Would you expect to have the exact same mean of 21.3 years?\r\n\r\n&nbsp;\r\n\r\nUnless you somehow end up with the exact same ten people who were in the first sample (and after Chapter 5 on probability you should know how minuscule that probability is), it is again unlikely you'd get the same mean.\r\n\r\n&nbsp;\r\n\r\nWe could imagine that the new, second sample's ages might look like this:\r\n\r\n&nbsp;\r\n\r\n18, 19, 19, 19, 20, 20, 22, 22, 24, 25\r\n\r\n&nbsp;\r\n\r\nBased on these ten new values, the average age of the second sample (let's call it $\\overline{x_2}$) is:\r\n\r\n&nbsp;\r\n\r\n$\\overline{x_2}=\\frac{\\sum\\limits_{i=1}^{N}{x_i}}{N}=\\frac{18+(19)3+(20)2+(22)2+24+25}{10}= \\frac{208}{10}=20.8$\r\n\r\n&nbsp;\r\n\r\nI.e., your <em>second<\/em> sample's average age is 20.8 years, despite it being drawn from the same population. Your two samples (of the same size) yielded two close -- but still different -- numbers.\r\n\r\n&nbsp;\r\n\r\nAs well, following the same logic, it's just as unlikely that the population mean\u00a0<em>\u03bc<\/em> (your class's average age) is 20.8 years as it was unlikely that it's 21.3 years (the sample is still only 10 people).\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\n<\/div>\r\n&nbsp;\r\n\r\nWhat then? How can we trust a sample statistic to estimate a population parameter? It appears we need more information. Before we get to that, however, let's finally address the elephant in the room - the issue of <em>sample size<\/em> I have been neglecting so far.\r\n\r\n&nbsp;\r\n\r\nOne reason you might think the sample estimates in the Example 6.2 above differ (both from each other and from the true population mean) could be the sample size: isn't <em>N<\/em>=10 just too small? The answer is of the <i>yes-but-no\u00a0<\/i>variety: No, a sample size that's 20 percent of the population size is actually quite big for a research study of a typical, relatively large size. 
Yes, a sample of 10 out of a population of 50 <em>is<\/em>\u00a0way too small. And, in general, yes, the larger the sample the better. But let's unpack, and qualify, all three of these seemingly contradictory pieces of information properly.\r\n\r\n&nbsp;\r\n\r\nInferential statistics -- at least the typical kind discussed in this textbook -- is about estimating <em>relatively large<\/em> populations; luckily, quantitative social science research most commonly deals with such populations[footnote]There is no magic number as to what constitutes a relatively large population, and therefore an adequate minimum requirement for a sample size. For the latter, I could offer 100; some suggest 30, others 50 but in truth all these are more or less arbitrary. It <em>is<\/em> a fact that having a larger sample (in both an absolute and a proportionate sense) puts you on safer ground in terms of statistical inference (this has to do with probability theory, the law of large numbers, the sampling distribution, the normal curve, and the Central Limit Theorem discussed below; a minimum of <em>N<\/em>=30 is a frequently cited number for these to work). What you can take away from this is that it's better to avoid dealing with\u00a0<span style=\"text-indent: 1em;font-size: 14pt\"><em>N<\/em>&lt;30 (or even <em>N<\/em>&lt;100 if possible), as the tools and methods discussed in this textbook are better suited for larger sample (and population) sizes. 
[\/footnote].\u00a0<\/span>\r\n\r\n&nbsp;\r\n\r\n<strong style=\"text-indent: 1em;font-size: 14pt\">The recommended sample size depends on the size of the population it will be used to estimate but at <em>diminishing returns<\/em>: the larger the population, the larger the sample's <em>absolute <\/em>size should generally be -- but at the same time the gains of the larger sample size diminish (to zero), the larger the population is.\u00a0<\/strong><span style=\"text-indent: 1em;font-size: 14pt\">In other words, <strong>smaller populations need samples of bigger proportion to represent them correctly, while larger populations need samples of smaller (and smaller, and smaller) proportions to do so.<\/strong> (This also means that even if you have larger and larger populations, there will be no gains in increasing the sample size beyond a certain point.)<\/span>\r\n\r\n&nbsp;\r\n\r\nIn reality, no one would try<em> estimating<\/em> the parameters of a population as small as 50, as in most cases they can be easily obtained -- not to mention that to have a meaningful estimate of a population that small, one would indeed need almost the entire population. Sample size calculators are abundant and free online[footnote]You can find one example\u00a0at SurveyMonkey.com (<a href=\"https:\/\/www.surveymonkey.com\/mp\/sample-size-calculator\/?ut_source=help_center\">https:\/\/www.surveymonkey.com\/mp\/sample-size-calculator\/?ut_source=help_center<\/a>).[\/footnote] but to give you an idea of the diminishing returns to increasing sample size, I'll just list a few examples. 
To estimate a population of 200, you'll typically need a sample of about 180[footnote]Here and from here on, \"typically\" refers to a frequently used <em>margin of error<\/em> of \u00b12.5%; more on what this actually means in Section 6.6 below.[\/footnote]; to estimate a population of 500, you'll typically need a sample of about 380; to estimate a population of 1,000, a sample of 600 would be adequate; for a population of 2,000, a sample of about 870 would work; for a population of 5,000, a sample of 1,200 would be enough; for a population of 10,000, a sample of about 1,300 would be enough... then for a population of 50,000, a sample of only about 1,500 would suffice, and a population of 100,000 would do just as well with the same number of 1,500[footnote]You may also find a table summarizing sample sizes, such as this one, useful: <a href=\"https:\/\/www.research-advisors.com\/tools\/SampleSize.htm\">https:\/\/www.research-advisors.com\/tools\/SampleSize.htm<\/a>.[\/footnote].\r\n\r\n&nbsp;\r\n\r\nWhat it comes down to is that, to the surprise of many, actually<strong> a sample size of \"just\" 1,500 respondents can safely and accurately estimate any population of 25,000 or more.<\/strong> This also means that a random sample of 1,500 people can statistically represent, for example, both the population of Toronto (2.7+ mln. people) <em>and<\/em> the population of Canada (36.7 mln. people) -- however, it cannot be the <em>same<\/em> sample (the former needs to be drawn from Torontonians only, the latter from all Canadians[footnote]In truth, researchers do want larger samples to represent Canada (or other countries' populations) but that's only to increase the <em>power<\/em>\u00a0(defined later) of their statistical findings, not their generalizability. 
This desire for larger <em>N<\/em> is, of course, constrained by limited resources (time, money, etc.).[\/footnote]).\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--learning-objectives\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch Out!! #11<\/strong><\/span>... for (Mis)Judging a Study On Its Sample Size<\/em><\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n&nbsp;\r\n\r\nThe point against judging a study on its sample size alone should be clear already but it bears repeating. When people unfamiliar with statistics encounter social-scientific reports based on studies with what they consider a \"too small\" sample size, they tend to dismiss the findings. They tend to consider the \"only 500 respondents\" or \"only 1000 cases\" too few to accurately represent the population from which they were drawn, especially if the population is, in their view, disproportionately large.\r\n\r\n&nbsp;\r\n\r\nAs you should have learned by now, the generalizability of a study is more a matter of <em>how<\/em> the sample is drawn than of its size (beyond a certain point). As long as the chosen sampling method is a type of random sampling, and the sample size is adequate for the population size[footnote]At the desired -- and reported -- margin of error.[\/footnote], the results of the study will be generalizable to the population. The actual sample size doesn't matter for that, even if it may look \"too small\" to some.\r\n\r\n<\/div>\r\n<\/div>\r\n&nbsp;\r\n\r\nIn any event, even if a larger sample is, from a certain point on, <em>unnecessary<\/em> (as demonstrated above), it is a logical inevitability that the closer the sample is in size to the population from which it is drawn, the smaller the difference between statistics and parameters should be. 
Even in the Example 6.2 above with its imagined, only-<span style=\"text-indent: 18.6667px;font-size: 14pt\">for-illustration-purposes<\/span><span style=\"text-indent: 1em;font-size: 14pt\">\u00a0population of 50, getting information from 40 of your classmates instead of the 10 we used in the example should get us an average age that is closer to the true population age (of all 50 students).\u00a0 <\/span>\r\n\r\n&nbsp;\r\n\r\n<span style=\"text-indent: 1em;font-size: 14pt\">However, as a corollary, unless we obtain information from truly everyone (i.e., we do a census), <strong>in random sampling a difference between the sample statistic and the population parameter will always exist<\/strong>[footnote]Well, <em>almost<\/em> always: it is possible (though very unlikely) that a sample will just so happen to produce the true population parameter. This will also be a result of random chance, as unlikely as it may be.[\/footnote]<strong>.<\/strong>\u00a0<strong>This difference between the estimate (the statistic)\u00a0 and what is being estimated (the parameter) is called <em>random error<\/em>.<\/strong>\u00a0Random error is <em>inevitable<\/em> -- no matter what we do, a sample will always only produce an estimate, never the \"real thing\", as it were.<\/span>\r\n\r\n&nbsp;\r\n\r\nGoing back to Example 6.2 above, we can extrapolate that when randomly drawing a sample after sample after sample (of the same size) an infinite number of times, and calculating a mean after a mean after a mean, we'll get a long (well, <em>infinite<\/em>) number of means which will all be somewhat close to, but not exactly equal to, the true population mean. 
If you could possibly imagine this very long (infinite[footnote]For ease of imagination, I'll stick to \"very long\/large\" from now on, but at the far back of your mind, remember it's actually infinite.[\/footnote]) list of means as similar to a variable with a large number of observations, please do so, it helps.\r\n\r\n&nbsp;\r\n\r\nThis variable you imagined (made of the very large number of means that would be produced by the very large number of samples if we took them) will have a frequency distribution just like any real variable we have discussed so far. <strong>The <\/strong><strong>distribution of the variable made of the means is called the <em>sampling distribution of the mean<\/em><\/strong>[footnote]I provide this definition only to make understanding the sampling distribution easier. It's in no way the technical definition of the sampling distribution. As well, keep in mind that this \"variable\" made of the means is a perfectly imaginary heuristic device.[\/footnote]<strong>.<\/strong> However, since all this is <em>theoretical<\/em> (we do not take more than one sample), this distribution is not really about actual frequencies but rather about expected\/relative frequencies, i.e. probabilities. As such, <strong>the sampling distribution is a <em>probability distribution\u00a0<\/em><span style=\"text-indent: 18.6667px;font-size: 14pt\">-- it lists a mean's <em>probability<\/em> of occurring<\/span><\/strong><span style=\"text-indent: 1em;font-size: 14pt\">[footnote]Compare this to flipping a coin, or throwing a die: as we saw, in both cases the distribution of the <em>possible<\/em> outcomes (over infinite number of flips\/throws) is a calculated and known probability distribution. 
After all, that's why we know that the probability of getting tails or heads is 0.5<em> in theory<\/em>, just like it's 0.167 for throwing any of the die's six numbers <em>in theory<\/em> (even if calculating actual flipped\/thrown frequencies in real life yields different results).[\/footnote]<strong>.<\/strong><\/span>\r\n\r\n&nbsp;\r\n\r\nIn a more precise phrasing, all statistics based on samples (e.g., means, medians, deviations, etc. plus many others we haven't yet encountered) have a sampling distribution, which refers to their theoretical[footnote]It is theoretical because we do not actually take multiple samples, much less an infinite number of them, as there is no need: courtesy of probability theory and the Central Limit Theorem, we just <em>know<\/em> what <em>would<\/em> happen if we did.[\/footnote] variability over repeated (to infinity) random samples of specific (and equal) size. What we know about the sampling distribution of sample statistics is summarized in the Central Limit Theorem, next.\r\n\r\n&nbsp;","rendered":"<p>With this section we reach a point where you will have to make good use of your imagination and abstract thinking.\u00a0 Unlike our presentation and discussion of variables early on, giving real-life examples for this material becomes impossible as the sampling distribution lies firmly in the realm of abstract mathematical concepts. Yet we need it because it&#8217;s the sampling distribution which makes inference possible and bridges the gap between a sample and the population from which it was taken.<\/p>\n<p>&nbsp;<\/p>\n<p>Thus, keeping the promise from my introduction to hold everything to the minimum necessary for understanding, below I offer as non-technical and non-mathematical an explanation as possible of what the sampling distribution is and how we use it. 
However, this course of action has its obvious, inevitable downsides: since we are skipping the actual mathematical proofs and going directly for the results of these, you will have to accept the presentation on my word. This is a hard thing to ask of anyone (<em>&#8220;it is what it is because I tell you so&#8221;<\/em>). My justification for doing so is that the vast majority of my students so far seem to find the alternative (<em>&#8220;it is what it is because of all this very long presentation of complex mathematical concepts and complicated procedures&#8221;<\/em>) even more unpalatable, without any gains in comprehensibility &#8212; and, as such, ultimately mostly useless. (Of course, if interested, you can always check other, more comprehensive books and online sources.)<\/p>\n<p>&nbsp;<\/p>\n<p>Despite the dire warning about upcoming doom in the form of abstract concepts, I still begin with an example.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\"><em>Example 6.2\u00a0Age of Classmates<\/em><\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>&nbsp;<\/p>\n<p>Imagine you are enrolled in a class along with 49 other students, so the total class size is 50. Let&#8217;s say as a class assignment (perhaps in a research methods class) you are tasked with taking a sample of your class and administering a survey to your sample. In this sense, your class is your population of interest. For simplicity&#8217;s sake, we focus on one possible question, say,\u00a0<em>age of respondent<\/em>. You want to know the average age of the study population but, instead of asking all 50 of your classmates, you draw a random sample of them for the purposes of estimating the class&#8217;s average age<a class=\"footnote\" title=\"Of course, with a population of only 50 in real life you can just collect the information from everyone. 
I'm using a small-size example for teaching purposes only and to make calculations manageable. The principle of sampling applies equally not only to a population of 50 but of any size -- and when your study population's size is in the millions, you wouldn't attempt to survey all of them (barring the already discussed case for censuses).\" id=\"return-footnote-94-1\" href=\"#footnote-94-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>Now despite that I still haven&#8217;t said anything about sample size (but we&#8217;re getting there), I&#8217;ll assume that a sample size of 10 (i.e., 20 percent of the population) would sound reasonable enough to you. The random draw (with replacement) yields the following ten classmates&#8217; ages:<\/p>\n<p>&nbsp;<\/p>\n<p>19, 19, 20, 20, 20, 21, 21, 22, 23, 28<\/p>\n<p>&nbsp;<\/p>\n<p>Based on these values, the average age of the sample, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-0d00c2da2b2541a97ae0ac3c10e1504e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"11\" style=\"vertical-align: 0px;\" \/>, is<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-6bc03131b1785e52c369a9369d0f59f5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" 
alt=\"&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#92;&#115;&#117;&#109;&#92;&#108;&#105;&#109;&#105;&#116;&#115;&#95;&#123;&#105;&#61;&#49;&#125;&#94;&#123;&#78;&#125;&#123;&#120;&#95;&#105;&#125;&#125;&#123;&#78;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#57;&#41;&#50;&#43;&#40;&#50;&#48;&#41;&#51;&#43;&#40;&#50;&#49;&#41;&#50;&#43;&#50;&#50;&#43;&#50;&#51;&#43;&#50;&#56;&#125;&#123;&#49;&#48;&#125;&#61;&#32;&#92;&#102;&#114;&#97;&#99;&#123;&#50;&#49;&#51;&#125;&#123;&#49;&#48;&#125;&#61;&#50;&#49;&#46;&#51;\" title=\"Rendered by QuickLaTeX.com\" height=\"44\" width=\"392\" style=\"vertical-align: -7px;\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>I.e., your sample&#8217;s average age is 21.3 years. Considering that these ten people were randomly drawn, and that they are, well, only <em>ten<\/em>, can we assume that the average age of your <em>entire<\/em> class of 50 is 21.3 years?<\/p>\n<p>&nbsp;<\/p>\n<p>While this is a good &#8212; <em>educated<\/em>\u00a0even &#8212; guess and a good starting point, <strong>it is unlikely that, had you polled everyone in the class, your calculation would have produced <em>exactly<\/em> 21.3<\/strong>. After all, polling 10 people is not the same as polling 50; in the latter case your calculation would include a lot more information than in the former. 
Thus, it&#8217;s also reasonable to expect that there will be <em>some<\/em> difference between the mean based on the sample, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-0d00c2da2b2541a97ae0ac3c10e1504e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"11\" style=\"vertical-align: 0px;\" \/>, and the <em>true<\/em> population mean,\u00a0<em>\u03bc<\/em>.<\/p>\n<p>&nbsp;<\/p>\n<p>Then how about if you decided to draw another random sample of ten people out of your class? Would you expect to have the exact same mean of 21.3 years?<\/p>\n<p>&nbsp;<\/p>\n<p>Unless you somehow end up with the exact same ten people who were in the first sample (and after Chapter 5 on probability you should know how minuscule that probability is), it is again unlikely you&#8217;d get the same mean.<\/p>\n<p>&nbsp;<\/p>\n<p>We could imagine that the new, second sample&#8217;s ages might look like this:<\/p>\n<p>&nbsp;<\/p>\n<p>18, 19, 19, 19, 20, 20, 22, 22, 24, 25<\/p>\n<p>&nbsp;<\/p>\n<p>Based on these ten new values, the average age of the second sample (let&#8217;s call it <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-5cb90fd035553e6365148687264024f8_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#95;&#50;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"14\" width=\"18\" style=\"vertical-align: -3px;\" \/>) is:<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-c516e08a77ad28f132f3642976382047_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" 
alt=\"&#92;&#111;&#118;&#101;&#114;&#108;&#105;&#110;&#101;&#123;&#120;&#95;&#50;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#92;&#115;&#117;&#109;&#92;&#108;&#105;&#109;&#105;&#116;&#115;&#95;&#123;&#105;&#61;&#49;&#125;&#94;&#123;&#78;&#125;&#123;&#120;&#95;&#105;&#125;&#125;&#123;&#78;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#56;&#43;&#40;&#49;&#57;&#41;&#51;&#43;&#40;&#50;&#48;&#41;&#50;&#43;&#40;&#50;&#50;&#41;&#50;&#43;&#50;&#52;&#43;&#50;&#53;&#125;&#123;&#49;&#48;&#125;&#61;&#32;&#92;&#102;&#114;&#97;&#99;&#123;&#50;&#48;&#56;&#125;&#123;&#49;&#48;&#125;&#61;&#50;&#48;&#46;&#56;\" title=\"Rendered by QuickLaTeX.com\" height=\"44\" width=\"400\" style=\"vertical-align: -7px;\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>I.e., your <em>second<\/em> sample&#8217;s average age is 20.8 years, despite it being drawn from the same population. Your two samples (of the same size) yielded two close &#8212; but still different &#8212; numbers.<\/p>\n<p>&nbsp;<\/p>\n<p>As well, following the same logic, it&#8217;s just as unlikely that the population mean\u00a0<em>\u03bc<\/em> (your class&#8217;s average age) is 20.8 years as it was unlikely that it&#8217;s 21.3 years (the sample is still only 10 people).<\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>What then? How can we trust a sample statistic to estimate a population parameter? It appears we need more information. Before we get to that, however, let&#8217;s finally address the elephant in the room &#8211; the issue of <em>sample size<\/em> I have been neglecting so far.<\/p>\n<p>&nbsp;<\/p>\n<p>One reason you might think the sample estimates in the Example 6.2 above differ (both from each other and from the true population mean) could be the sample size: isn&#8217;t <em>N<\/em>=10 just too small? The answer is of the <i>yes-but-no\u00a0<\/i>variety: No, a sample size that&#8217;s 20 percent of the population size is actually quite big for a research study of a typical, relatively large size. 
Yes, a sample of 10 out of a population of 50 <em>is<\/em>\u00a0way too small. And, in general, yes, the larger the sample the better. But let&#8217;s unpack, and qualify, all three of these seemingly contradictory pieces of information properly.<\/p>\n<p>&nbsp;<\/p>\n<p>Inferential statistics &#8212; at least the typical kind discussed in this textbook &#8212; is about estimating <em>relatively large<\/em> populations; luckily, quantitative social science research most commonly deals with such populations<a class=\"footnote\" title=\"There is no magic number as to what constitutes a relatively large population, and therefore an adequate minimum requirement for a sample size. For the latter, I could offer 100; some suggest 30, others 50 but in truth all these are more or less arbitrary. It is a fact that having a larger sample (in both an absolute and a proportionate sense) puts you on safer ground in terms of statistical inference (this has to do with probability theory, the law of large numbers, the sampling distribution, the normal curve, and the Central Limit Theorem discussed below; a minimum of N=30 is a frequently cited number for these to work). 
What you can take away from this is that it's better to avoid dealing with\u00a0N&lt;30 (or even N&lt;100 if possible), as the tools and methods discussed in this textbook are better suited for larger sample (and population) sizes.\" id=\"return-footnote-94-2\" href=\"#footnote-94-2\" aria-label=\"Footnote 2\"><sup class=\"footnote\">[2]<\/sup><\/a>.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><strong style=\"text-indent: 1em;font-size: 14pt\">The recommended sample size depends on the size of the population it will be used to estimate but at <em>diminishing returns<\/em>: the larger the population, the larger the sample&#8217;s <em>absolute <\/em>size should generally be &#8212; but at the same time the gains of the larger sample size diminish (to zero), the larger the population is.\u00a0<\/strong><span style=\"text-indent: 1em;font-size: 14pt\">In other words, <strong>smaller populations need samples of bigger proportion to represent them correctly, while larger populations need samples of smaller (and smaller, and smaller) proportions to do so.<\/strong> (This also means that even if you have larger and larger populations, there will be no gains in increasing the sample size beyond a certain point.)<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>In reality, no one would try<em> estimating<\/em> the parameters of a population as small as 50, as in most cases they can be easily obtained &#8212; not to mention that to have a meaningful estimate of a population that small, one would indeed need almost the entire population. Sample size calculators are abundant and free online<a class=\"footnote\" title=\"You can find one example\u00a0at SurveyMonkey.com (https:\/\/www.surveymonkey.com\/mp\/sample-size-calculator\/?ut_source=help_center).\" id=\"return-footnote-94-3\" href=\"#footnote-94-3\" aria-label=\"Footnote 3\"><sup class=\"footnote\">[3]<\/sup><\/a> but to give you an idea of the diminishing returns to increasing sample size, I&#8217;ll just list a few examples. 
To estimate a population of 200, you&#8217;ll typically need a sample of about 180<a class=\"footnote\" title=\"Here and from here on, &quot;typically&quot; refers to a frequently used margin of error of \u00b12.5%; more on what this actually means in Section 6.6 below.\" id=\"return-footnote-94-4\" href=\"#footnote-94-4\" aria-label=\"Footnote 4\"><sup class=\"footnote\">[4]<\/sup><\/a>; to estimate a population of 500, you&#8217;ll typically need a sample of about 380; to estimate a population of 1,000, a sample of 600 would be adequate; for a population of 2,000, a sample of about 870 would work; for a population of 5,000, a sample of 1,200 would be enough; for a population of 10,000, a sample of about 1,300 would be enough&#8230; then for a population of 50,000, a sample of only about 1,500 would suffice, and a population of 100,000 would do just as well with the same number of 1,500<a class=\"footnote\" title=\"You may also find a table summarizing sample sizes, such as this one, useful: https:\/\/www.research-advisors.com\/tools\/SampleSize.htm.\" id=\"return-footnote-94-5\" href=\"#footnote-94-5\" aria-label=\"Footnote 5\"><sup class=\"footnote\">[5]<\/sup><\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>What it comes down to is that, to the surprise of many, actually<strong> a sample size of &#8220;just&#8221; 1,500 respondents can safely and accurately estimate any population of 25,000 or more.<\/strong> This also means that a random sample of 1,500 people can statistically represent, for example, both the population of Toronto (2.7+ mln. people) <em>and<\/em> the population of Canada (36.7 mln. people) &#8212; however, it cannot be the <em>same<\/em> sample (the former needs to be drawn from Torontonians only, the latter from all Canadians<a class=\"footnote\" title=\"In truth, researchers do want larger samples to represent Canada (or other countries' populations) but that's only to increase the power\u00a0(defined later) of their statistical findings, not their generalizability. 
This desire for larger N is, of course, constrained by limited resources (time, money, etc.).\" id=\"return-footnote-94-6\" href=\"#footnote-94-6\" aria-label=\"Footnote 6\"><sup class=\"footnote\">[6]<\/sup><\/a>).<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--learning-objectives\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch Out!! #11<\/strong><\/span>&#8230; for (Mis)Judging a Study On Its Sample Size<\/em><\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>&nbsp;<\/p>\n<p>The point against judging a study on its sample size alone should be clear already but it bears repeating. When people unfamiliar with statistics encounter social-scientific reports based on studies of what they consider a &#8220;too small&#8221; sample size, they tend to dismiss the findings. They tend to consider the &#8220;only 500 respondents&#8221; or &#8220;only 1,000 cases&#8221; too few to accurately represent the population from which they were drawn, especially if the population is, in their view, disproportionately large.<\/p>\n<p>&nbsp;<\/p>\n<p>As you should have learned by now, the generalizability of a study is more a matter of <em>how<\/em> the sample is drawn than of its size (beyond a certain point). As long as the chosen sampling method is a type of random sampling, and the sample size is adequate for the population size<a class=\"footnote\" title=\"At the desired -- and reported -- margin of error.\" id=\"return-footnote-94-7\" href=\"#footnote-94-7\" aria-label=\"Footnote 7\"><sup class=\"footnote\">[7]<\/sup><\/a>, the results of the study will be generalizable to the population. 
The actual sample size doesn&#8217;t matter for that, even if it may look &#8220;too small&#8221; to some.<\/p>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>In any event, even if increasing the sample size is <em>unnecessary<\/em> beyond a certain point, as demonstrated above, it is a logical inevitability that the closer the sample is in size to the population from which it is drawn, the smaller the difference between statistics and parameters should be. Even in Example 6.2 above with its imagined, only-<span style=\"text-indent: 18.6667px;font-size: 14pt\">for-illustration-purposes<\/span><span style=\"text-indent: 1em;font-size: 14pt\">\u00a0population of 50, getting information from 40 of your classmates instead of the 10 we used in the example should get us an average age that is closer to the true population average age (of all 50 students).\u00a0 <\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"text-indent: 1em;font-size: 14pt\">However, as a corollary, unless we obtain information from truly everyone (i.e., we do a census), <strong>in random sampling a difference between the sample statistic and the population parameter will always exist<\/strong><a class=\"footnote\" title=\"Well, almost always: it is possible (though very unlikely) that a sample will just so happen to produce the true population parameter. 
This will also be a result of random chance, as unlikely as it may be.\" id=\"return-footnote-94-8\" href=\"#footnote-94-8\" aria-label=\"Footnote 8\"><sup class=\"footnote\">[8]<\/sup><\/a><strong>.<\/strong>\u00a0<strong>This difference between the estimate (the statistic) and what is being estimated (the parameter) is called <em>random error<\/em>.<\/strong>\u00a0Random error is <em>inevitable<\/em> &#8212; no matter what we do, a sample will always only produce an estimate, never the &#8220;real thing&#8221;, as it were.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>Going back to Example 6.2 above, we can extrapolate that when randomly drawing sample after sample after sample (of the same size) an infinite number of times, and calculating mean after mean after mean, we&#8217;ll get a long (well, <em>infinite<\/em>) list of means which will generally be somewhat close to, but almost never exactly equal to, the true population mean. If you could possibly imagine this very long (infinite<a class=\"footnote\" title=\"For ease of imagination, I'll stick to &quot;very long\/large&quot; from now on, but at the far back of your mind, remember it's actually infinite.\" id=\"return-footnote-94-9\" href=\"#footnote-94-9\" aria-label=\"Footnote 9\"><sup class=\"footnote\">[9]<\/sup><\/a>) list of means as similar to a variable with a large number of observations, please do so; it helps.<\/p>\n<p>&nbsp;<\/p>\n<p>This variable you imagined (made of the very large number of means that would be produced by the very large number of samples if we took them) will have a frequency distribution just like any real variable we have discussed so far. <strong>The distribution of the variable made of the means is called the <em>sampling distribution of the mean<\/em><\/strong><a class=\"footnote\" title=\"I provide this definition only to make understanding the sampling distribution easier. It's in no way the technical definition of the sampling distribution. 
As well, keep in mind that this &quot;variable&quot; made of the means is a perfectly imaginary heuristic device.\" id=\"return-footnote-94-10\" href=\"#footnote-94-10\" aria-label=\"Footnote 10\"><sup class=\"footnote\">[10]<\/sup><\/a><strong>.<\/strong> However, since all this is <em>theoretical<\/em> (we do not take more than one sample), this distribution is not really about actual frequencies but rather about expected\/relative frequencies, i.e. probabilities. As such, <strong>the sampling distribution is a <em>probability distribution\u00a0<\/em><span style=\"text-indent: 18.6667px;font-size: 14pt\">&#8212; it lists a mean&#8217;s <em>probability<\/em> of occurring<\/span><\/strong><span style=\"text-indent: 1em;font-size: 14pt\"><a class=\"footnote\" title=\"Compare this to flipping a coin, or throwing a die: as we saw, in both cases the distribution of the possible outcomes (over infinite number of flips\/throws) is a calculated and known probability distribution. After all, that's why we know that the probability of getting tails or heads is 0.5 in theory, just like it's 0.167 for throwing any of the die's six numbers in theory (even if calculating actual flipped\/thrown frequencies in real life yields different results).\" id=\"return-footnote-94-11\" href=\"#footnote-94-11\" aria-label=\"Footnote 11\"><sup class=\"footnote\">[11]<\/sup><\/a><strong>.<\/strong><\/span><\/p>\n<p>&nbsp;<\/p>\n<p>In a more precise phrasing, all statistics based on samples (e.g., means, medians, deviations, etc. 
plus many others we haven&#8217;t yet encountered) have a sampling distribution, which refers to their theoretical<a class=\"footnote\" title=\"It is theoretical because we do not actually take multiple, much less infinite, number of samples as there is no need: courtesy of probability theory and the Central Limit Theorem, we just know what would happen if we did.\" id=\"return-footnote-94-12\" href=\"#footnote-94-12\" aria-label=\"Footnote 12\"><sup class=\"footnote\">[12]<\/sup><\/a> variability over repeated (to infinity) random samples of specific (and equal) size. What we know about the sampling distribution of sample statistics is summarized in the Central Limit Theorem, next.<\/p>\n<p>&nbsp;<\/p>\n<hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-94-1\">Of course, with a population of only 50 in real life you can just collect the information from everyone. I'm using a small-size example for teaching purposes only and to make calculations manageable. The principle of sampling applies equally not only to a population of 50 but of any size -- and when your study population's size is in the millions, you wouldn't attempt to survey all of them (barring the already discussed case for censuses). <a href=\"#return-footnote-94-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><li id=\"footnote-94-2\">There is no magic number as to what constitutes a relatively large population, and therefore an adequate minimum requirement for a sample size. For the latter, I could offer 100; some suggest 30, others 50 but in truth all these are more or less arbitrary. 
It <em>is<\/em> a fact that having a larger sample (in both the absolute and the proportionate sense) puts you on safer ground in terms of statistical inference (this has to do with probability theory, the law of large numbers, the sampling distribution, the normal curve, and the Central Limit Theorem discussed below, for which a minimum of <em>N<\/em>=30 is a frequently cited requirement). What you can take away from this is that it's better to avoid dealing with\u00a0<span style=\"text-indent: 1em;font-size: 14pt\"><em>N<\/em>&lt;30 (or even <em>N<\/em>&lt;100 if possible), as the tools and methods discussed in this textbook are better suited for larger sample (and population) sizes.<\/span> <a href=\"#return-footnote-94-2\" class=\"return-footnote\" aria-label=\"Return to footnote 2\">&crarr;<\/a><\/li><li id=\"footnote-94-3\">You can find one example\u00a0at SurveyMonkey.com (<a href=\"https:\/\/www.surveymonkey.com\/mp\/sample-size-calculator\/?ut_source=help_center\">https:\/\/www.surveymonkey.com\/mp\/sample-size-calculator\/?ut_source=help_center<\/a>). <a href=\"#return-footnote-94-3\" class=\"return-footnote\" aria-label=\"Return to footnote 3\">&crarr;<\/a><\/li><li id=\"footnote-94-4\">Here and from here on, \"typically\" refers to a frequently used <em>margin of error<\/em> of \u00b12.5%; more on what this actually means in Section 6.6 below. <a href=\"#return-footnote-94-4\" class=\"return-footnote\" aria-label=\"Return to footnote 4\">&crarr;<\/a><\/li><li id=\"footnote-94-5\">You may also find a table summarizing sample sizes, like this one, useful: <a href=\"https:\/\/www.research-advisors.com\/tools\/SampleSize.htm\">https:\/\/www.research-advisors.com\/tools\/SampleSize.htm<\/a>. 
<a href=\"#return-footnote-94-5\" class=\"return-footnote\" aria-label=\"Return to footnote 5\">&crarr;<\/a><\/li><li id=\"footnote-94-6\">In truth, researchers do want larger samples to represent Canada (or other countries' populations) but that's only to increase the <em>power<\/em>\u00a0(defined later) of their statistical findings, not their generalizability. This desire for larger <em>N<\/em> is, of course, constrained by limited resources (time, money, etc.). <a href=\"#return-footnote-94-6\" class=\"return-footnote\" aria-label=\"Return to footnote 6\">&crarr;<\/a><\/li><li id=\"footnote-94-7\">At the desired -- and reported -- margin of error. <a href=\"#return-footnote-94-7\" class=\"return-footnote\" aria-label=\"Return to footnote 7\">&crarr;<\/a><\/li><li id=\"footnote-94-8\">Well, <em>almost<\/em> always: it is possible (though very unlikely) that a sample will just so happen to produce the true population parameter. This will also be a result of random chance, as unlikely as it may be. <a href=\"#return-footnote-94-8\" class=\"return-footnote\" aria-label=\"Return to footnote 8\">&crarr;<\/a><\/li><li id=\"footnote-94-9\">For ease of imagination, I'll stick to \"very long\/large\" from now on, but at the far back of your mind, remember it's actually infinite. <a href=\"#return-footnote-94-9\" class=\"return-footnote\" aria-label=\"Return to footnote 9\">&crarr;<\/a><\/li><li id=\"footnote-94-10\">I provide this definition only to make understanding the sampling distribution easier. It's in no way the technical definition of the sampling distribution. As well, keep in mind that this \"variable\" made of the means is a perfectly imaginary heuristic device. 
<a href=\"#return-footnote-94-10\" class=\"return-footnote\" aria-label=\"Return to footnote 10\">&crarr;<\/a><\/li><li id=\"footnote-94-11\">Compare this to flipping a coin, or throwing a die: as we saw, in both cases the distribution of the <em>possible<\/em> outcomes (over an infinite number of flips\/throws) is a calculated and known probability distribution. After all, that's why we know that the probability of getting tails or heads is 0.5<em> in theory<\/em>, just like it's about 0.167 for throwing any one of the die's six numbers <em>in theory<\/em> (even if calculating actual flipped\/thrown frequencies in real life yields different results). <a href=\"#return-footnote-94-11\" class=\"return-footnote\" aria-label=\"Return to footnote 11\">&crarr;<\/a><\/li><li id=\"footnote-94-12\">It is theoretical because we do not actually take multiple samples, much less an infinite number of them, as there is no need: courtesy of probability theory and the Central Limit Theorem, we just <em>know<\/em> what <em>would<\/em> happen if we did. 
<a href=\"#return-footnote-94-12\" class=\"return-footnote\" aria-label=\"Return to footnote 12\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":533,"menu_order":5,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-94","chapter","type-chapter","status-publish","hentry"],"part":32,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/94","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/users\/533"}],"version-history":[{"count":25,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/94\/revisions"}],"predecessor-version":[{"id":730,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/94\/revisions\/730"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/parts\/32"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/94\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/media?parent=94"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapter-type?post=94"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/contributor?post=94"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/license?post=94"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{re
l}","templated":true}]}}