Chapter 6 Sampling, the Basis of Inference

6.5 The Sampling Distribution

With this section we reach a point where you will have to make good use of your imagination and abstract thinking. Unlike our earlier presentation and discussion of variables, this material does not lend itself to real-life examples, as the sampling distribution lies firmly in the realm of abstract mathematical concepts. Yet we need it, because it is the sampling distribution that makes inference possible and bridges the gap between a sample and the population from which it was taken.

 

Thus, as promised in my introduction to keep everything to the minimum necessary for understanding, below I offer as non-technical and non-mathematical an explanation of what the sampling distribution is and how we use it as possible. However, this course of action has its obvious and inevitable downsides: since we are skipping the actual mathematical proofs and going directly to their results, you will have to take the presentation on my word. This is a hard thing to ask of anyone (“it is what it is because I tell you so”). My justification for doing it is that the vast majority of my students so far seem to find the alternative (“it is what it is because of all this very long presentation of complex mathematical concepts and complicated procedures”) even more unpalatable, without any gains in comprehensibility, and, as such, ultimately mostly useless. (Of course, if interested, you can always check other, more comprehensive books and online sources.)

 

Despite the dire warning about upcoming doom in the form of abstract concepts, I still begin with an example.

 

Example 6.2 Age of Classmates

 

Imagine you are enrolled in a class along with 49 other students, so the total class size is 50. Let’s say that, as a class assignment (perhaps in a research methods class), you are tasked with taking a sample of your class and administering a survey to it. In this sense, your class is your population of interest. For simplicity’s sake, we focus on one possible question, say, the respondent’s age. You want to know the average age of the study population but, instead of asking all 50 people in the class, you draw a random sample of them for the purpose of estimating the class’s average age[1].

 

Now, despite the fact that I still haven’t said anything about sample size (but we’re getting there), I’ll assume that a sample size of 10 (i.e., 20 percent of the population) sounds reasonable enough to you. The random draw (with replacement) yields the following ten classmates’ ages:

 

19, 19, 20, 20, 20, 21, 21, 22, 23, 28

 

Based on these values, the average age of the sample, \overline{x}, is

 

\overline{x}=\frac{\sum\limits_{i=1}^{N}{x_i}}{N}=\frac{2(19)+3(20)+2(21)+22+23+28}{10}=\frac{213}{10}=21.3

 

I.e., your sample’s average age is 21.3 years. Considering that these ten people were randomly drawn, and that they are, well, only ten, can we assume that the average age of your entire class of 50 is 21.3 years?

 

While this is a good — educated even — guess and a good starting point, it is unlikely that, had you polled everyone in the class, your calculation would have produced exactly 21.3. After all, polling 10 people is not the same as polling 50; in the latter case your calculation would include a lot more information than in the former. Thus, it’s also reasonable to expect that there will be some difference between the mean based on the sample, \overline{x}, and the true population mean, μ.

 

Then how about if you decided to draw another random sample of ten people out of your class? Would you expect to have the exact same mean of 21.3 years?

 

Unless you somehow end up with the exact same ten people who were in the first sample (and after Chapter 5 on probability you should know how minuscule that probability is), it is again unlikely you’d get the same mean.

 

We could imagine that the new, second sample’s ages might look like this:

 

18, 19, 19, 19, 20, 20, 22, 22, 24, 25

 

Based on these ten new values, the average age of the second sample (let’s call it \overline{x_2}) is:

 

\overline{x_2}=\frac{\sum\limits_{i=1}^{N}{x_i}}{N}=\frac{18+3(19)+2(20)+2(22)+24+25}{10}=\frac{208}{10}=20.8

 

I.e., your second sample’s average age is 20.8 years, despite it being drawn from the same population. Your two samples (of the same size) yielded two close — but still different — numbers.

 

As well, following the same logic, it is just as unlikely that the population mean μ (your class’s average age) is exactly 20.8 years as it was that it is exactly 21.3 years (the sample is still only 10 people).
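
 

If you would like to double-check the arithmetic of Example 6.2 yourself, the following minimal Python sketch recomputes both sample means from the ages listed above (the choice of Python is mine and entirely incidental to the example):

# The two samples of ages from Example 6.2
sample_1 = [19, 19, 20, 20, 20, 21, 21, 22, 23, 28]
sample_2 = [18, 19, 19, 19, 20, 20, 22, 22, 24, 25]

# The sample mean is simply the sum of the values divided by the sample size
mean_1 = sum(sample_1) / len(sample_1)   # 213 / 10 = 21.3
mean_2 = sum(sample_2) / len(sample_2)   # 208 / 10 = 20.8

print(mean_1, mean_2)   # two close, but different, estimates of the class's average age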

 

 

What then? How can we trust a sample statistic to estimate a population parameter? It appears we need more information. Before we get to that, however, let’s finally address the elephant in the room – the issue of sample size I have been neglecting so far.

 

One reason you might think the sample estimates in Example 6.2 above differ (both from each other and from the true population mean) could be the sample size: isn’t N=10 just too small? The answer is of the yes-but-no variety: no, a sample that is 20 percent of the population is actually quite a big proportion for a typical research study with a relatively large population; yes, a sample of 10 out of a population of 50 is way too small; and, in general, yes, the larger the sample the better. But let’s properly unpack (and qualify) all three of these seemingly contradictory pieces of information.

 

Inferential statistics — at least the typical kind discussed in this textbook — is about estimating relatively large populations; luckily, quantitative social science research most commonly deals with such populations[2].

 

The recommended sample size depends on the size of the population it will be used to estimate, but with diminishing returns: the larger the population, the larger the sample’s absolute size should generally be, yet the gains from a larger sample shrink (toward zero) as the population grows. In other words, smaller populations need proportionately bigger samples to represent them adequately, while larger populations can be represented by samples of smaller (and smaller, and smaller) proportions. (This also means that, for larger and larger populations, there is no gain in increasing the sample size beyond a certain point.)

 

In reality, no one would try estimating the parameters of a population as small as 50, as in most cases they can easily be obtained directly (not to mention that, to have a meaningful estimate of a population that small, one would indeed need to sample almost the entire population). Sample size calculators are abundant and free online[3], but to give you an idea of the diminishing returns to increasing sample size, I’ll just list a few figures. To estimate a population of 200, you’ll typically need a sample of about 180[4]; to estimate a population of 500, you’ll typically need a sample of about 380; to estimate a population of 1,000, a sample of 600 would be adequate; for a population of 2,000, a sample of about 870 would work; for a population of 5,000, a sample of 1,200 would be enough; for a population of 10,000, a sample of about 1,300 would be enough… then for a population of 50,000, a sample of only about 1,500 would suffice, and a population of 100,000 would do just as well with the same number of 1,500[5].
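
 

To give you a sense of where such figures come from, here is a minimal Python sketch of one commonly used approximation (Cochran’s formula with a finite population correction), under assumptions I have chosen for illustration: a 95% confidence level, a ±2.5% margin of error, and maximum variability p = 0.5. Online calculators may make slightly different assumptions, so treat the output as approximate rather than authoritative.

# One common approximation of the required sample size for a given population size.
# Assumptions (mine, for illustration): 95% confidence (z = 1.96),
# margin of error e = 0.025 (i.e., plus or minus 2.5%), and maximum variability p = 0.5.
def required_sample_size(population_size, z=1.96, e=0.025, p=0.5):
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)      # sample size for an "infinite" population
    n = n0 / (1 + (n0 - 1) / population_size)   # finite population correction
    return round(n)

for N in [200, 500, 1000, 2000, 5000, 10000, 50000, 100000]:
    print(N, required_sample_size(N))
# Prints roughly 177, 377, 606, 869, 1176, 1332, 1491 and 1513:
# close to the rounded figures listed above, and clearly flattening out.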

 

What it comes down to is that, to the surprise of many, a sample size of “just” 1,500 respondents can safely and accurately estimate any population of 25,000 or more. This also means that a random sample of 1,500 people can statistically represent, for example, both the population of Toronto (2.7+ million people) and the population of Canada (36.7 million people); however, it cannot be the same sample (the former needs to be drawn from Torontonians only, the latter from all Canadians[6]).

 

Watch Out!! #11… for (Mis)Judging a Study On Its Sample Size

 

The point against judging a study on its sample size alone should be clear already but it bears repeating. When people unfamiliar with statistics encounter social-scientific reports based on studies of what they consider a “too small” sample size, they tend to dismiss the findings. They tend to consider the “only 500 respondents” or “only 1000 cases” too few to accurately represent the population from which they were drawn, especially if the population is, in their view, disproportionately large.

 

As you should have learned by now, the generalizability of a study is more a matter of how the sample is drawn than of its size (beyond a certain point). As long as the chosen sampling method is a type of random sampling, and the sample size is adequate for the population size[7], the results of the study will be generalizable to the population. The actual sample size doesn’t matter for that, even if it may look “too small” to some.

 

In any event, even if it is unnecessary beyond a certain point, as demonstrated above, it is a logical inevitability that the closer the sample is in size to the population from which it is drawn, the smaller the difference between statistics and parameters should be. Even in Example 6.2 above, with its imagined, only-for-illustration-purposes population of 50, getting information from 40 of your classmates instead of the 10 we used in the example should produce an average age that is closer to the true average age of the population (of all 50 students).

 

However, as a corollary, unless we obtain information from truly everyone (i.e., we do a census), in random sampling a difference between the sample statistic and the population parameter will always exist[8]. This difference between the estimate (the statistic) and what is being estimated (the parameter) is called random error. Random error is inevitable: no matter what we do, a sample will always only produce an estimate, never the “real thing”, as it were.

 

Going back to Example 6.2 above, we can extrapolate that, when randomly drawing sample after sample after sample (of the same size) an infinite number of times, and calculating mean after mean after mean, we’ll get a long (well, infinite) list of means, all of which will be somewhat close to, but not exactly equal to, the true population mean. If you can imagine this very long (infinite[9]) list of means as similar to a variable with a large number of observations, please do so; it helps.

 

This variable you imagined (made of the very large number of means that would be produced by the very large number of samples, if we took them) will have a frequency distribution just like any real variable we have discussed so far. The distribution of this variable made of the means is called the sampling distribution of the mean[10]. However, since all of this is theoretical (we do not take more than one sample), the distribution is not really about actual frequencies but rather about expected/relative frequencies, i.e., probabilities. As such, the sampling distribution is a probability distribution: it gives the probability of each possible value of the mean occurring[11].
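
 

To make this thought experiment a little more concrete, here is a minimal Python sketch that mimics it on a small scale. The fifty ages below are an invented, purely hypothetical “class” (not the class from Example 6.2); the code draws many random samples of size 10 with replacement, computes each sample’s mean, and shows how those means cluster around the population mean.

import random
import statistics

random.seed(1)   # fixed seed so this illustration is reproducible

# A hypothetical population of 50 ages, invented purely for illustration.
population = [18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
              20, 21, 21, 21, 21, 21, 22, 22, 22, 22,
              23, 23, 23, 24, 24, 25, 25, 26, 27, 28,
              19, 20, 20, 21, 21, 22, 22, 23, 24, 29,
              18, 19, 20, 20, 21, 22, 23, 25, 26, 30]

mu = statistics.mean(population)   # the true population mean

# Draw many samples of size 10 (with replacement) and record each sample's mean.
sample_means = []
for _ in range(10000):   # "very large", standing in for "infinite"
    sample = random.choices(population, k=10)
    sample_means.append(statistics.mean(sample))

print("population mean (mu):", round(mu, 2))
print("mean of the sample means:", round(statistics.mean(sample_means), 2))
print("spread (std. dev.) of the sample means:", round(statistics.stdev(sample_means), 2))
# The sample means pile up close to mu: an empirical glimpse of the
# sampling distribution of the mean described above.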

 

In more precise phrasing, all statistics based on samples (e.g., means, medians, standard deviations, and many others we haven’t yet encountered) have a sampling distribution, which refers to their theoretical[12] variability over repeated (to infinity) random samples of a specific (and equal) size. What we know about the sampling distribution of sample statistics is summarized in the Central Limit Theorem, next.

 


  1. Of course, with a population of only 50, in real life you can just collect the information from everyone. I'm using a small-size example for teaching purposes only and to make the calculations manageable. The principle of sampling applies not only to a population of 50 but to populations of any size -- and when your study population's size is in the millions, you wouldn't attempt to survey all of them (barring the already discussed case of censuses).
  2. There is no magic number as to what constitutes a relatively large population, and therefore as to what constitutes an adequate minimum sample size. For the latter, I could offer 100; some suggest 30, others 50, but in truth all of these are more or less arbitrary. It is a fact that having a larger sample (in both the absolute and the proportionate sense) puts you on safer ground in terms of statistical inference (this has to do with probability theory, the law of large numbers, the sampling distribution, the normal curve, and the Central Limit Theorem discussed below, for which to work a minimum of, say, N=30 is a frequently cited number). What you can take from this is that it's better to avoid dealing with N<30 (or even N<100 if possible), as the tools and methods discussed in this textbook are better suited to larger sample (and population) sizes.
  3. You can find one example at SurveyMonkey.com (https://www.surveymonkey.com/mp/sample-size-calculator/?ut_source=help_center).
  4. Here and below, "typically" refers to a frequently used margin of error of ±2.5%; more on what this actually means in Section 6.6 below.
  5. You may also find a table summarizing sample sizes, such as this one, useful: https://www.research-advisors.com/tools/SampleSize.htm.
  6. In truth, researchers do want larger samples to represent Canada (or other countries' populations) but that's only to increase the power (defined later) of their statistical findings, not their generalizability. This desire for larger N is, of course, constrained by limited resources (time, money, etc.).
  7. At the desired -- and reported -- margin of error.
  8. Well, almost always: it is possible (though very unlikely) that a sample will just so happen to produce the true population parameter. This will also be a result of random chance, as unlikely as it may be.
  9. For ease of imagination, I'll stick to "very long/large" from now on, but at the far back of your mind, remember it's actually infinite.
  10. I provide this definition only to make understanding the sampling distribution easier. It's in no way the technical definition of the sampling distribution. As well, keep in mind that this "variable" made of the means is a perfectly imaginary heuristic device.
  11. Compare this to flipping a coin, or throwing a die: as we saw, in both cases the distribution of the possible outcomes (over infinite number of flips/throws) is a calculated and known probability distribution. After all, that's why we know that the probability of getting tails or heads is 0.5 in theory, just like it's 0.167 for throwing any of the die's six numbers in theory (even if calculating actual flipped/thrown frequencies in real life yields different results).
  12. It is theoretical because we do not actually take multiple samples, much less an infinite number of them, as there is no need: courtesy of probability theory and the Central Limit Theorem, we just know what would happen if we did.

License

Simple Stats Tools Copyright © by Mariana Gatzeva. All Rights Reserved.
