Chapter 6 Sampling, the Basis of Inference

6.3 Random Sampling

To use what we know about probability distributions and the normal curve in the service of inference (how exactly we do that comes later in the chapter), we need to know the probability that each element of the population will be selected. The problem is that estimating these probabilities anew for every study can be far too burdensome, if not outright impossible. Consider the following example.

 

Example 6.1 Mode of Transportation of Students

 

Imagine that you are interested in what mode of transportation the students in your university usually take to campus. You decide that a sample of N=100 sounds reasonable. Imagine further that you don’t know anything about sampling (or logic) so you decide to go to the nearest bus stop to your campus and talk to the first hundred students that happen to come by once you’re there.

 

Arguably, if you did that, you could expect close to 100 percent of your sample to choose bus as their usual mode of transportation to school; after all, you have talked only to students waiting at a bus stop. True, it’s possible that some of your respondents were taking the bus only at that particular time (their car might have broken down, or they didn’t feel like driving that day, etc.), but this is hardly likely to be the case for more than a few of the selected hundred.

 

So far, what you could learn from your study is that some hundred (or close to it) students in your university happen to usually take the bus to school. In and of itself, there is nothing wrong with that. The question, however, is whether you can use this information to conclude that bus is the usual form of transportation for students in your university in general. To paraphrase in the language of research: is the information regarding usual modes of transportation gathered by you from a hundred students at a bus stop generalizable to your institution’s student body as a whole?

 

Even going by logic alone, you should easily be able to see that the answer is no, of course not. After all, you only talked to students at a bus stop who were there specifically to take the bus, at a specific time, on a specific day. What about the students who went directly to the parking lot to take their cars, or those who went to retrieve their bikes from the bike racks, or who simply walked home? Then what about students who had no classes on the day that you went to the bus stop? Or the students who were in class at the time you were interviewing your subjects? Or the students in your institution whose classes were at a different campus and who never came to the one you happened to be at?

 

In short, your method of collecting information has produced a biased sample: some elements of the population (students who happened to be taking the bus at the time of your survey) had a higher chance of participating in your study than others (everyone else). The sample is biased toward bus-takers: those you talked to had something like a 100 percent chance of being in the study (and they were); other bus-takers who weren’t there had a smaller but still nonzero chance; and those who never take the bus had a 0 percent chance of being in your study.

 

What’s more, not only do the selection probabilities differ from student to student, but calculating the exact probability for every element in every new study and accounting for the differences would be a fool’s errand: in the general case it would be as infeasible (or outright impossible) as collecting information on the entire population under study.

 

The takeaway from Example 6.1 is that in statistics we want elements to have easily known (to make calculations easy) and equal (so as to not produce bias) probabilities of being selected. Fortunately for us, random sampling (also called probability sampling) provides both: when elements are chosen at random, each has the same probability of being chosen, and that shared probability is easy to work out.

 

Recall that in a coin flip the probability of getting heads is the same as the probability of getting tails, and they are both \frac{1}{2}, one outcome out of two possible outcomes, or 0.5. The probabilities of throwing a die and getting a one, or a two, or a three, or a four, or a five, or a six are all equal, and known: \frac{1}{6}, one outcome out of six possible outcomes, or approximately 0.167. Similarly, the probability of selecting one person at random out of a group of thirty-five people is the same for all thirty-five people, and equal to \frac{1}{35}, or approximately 0.029.

 

Throwing dice, flipping coins, and selecting at random are all random (chance) events: there is no bias in them, as the probability of any outcome is the same as that of any other outcome, and is easily calculable as one out of the total number of possible outcomes.

 

If we apply the same logic to sampling, we can see that the only thing we need is to make sure that our selection is random and that it applies to all elements in a population of a particular known size: then the probability of selecting any given element in a single draw will always be one out of the total number of elements, i.e., the total study population size.
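
To make this concrete, here is a minimal Python sketch (my own illustration, not a procedure you need to follow). It assumes a made-up population of thirty-five people labeled 0 to 34, draws one of them at random many times over, and checks how often each person comes up. The number of draws and the seed are arbitrary choices of mine.

```python
import random
from collections import Counter

# Hypothetical population: thirty-five people, labeled 0..34.
population = list(range(35))

# A fixed seed (arbitrary) so that the run is repeatable.
rng = random.Random(42)

# Draw a single person at random, many times over, and count who comes up.
num_draws = 100_000
counts = Counter(rng.choice(population) for _ in range(num_draws))

# Under random selection every person should come up about 1/35 of the time.
expected = 1 / len(population)
observed = {person: counts[person] / num_draws for person in population}
largest_gap = max(abs(share - expected) for share in observed.values())

print(f"expected probability per person: {expected:.4f}")
print(f"largest deviation across all 35 people: {largest_gap:.4f}")
```

If the selection is truly random, every observed share should sit close to one out of thirty-five, and the largest deviation should shrink as the number of draws grows.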

 

When this condition (equal probability of selection for all elements) is met and we know that probability, we know the probability distribution that governs the selection. We can thus use probability theory and its theorems and postulates, which provide mathematical proof that a random (i.e., unbiased) sample faithfully reflects and represents the population from which it was drawn. Then and only then is whatever we learn from the sample generalizable to the population. (Of course, it’s not that simple; there is more to it, like sample size, but I’ll leave this for later when we get to the Central Limit Theorem.)

 

So what would have been the best way to get a representative answer to the question regarding usual modes of transportation for students in your institution? Theoretically, you could have obtained a list of all students from the registrar, selected your hundred at random from the full list, and contacted only the persons selected. Their responses would indicate the most popular mode of student transportation and now, with random sampling, they would reflect the entire student body.
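
The contrast between the bus-stop approach and the registrar-list approach can be sketched in a few lines of Python. Everything here is an assumption of mine for illustration: the student body of 10,000, the shares of each mode of transportation, and the simplification that everyone met at the bus stop is a usual bus-taker.

```python
import random

rng = random.Random(7)  # arbitrary seed, for repeatability

# Hypothetical student body of 10,000 with made-up shares per usual mode.
modes = ["bus"] * 3000 + ["car"] * 4000 + ["bike"] * 1500 + ["walk"] * 1500
rng.shuffle(modes)
students = list(enumerate(modes))  # pairs of (student_id, usual_mode)


def share_of_bus(sample):
    """Proportion of a sample whose usual mode is the bus."""
    return sum(1 for _, mode in sample if mode == "bus") / len(sample)


# Bus-stop (non-random) sample: the first hundred bus-takers encountered.
bus_stop_sample = [s for s in students if s[1] == "bus"][:100]

# Random sample: a hundred students drawn at random from the full list.
random_sample = rng.sample(students, 100)

true_share = share_of_bus(students)
print(f"true share of bus-takers:    {true_share:.2f}")
print(f"bus-stop sample estimate:    {share_of_bus(bus_stop_sample):.2f}")
print(f"random sample estimate:      {share_of_bus(random_sample):.2f}")
```

In this sketch the bus-stop sample reports that every student takes the bus, whatever the true share is, while the random sample's estimate should fall reasonably close to the true share (how close depends on the sample size, which we will get to later).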

 

In practice things are more complicated: How exactly do you choose at random any desired number of elements from a list of all elements?[1] How do you even obtain a list of all elements in the first place? Even if we had one, do we put every element’s name/number in a hat and pull them out one by one?

 

While providing details on how random sampling is done in real life is also outside the scope of this text, I can assure you several such methods exist (though pulling names out of a hat isn’t one of them). For a comprehensive treatment, again, I encourage you to consult a research methods textbook; for my purposes here I will just list the major ones.

 

Simple random sampling is the closest that you can get to the pulling-names-out-of-a-hat proposition; in this day and age, however, it is usually done with computers using random number generators. The same goes for systematic random sampling (where the selection starts at a random starting point and proceeds at a fixed interval). Then there are also stratified random sampling (the population is first divided into strata based on similar characteristics of the elements, not unlike in quota sampling, but then the selection from each stratum is random) and cluster random sampling (the population is divided into clusters, think sub-groups, and then clusters are selected at random); the latter can even be done in several stages (called multistage cluster random sampling).
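
For the curious, the four methods just listed can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not how any particular survey organization does it: the sampling frame of 1,000 students, the `year` and `dorm` fields, and the function names are all hypothetical, and the cluster version keeps every element of each selected cluster (a single stage only).

```python
import random

rng = random.Random(2024)  # arbitrary seed, for repeatability

# Hypothetical sampling frame: 1,000 students, each with a year of study
# (1 to 4) and a dorm number (20 dorms of 50 students each).
frame = [{"id": i, "year": i % 4 + 1, "dorm": i // 50} for i in range(1000)]


def simple_random(frame, n):
    """Draw n elements at random, without replacement."""
    return rng.sample(frame, n)


def systematic(frame, n):
    """Start at a random point, then take every step-th element."""
    step = len(frame) // n
    start = rng.randrange(step)
    return frame[start::step][:n]


def stratified(frame, n_per_stratum, key):
    """Split the frame into strata by `key`, then sample randomly in each."""
    strata = {}
    for element in frame:
        strata.setdefault(element[key], []).append(element)
    return [e for group in strata.values()
            for e in rng.sample(group, n_per_stratum)]


def cluster(frame, n_clusters, key):
    """Pick whole clusters at random and keep every element in them."""
    cluster_ids = sorted({e[key] for e in frame})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [e for e in frame if e[key] in chosen]


print(len(simple_random(frame, 100)))     # 100 students from the whole list
print(len(systematic(frame, 100)))        # every 10th student
print(len(stratified(frame, 25, "year"))) # 25 students from each year
print(len(cluster(frame, 2, "dorm")))     # everyone in 2 randomly chosen dorms
```

Note that in each case the computer's random number generator does the job of the proverbial hat.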

 

To conclude: ultimately, the important thing to learn here is not how the sampling is done empirically but the key difference between non-random and random sampling. Non-random/non-probability sampling methods select elements arbitrarily at researchers’ discretion, with unknown and unequal probabilities of elements to be selected; this, in turn, precludes the use of probability theory and therefore allows for only assumed (but unprovable) generalizability of the samples produced in this way.

 

On the other hand, random/probability sampling methods, in selecting elements at random, ensure that elements have equal (and therefore known) probability to be chosen; this random selection allows for the use of probability theory, the normal curve, and everything that is already mathematically proven regarding features of random variables and their probability distributions. Probability theory demonstrates that randomly selected samples (of sufficient size) are representative of and generalizable to the population from which they were drawn. Therefore, conducting a census of all elements under study becomes unnecessary as long as we are able to draw a random sample (of sufficient size) of the population.

 

At this point (if you are still awake), you have probably noticed that I am asking you to accept on my word, with little proof, that random samples are representative of their populations. While I will not go about proving this mathematically (and you’ll be happier for it), I will provide the theorem on which my claims are based soon enough. First, however, we still have a few other things to cover, and the logic of inference is next.

 


  1. The comprehensive list of all elements in a population is called a sampling frame. Note that in practice some sampling frames might not include all the elements they purport to have. For example, using the phone book as a sampling frame for a population is a frequently used method, yet we know that some people have unlisted numbers (or, possibly, do not have a phone), so they are not listed in the phone book. Thus there is a difference between the population and the sampling frame for it: the sampling frame is an approximation of, but not quite a list of, the entire population.

License

Simple Stats Tools Copyright © by Mariana Gatzeva. All Rights Reserved.
