Chapter 5 The Normal Distribution and Some Basics of Probability

5.1.3 Percentiles

 

Remember quartiles? We used them in Section 4.2 to find the interquartile range (https://pressbooks.bccampus.ca/simplestats/chapter/4-2-interquartile-range/). Quartiles split the cases in the distribution into four equal parts (i.e., into quarters), giving us a first (1 percent to 25 percent of the data), a second (26 percent to 50 percent of the data), a third (51 percent to 75 percent of the data), and a fourth quartile (76 percent to 100 percent of the data).

 

What if, instead of splitting the distribution into four equal parts, we decided to divide it into five? That would be easy: Instead of having four parts, 25 percent of the data in each, we can just have five parts, 20 percent of the data in each. Like this: 1 percent to 20 percent, 21 percent to 40 percent, 41 percent to 60 percent, 61 percent to 80 percent, and 81 percent to 100 percent. This time, we call the five equal parts quintiles (from the Latin root “quin”, as in quintus, meaning fifth).

 

Just as easily, we can divide the distribution into ten equal parts: 1 percent to 10 percent, 11 percent to 20 percent, etc. … all the way up to the last part, 91 percent to 100 percent. Then we have ten deciles (from the Latin root “dec” like decem, meaning ten).

 

Following the same logic down to the smallest whole-number share we can give each part, we get percentiles — a distribution divided into a hundred equal parts, 1 percent of the data in each. It turns out percentiles can be quite useful when working with a normal distribution. (You didn’t forget that’s our current topic, did you?)
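(If you happen to have Python handy, you can see these splits in action. The sketch below uses the standard library’s statistics.quantiles function to find the cut points that divide a small dataset into quartiles, quintiles, and deciles; the scores themselves are made up purely for illustration.)

```python
import statistics

# A small, hypothetical set of test scores (made up for illustration).
scores = [52, 55, 58, 61, 63, 66, 68, 70, 72, 74, 77, 79, 81, 84, 88, 92]

# statistics.quantiles returns the cut points that split the sorted data
# into n equal-sized parts (n - 1 cut points for n parts).
quartiles = statistics.quantiles(scores, n=4)   # 3 cut points -> 4 parts
quintiles = statistics.quantiles(scores, n=5)   # 4 cut points -> 5 parts
deciles = statistics.quantiles(scores, n=10)    # 9 cut points -> 10 parts

print("Quartile cut points:", quartiles)
print("Quintile cut points:", quintiles)
print("Decile cut points:  ", deciles)
```

(Asking for n=100 would give the 99 percentile cut points in the same way, though with only 16 made-up scores the finer splits aren’t very meaningful.)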

 

The key piece of knowledge you need to recall from our discussion about quartiles is that to split the distribution, we need the cases lined up in order from the lowest value to the highest (or else we wouldn’t be able to speak of first, second, third or last quartiles). Applying this to the normal distribution, we might be tempted to imagine the normal curve as illustrated in Fig. 5.5 below.

 

Figure 5.5 What Percentiles Do Not Look Like

 

Fig. 5.5 lists the position of four randomly selected percentiles, had the percentiles been evenly spread over the horizontal axis. Of course, this is wrong. If we did this, we would be ignoring the actual distribution — you know, the blue curve on the graph. After all, we have established by now that 68 percent of observations fall in the middle, within only 1 standard deviation away from the mean, where the curve is at its highest. (Recall that the height of the curve — and the fact that it’s a curve, not a line — reflects the larger frequencies of the values around the mean, and the smaller and smaller frequencies of the values further away from the mean, in the “tails”.)

 

What this should tell you is that we can’t just assume the percentiles are uniformly spread — because the data is not. We need to account for the fact that values in the middle are way more popular than the ones in the “tails”. How, then, do we know what percentile a particular value has?

 

Again, it’s easy. We have z-scores for that. You see, every value has a z-score, and the z-score reflects the percentage of cases that fall below or above that value. This is precisely how we know that 68 percent of the data fall within 1 standard deviation from the mean and that 95 percent fall within about 2 standard deviations from the mean.
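(You don’t have to take these percentages on faith. For readers comfortable with Python, here is a quick check using only the standard library: the math.erf function lets us compute the area under the theoretical normal curve below any z-score.)

```python
from math import erf, sqrt

def normal_cdf(z):
    """Proportion of a standard normal distribution falling below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Area between -1 and +1 standard deviations, and between -2 and +2:
within_1_sd = normal_cdf(1) - normal_cdf(-1)
within_2_sd = normal_cdf(2) - normal_cdf(-2)

print(round(within_1_sd * 100, 1))  # 68.3 percent
print(round(within_2_sd * 100, 1))  # 95.4 percent
```

(The familiar “68 percent” and “95 percent” are just these areas rounded off.)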

 

Thus, with a normal distribution, you can turn any value into a z-score (as we saw in the previous section), and this z-score into a percentile. While there are z-score tables providing percentages associated with any z-value, the easiest way to find a percentile is through online calculators like this one: https://zscorecalculator.org/z-score-to-percentile.html.[1]  There, you can enter a z-score (make sure you choose “one-sided”) and see what percent of data falls below it (on the normal curve on the left) and what percent of data falls above it (on the normal curve on the right). The exact percentile is the number reflecting the data “below”.
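(If you’d rather not rely on an online calculator, the same z-score-to-percentile conversion can be sketched in a few lines of Python using only the standard library; the z-value of 1.5 below is just an arbitrary example.)

```python
from math import erf, sqrt

def percentile_from_z(z):
    """Percent of data expected to fall below a given z-score
    (the 'one-sided' area under the theoretical normal curve)."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

z = 1.5
below = percentile_from_z(z)  # percent of data below this z-score
above = 100 - below           # percent of data above it

print(round(below, 1))  # about 93.3 -> the 93rd percentile, roughly
print(round(above, 1))  # about 6.7
```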

 

Do It! 5.2 Finding Percentiles Using an Online Calculator

 

Using the percentile calculator linked above, you find that the percentile for z=1 is 84. Explain where this result comes from. (Hint: The mean bisects the distribution in two equal halves. A z-score of 1 is of course 1 standard deviation above the mean.)

Answer: The area below the mean is 50 percent. To that we add the 34 percent between the mean and 1 standard deviation above the mean and get 50+34=84 percent. (Since 68 percent lies between -1 and +1 standard deviations and the normal curve is symmetrical, 34 percent fall between -1 standard deviation and the mean, and 34 percent fall between the mean and +1 standard deviation).
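(The same 50 + 34 = 84 arithmetic can be checked in a couple of lines of Python, using the standard library’s math.erf to get areas under the theoretical normal curve.)

```python
from math import erf, sqrt

def normal_cdf(z):
    """Proportion of a standard normal distribution falling below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

half_below_mean = normal_cdf(0)                  # 50 percent below the mean
mean_to_plus_1 = normal_cdf(1) - normal_cdf(0)   # about 34 percent more

print(round(100 * (half_below_mean + mean_to_plus_1)))  # 84
```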

 

 

Cool, you say (probably quite sarcastically), we now know how to find percentiles. But what do we use them for?

 

I’m glad you asked. Percentiles allow us to compare a score in relation to the rest of the data; just like z-scores, they put things into perspective. Let’s say you score 69 on a test. Turning your score into a percentile will tell you exactly what percent of the test-takers scored below you, whether it’s 35 percent (in which case your score wouldn’t be considered too impressive) or 99 percent (which would be most impressive, seeing how you’d be in the top 1 percent of test-takers) or any other percent it might be.[2]

 

Let’s make sure you understand all that, shall we?

 

Do It! 5.3 Hourly Wage

 

Imagine you have applied for a job and your employer offers you \$13.5/hour. You also learn that the average hourly wage your potential employer pays to their employees is \$17.5/hour with a standard deviation of \$2.5/hour. See if this is a generous offer (after all, you would be just starting) by finding its z-score and percentile and comparing it to how the other employees of the company are faring. (Don’t forget to interpret both the percentile and the z-score.)

Answer: z = -1.6, percentile = 5.5. Only 5.5 percent of the employees in the company receive less than \$13.5/hour; almost 95 percent of the employees receive more, so no, it’s not a generous offer at all.
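(Here is how the same answer could be worked out in Python. The normal_cdf helper below is our own standard-library sketch of the area under the theoretical normal curve, not a built-in.)

```python
from math import erf, sqrt

def normal_cdf(z):
    """Proportion of a standard normal distribution falling below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

wage, mean, sd = 13.5, 17.5, 2.5
z = (wage - mean) / sd         # (13.5 - 17.5) / 2.5 = -1.6
percentile = 100 * normal_cdf(z)

print(z)                     # -1.6
print(round(percentile, 1))  # about 5.5
```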

 

 

And now that you might be starting to feel somewhat comfortable with the uses of the normal distribution, I’ll pull the rug a bit from under you, as it were. Recall how I started the chapter by explaining that many real-world interval/ratio variables tend to be approximately normally distributed? (That part’s true.) And then we talked about where the variable’s observations fall in the normal distribution? Well, there I lied. (It was necessary!)

 

If you think about it carefully, the two statements cannot both be true. On the one hand, a real-existing variable has a specific distribution — an approximately normal one. But would two real-existing variables have exactly the same approximately normal distribution? That would be unlikely, considering that different variables, in different datasets, with different numbers of observations, units of measurement, units of analysis, means, standard deviations, etc., cannot possibly look exactly the same if plotted on a histogram. How then do we get these very fixed and very specific numbers and percentages associated with the z-scores and the percentiles?

 

The thing is, everything I told you about the normal distribution, starting with its defining features and ending with the z-scores and percentiles, refers to the ideal-type, only-existing-in-theory, perfect normal distribution. All the numbers and calculations and percentages we discussed reflect the theoretical normal distribution; they serve as a sort of expectation of how a (continuous, random)[3] variable is expected to be distributed. Of course, real-existing variables generally fall short of this ideal, and therefore we call their distributions approximately normal.

 

I repeat: the theoretical (perfect) normal distribution provides us with what we can expect the actual frequencies of the variable’s values to be, in theory. (In reality, the distribution differs from that expectation to varying degrees). It turns out, when we work with z-scores and associated percentages and percentiles, we work with what is expected, not with what is. (The variables’ observed distributions differ but the normal — expected —  distribution is always the same.)

 

What do we do then, with this reality versus expectation we have here? Why did we learn all we did about the normal distribution if “it isn’t real”?[4]

 

This is where probability comes in. Hold the thought about the normal distribution being an expectation; we’ll come back to it in the remaining sections of this chapter.


  1. For that matter, you can use an online calculator to find the z-score of any value. You can try one here: https://zscorecalculator.org.
  2. This is exactly what standardized tests (e.g., SAT) do to interpret individual scores. They provide percentiles so that any test-taker can find how they did relative to others (i.e., it provides the place of a score in the overall distribution of scores).
  3. I explain randomness a bit in the next section, and further in Chapter 6. For now, know that in statistics it doesn't mean "arbitrary" or "accidental" but rather "obtained in an unbiased way" (i.e., with every element having an equal chance to be picked).
  4. That said, again, some standardized tests can be designed in such a way that their test scores are distributed normally. Thus, real-existing data can have a normal distribution; it's just that usually it's an approximation.

License

Simple Stats Tools Copyright © by Mariana Gatzeva. All Rights Reserved.
