Chapter 3 Measures of Central Tendency
3.2 Median
The three measures of central tendency are all measures that tell us where typical cases fall or where cases tend to cluster. After exploring the mode in the previous section, in this section we turn to the second measure of central tendency called the median.
The median lives up to its name: it derives from the Latin root medi, meaning “middle”, and that’s exactly the type of information it provides. Specifically, the median divides the cases of a variable into two equal halves and identifies the case in the middle. As such, it points out the “centre” of the data in a very straightforward way — it simply reports the middle observation.
Consider, however, the following point: even in everyday life, the middle implies a beginning and an end (e.g., “in the middle of the book”); something that is in-between, a gradation from a point A to a point C, as it were. From clothes sizes (“small, medium, large”) to how spicy you like your Thai food (“a little, medium, or hot”), through turning the volume up or down while listening to music (“low, medium, high”), the “centre” category splits whatever it is applied to into smaller/larger, less/more, or left/right parts. That is, to speak of the middle of something we need to know where it starts (e.g., the minimum) and where it ends (e.g., the maximum). Simply put, we need an order.
What all this should tell you is that the median is not applicable to nominal variables. Speaking of the middle of gender, or the middle of ethnicity, or religious affiliation, or hair colour, or degree major, or of the middle of any other nominal variable makes no sense. After all, the order in which the categories of a nominal variable appear is either arbitrary or a matter of preference; nothing precludes rearranging the categories in some other way, so that a case that ends up in the middle of one arrangement would not necessarily be in the middle of another. A case belonging to any category can easily end up being the middle one. A statistic shouldn’t depend on chance or preference; as such, nominal variables have no median.
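To see the rearrangement argument in action, here is a minimal sketch in Python. The five hair colours and both category arrangements are made up for illustration, and `middle_case` is a hypothetical helper name, not something from this chapter.

```python
# Hypothetical data: hair colour of five people (a nominal variable).
cases = ["brown", "black", "blond", "red", "black"]

# Two equally defensible arrangements of the categories (both are assumptions).
alphabetical = ["black", "blond", "brown", "red"]
by_preference = ["red", "brown", "black", "blond"]


def middle_case(values, category_order):
    """Sort the cases by the given category order and return the middle one.

    Only meant for an odd number of cases, as in this illustration.
    """
    ordered = sorted(values, key=category_order.index)
    return ordered[len(ordered) // 2]


middle_alphabetical = middle_case(cases, alphabetical)
middle_by_preference = middle_case(cases, by_preference)

# The "middle" case changes with the arrangement, so it tells us nothing.
print(middle_alphabetical, middle_by_preference)
```

The same five people produce a different “middle” depending on nothing more than how we happened to list the categories, which is why the median is not reported for nominal variables.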
On the other hand, as you know by now, ordinal and interval/ratio variables do have an inherent order arranging their categories/values. They have a “beginning” and an “end”, and therefore a “centre”. As such, the median applies (only) to ordinal and interval/ratio variables.
Note that while the mode applies to a category (reflecting the largest number of cases), the median is determined by the case (observation) that falls in the middle of the listing of all cases ordered by category. Thus it’s not the middle category that is the median; depending on the size of the categories, the median case can belong to any category/value. The median category/value is the one to which the middle case belongs. Presented this way, the explanation undoubtedly sounds as clear as mud, but do not despair. It will get better when we establish the manner in which we obtain the median, so trust me and read on.
Example 3.2 (A) Three Students, Five Students, Eight Students by Year of Study, Counting
N=3
a) Let’s say we have three students at different levels of their studies: one is a first-year, the second one a fourth-year, and the third a third-year. Before we do anything else, we need to establish the correct order. We rearrange the students properly:
(1) a first-year student
(2) a third-year student ← median
(3) a fourth-year student
The case in the middle is Case #2, the second one on the list (as there is one student below and one student above), i.e., the third-year student. Thus we have established that the median category is “third year of study”. That is, half of the students are below the third year of study and half are above (as odd as it sounds when we only have three cases).
N=5
b) What happens if I add two more students to our group, say, a first-year student and a second-year student? The order will go like this:
(1) a first-year student
(2) a first-year student (new)
(3) a second-year student (new) ← median
(4) a third-year student
(5) a fourth-year student
Once again, it’s easy to see that the middle case is Case #3, the third one on the list (as there are two students below and two students above), i.e., the second-year student. This time around the median category is “second year of study”. That is, half of the students are below their second year of study and half are above.
N=8
c) What if I complicate matters further? What if I add three more students to the group, say, two second-years and a third-year? Their order will be:
(1) a first-year student
(2) a first-year student
(3) a second-year student
(4) a second-year student (new) ← the median is between this case
(5) a second-year student (new) ← and this case
(6) a third-year student
(7) a third-year student (new)
(8) a fourth-year student
If you go by the same logic as above, you’ll quickly find that there is no “middle” student: unlike before, there is now an even number of students. The middle of the group actually falls between Case #4 and Case #5, the fourth and the fifth cases on the list (so that four are below and four above it). Since both the fourth and the fifth students are second-years, we can conclude that, again, the median is “second year of study”. Had the fourth and the fifth students been in different years of study, we could say that the median was between their respective categories.
We could continue the same way as in Example 3.2 (A) above for larger groups too: we could arrange the cases in order of their categories/values, find the middle case (or two middle cases) and report its category/value as the median. However, you can guess that this would quickly become impractical the larger the group size gets. We need some other way of finding the median, one that generalizes across groups of any size.
Consider the following formula:
$\frac{N+1}{2}$ = “numbered position of the median case in the ordered list of cases”
where, as usual, N is the group size.
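As a quick check, the formula can be sketched in a few lines of Python. This assumes the formula is (N + 1) / 2, which agrees with the positions 2, 3 and 4.5 obtained for the three groups of students; `median_position` is a hypothetical name chosen for this illustration.

```python
def median_position(n):
    """Return the numbered position of the median case among n ordered cases.

    Assumes the position formula (N + 1) / 2.
    """
    if n < 1:
        raise ValueError("need at least one case")
    return (n + 1) / 2


# The three group sizes used in Example 3.2.
for n in (3, 5, 8):
    print(n, median_position(n))
```

A whole-number result points at a single middle case; a result ending in .5 means the median falls between two neighbouring cases.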
Instead of counting, let’s apply this formula to Example 3.2 (A).
Example 3.2 (B) Three Students, Five Students, Eight Students by Year of Study, Using a Formula
a) N=3
(1) a first-year student
(2) a third-year student
(3) a fourth-year student
According to the formula,
$\frac{N+1}{2} = \frac{3+1}{2} = 2$
That is, the “numbered position of the median case in the ordered list of cases” is equal to 2; the middle case is Case #2, the second one on the list, or like we established before, the third-year student.
b) N=5
(1) a first-year student
(2) a first-year student (new)
(3) a second-year student (new)
(4) a third-year student
(5) a fourth-year student
According to the formula,
$\frac{N+1}{2} = \frac{5+1}{2} = 3$
That is, the “numbered position of the median case in the ordered list of cases” is equal to 3; the middle case is Case #3, the third one on the list, or again, the second-year student.
c) N=8
(1) a first-year student
(2) a first-year student
(3) a second-year student
(4) a second-year student (new)
(5) a second-year student (new)
(6) a third-year student
(7) a third-year student (new)
(8) a fourth-year student
According to the formula,
$\frac{N+1}{2} = \frac{8+1}{2} = 4.5$
That is, the “numbered position of the median case in the ordered list of cases” is equal to 4.5. Considering we have discrete numbers (after all, the cases are individuals), there is no case number 4.5. Instead, we say that the median falls between Case #4 and Case #5, the fourth and fifth cases on the list, or between two second-year students, so it is “second year of study”.
It is easy to see that we could substitute a group of any size for the N in the formula. Even when working with hundreds or thousands of cases, we can always use the formula to find the place (that is, the case number) that bisects the variable’s distribution into two halves.
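For larger groups the same steps can be handed to a computer. The following is a minimal sketch in Python, assuming the position formula (N + 1) / 2; the names `YEAR_ORDER` and `median_cases` are hypothetical, and the data are the eight students from part c) of the example.

```python
import math

# The inherent order of the ordinal categories (lowest to highest).
YEAR_ORDER = ["first-year", "second-year", "third-year", "fourth-year"]


def median_cases(values, order):
    """Return (position, lower case, upper case) for an ordinal variable.

    For an odd N both cases are the same middle case; for an even N they
    are the two cases the median falls between.
    """
    if not values:
        raise ValueError("need at least one case")
    ordered = sorted(values, key=order.index)   # step 1: put cases in order
    position = (len(ordered) + 1) / 2           # step 2: apply the formula
    lower = ordered[math.floor(position) - 1]   # list positions start at 1
    upper = ordered[math.ceil(position) - 1]
    return position, lower, upper


group_of_8 = [
    "first-year", "fourth-year", "third-year", "first-year",
    "second-year", "second-year", "second-year", "third-year",
]
position, lower, upper = median_cases(group_of_8, YEAR_ORDER)
print(position, lower, upper)
```

Because both neighbouring cases belong to the same category here, the median can be reported as that category; when they differ, the sketch simply returns both and leaves the reporting decision to you.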
So far I have only used an ordinal variable to illustrate the median. How does finding the median work for interval/ratio variables? Would it matter that interval/ratio variables have numerical values rather than qualitative categories? No, not in the least. After all, finding the median doesn’t depend on the category or value of any case in any substantive sense, only on its numbered position in the ordered list of cases.
There is, however, something about the way interval/ratio variables look that some people find a tad confusing when working with values rather than categories. To illustrate, I’ll give you another example.
Example 3.3 (A) Median for Number of Siblings, Raw Data
Imagine you talk to seven of your friends and ask them about the number of siblings they have. Let’s say these are the responses you receive: 2, 1, 4, 2, 1, 0, 3. That is, two friends report having two siblings each, two friends report having one sibling each, and the remaining three friends report having four, zero, and three siblings, respectively.
To find the median, the first thing we need to do is put the responses in order:
(1) 0
(2) 1
(3) 1
(4) 2
(5) 2
(6) 3
(7) 4
Whether you visually identify Case #4 as the middle case (three cases below and three cases above it) or use the formula ($\frac{N+1}{2} = \frac{7+1}{2} = 4$) to obtain the same result, it is clear that the median is “two siblings”: half of your friends in this example have fewer than two siblings, and half have two or more siblings.
What might be confusing for some people is differentiating between the numbered positions of the cases on the list and their values, since both are expressed numerically. In this example I have tried to make the two easier to distinguish by putting the numbered positions of the cases in brackets and the values next to them (just like the categories in the ordinal example above). Thus you can see that Case #1 has 0 siblings, Case #2 has 1 sibling, etc. Had I chosen a different set of values (for example, if Case #1 had 1 sibling, Case #2 had 2 siblings, Case #3 had 3 siblings, etc.), you might have found it a bit harder. For that reason, make a mental note to keep clear track of what is a case’s value and what is its numbered position.
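One way to keep positions and values apart is to print them side by side. The sketch below does this in Python for the seven responses above, assuming the position formula (N + 1) / 2 for an odd N; the variable names are illustrative only. Python’s built-in `statistics.median` is used at the end purely as a cross-check.

```python
import statistics

# Raw responses from the seven friends (values, not positions).
siblings = [2, 1, 4, 2, 1, 0, 3]

ordered = sorted(siblings)
median_pos = (len(ordered) + 1) // 2     # N is odd here, so this is a whole number
median_value = ordered[median_pos - 1]   # list positions start at 1, indexes at 0

# Numbered position in brackets, value next to it, as in the listing above.
for pos, value in enumerate(ordered, start=1):
    marker = " <- median" if pos == median_pos else ""
    print(f"({pos}) {value}{marker}")

# Cross-check against the standard library.
library_median = statistics.median(siblings)
```

Note that the median position is 4 while the median value is 2: the first number says where to look in the ordered list, the second is what you find there.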