{"id":1601,"date":"2019-08-13T17:34:51","date_gmt":"2019-08-13T21:34:51","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/simplestats\/?post_type=chapter&#038;p=1601"},"modified":"2019-08-13T19:03:36","modified_gmt":"2019-08-13T23:03:36","slug":"3-6-outliers","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/simplestats\/chapter\/3-6-outliers\/","title":{"raw":"3.6 Outliers","rendered":"3.6 Outliers"},"content":{"raw":"[latexpage]\r\n\r\nOut of the three measures of central tendency, the mean is the only one that takes into account the actual numerical values of the cases. As such, it is easily affected by the size of the values: a sequence of numbers such as \"1, 5, 7, 10, 15\" will produce a smaller mean than a sequence of numbers like \"100, 80, 75, 130, 90\".\r\n\r\n&nbsp;\r\n\r\nWhen all values to be averaged are of relatively comparable magnitude, the mean does a good job of reflecting the central tendency of a variable -- that is why it is the most familiar and widely used measure. However, <strong>when a variable contains an extremely small or an extremely large value (or several values) compared to the rest of the values, the mean is easily distorted<\/strong> and stops reflecting the central tendency \"truthfully\", as it were. <strong>Extremely small and extremely large values are called statistical\u00a0<em>outliers<\/em>.<\/strong>\r\n\r\n&nbsp;\r\n\r\nWhile there is a convenient method for identifying outliers (using a concept\u00a0<span style=\"font-size: 14pt;text-indent: 18.6667px\">called\u00a0<\/span><em style=\"font-size: 14pt;text-indent: 18.6667px\">interquartile range\u00a0<\/em>which\u00a0<span style=\"text-indent: 1em;font-size: 14pt\">we will discuss in the next chapter<\/span><span style=\"text-indent: 1em;font-size: 14pt\">), at this stage it is not necessary that you be so technical. 
You can visually identify outliers, albeit less precisely, by the \"disturbance\" in the general pattern of the data you observe. For example, if you have values like \"1, 5, 7, 10, 15\", a value of 130 in that sequence would be considered an outlier. Similarly, if you have values like \"100, 80, 75, 130, 90\", a value of 5 would be an outlier.<\/span>\r\n\r\n&nbsp;\r\n\r\nLet's calculate the means of the two sequences, first without and then with the so-called outliers, and see what happens.\r\n\r\n&nbsp;\r\n\r\nThe first sequence is 1, 5, 7, 10, 15 and we want to see what happens when we add 130.\r\n\r\n&nbsp;\r\n\r\n$$\\frac{(1+5+7+10+15)}{5}=\\frac{38}{5}=7.6$$\r\n\r\n&nbsp;\r\n\r\nWe add 130 to the sequence:\r\n\r\n&nbsp;\r\n\r\n$$\\frac{(1+5+7+10+15+130)}{6}=\\frac{168}{6}=28$$\r\n\r\n&nbsp;\r\n\r\nBoth means, 7.6 and 28, are the true averages of the sequences of values as listed. However, the addition of an uncommonly large number \"pulled\" the mean away from the \"centre\" of the original data.\r\n\r\n&nbsp;\r\n\r\nHow truthfully does 28 represent the \"centre\" of a sequence where the majority of the cases' values (in fact, five out of the six values) are 15 and below? Not very well.[footnote]If you believe it's not the magnitude of the value but just its addition that causes the \"pulling\" of the mean, consider redoing the example with adding 18, instead of 130. Then we have $\\frac{(1+5+7+10+15+18)}{6}=\\frac{56}{6}=9.3$. The \"pull\" from 7.6 to 9.3 is much smaller than from 7.6 to 28. 
The value 9.3 reflects the central tendency of the data more truthfully than 28 does.[\/footnote]\r\n\r\n&nbsp;\r\n\r\nTo demonstrate the effect of an extremely small value, we continue with the second sequence:\r\n\r\n&nbsp;\r\n\r\n$$\\frac{(100+80+75+130+90)}{5}=\\frac{475}{5}=95$$\r\n\r\n&nbsp;\r\n\r\nAdding a value of 5 to the sequence produces the following:\r\n\r\n&nbsp;\r\n\r\n$$\\frac{(100+80+75+130+90+5)}{6}=\\frac{480}{6}=80$$\r\n\r\n&nbsp;\r\n\r\nAs with the first sequence, the mean here gets \"pulled\", but in the opposite direction, from 95 to 80. Both means are technically true averages of their respective values but the latter one is \"artificially\" low: after all, four out of the six values are the same or higher.[footnote]Again, if we added a value of a comparable size to this sequence instead of 5, the mean would not be impacted as much: $\\frac{(100+80+75+130+90+70)}{6}=\\frac{545}{6}=90.8.$ Consider the \"pull\" from 95 to 80 vs. from 95 to 90.8.[\/footnote]\r\n\r\n&nbsp;\r\n\r\nWhat this tells you is that <strong>the mean is an unstable measure of central tendency, prone to being affected by outliers.<\/strong> Contrast this with what you know about the median: the median does not take the magnitude of the values into consideration, beyond their order. Thus, as explained in the previous Section 3.3 (https:\/\/pressbooks.bccampus.ca\/simplestats\/chapter\/3-3-the-median-with-frequency-tables\/), adding a value (be it extremely small or extremely large) to a sequence does not affect the median much -- unlike the mean. 
The median of 1, 5, 7, 10, 15 is 7 (there are two values above and two below it), and whether we add 130 or 18, it doesn't matter: it's just an additional value in the sequence.[footnote]The median of 1, 5, 7, 10, 15, 18 is between 7 and 10, i.e., 8.5 (since we need the half-way distance between 7 and 10, we use the average of 7 and 10, that is 7+10=17 and divide it by 2 to get 8.5).\u00a0 The median of 1, 5, 7, 10, 15, 130 is exactly the same -- it is still half-way between the two middle values, 7 and 10, or again 8.5.\u00a0[\/footnote]\r\n\r\n&nbsp;\r\n\r\nSince the mean is prone to being affected by outliers, while the median is not,<strong> in some situations it is advisable to report the median as a more \"valid\" measure of the typical cases\/\"centre\" of the data rather than the mean.<\/strong>\u00a0Specifically, watch out for reports on average income, average age, average weight, etc. where a few outliers can <em>skew<\/em> a variable's distribution.\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--learning-objectives\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch Out!! #8<\/strong><\/span> ... for Reports on Averages of Variables Prone to Skewing by Outliers<\/em><\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n&nbsp;\r\n\r\nImagine a small company advertising an open position by claiming that the average salary of their employees is 100 thousand dollars per year. 
For simplicity's sake, let's assume the company has ten employees and these are their salaries:\r\n\r\n<\/div>\r\n<em>Table 3.8 Employee Salaries (Hypothetical Data)\u00a0<\/em>\r\n<div class=\"textbox__content\">\r\n<table style=\"border-collapse: collapse;width: 100%\" border=\"0\">\r\n<tbody>\r\n<tr>\r\n<td style=\"width: 50%;text-align: left\"><strong>Value (in thousands)<\/strong><\/td>\r\n<td style=\"width: 50%;text-align: left\"><strong>Frequency<\/strong><\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 50%\">70<\/td>\r\n<td style=\"width: 50%\">5<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 50%\">87.5<\/td>\r\n<td style=\"width: 50%\">4<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 50%\">300<\/td>\r\n<td style=\"width: 50%\">1<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 50%\"><strong>TOTAL<\/strong><\/td>\r\n<td style=\"width: 50%\"><strong>10<\/strong><\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nYou can check for yourself what the average annual salary is:\r\n\r\n&nbsp;\r\n\r\n$$\\frac{\\sum\\limits_{i=1}^{N}{x_i}}{N}=\\frac{70(5)+87.5(4)+300(1)}{10}=\\frac{350+350+300}{10}= \\frac{1000}{10}=100$$\r\n\r\n&nbsp;\r\n\r\nor, indeed, 100 thousand dollars. However, how representative is this annual salary of the regular employee? After all, nine out of ten employees of the company get less than that. The average annual salary reported is inflated by the very high salary of one employee (perhaps the manager), a clear outlier.\r\n\r\n&nbsp;\r\n\r\nLet's instead look at the median. We start by arranging the values in order:\r\n\r\n&nbsp;\r\n\r\n70, 70, 70, 70, 70, 87.5, 87.5, 87.5, 87.5, 300\r\n\r\n&nbsp;\r\n\r\nUsing the formula for finding the position of the median, we have\r\n\r\n&nbsp;\r\n\r\n$$\\frac{(N+1)}{2}=\\frac{(10+1)}{2}=\\frac{11}{2}=5.5$$\r\n\r\n&nbsp;\r\n\r\nThat is, the median falls between the fifth and the sixth values in the order, i.e., between 70 and 87.5. 
The halfway point between these two values is found by averaging them:\r\n\r\n&nbsp;\r\n\r\n$$\\frac{(70+87.5)}{2}=\\frac{157.5}{2}=78.75$$\r\n\r\n&nbsp;\r\n\r\nwhich shows us that the median annual salary of the employees in that company is \\$78,750. This is a lot less than the touted average of \\$100,000 and a lot more reflective of what nine out of ten employees receive.\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\n<\/div>\r\n&nbsp;\r\n\r\nExamples like the <em>Watch Out!! #8<\/em> above show that relying on the mean can be tricky, and in some cases can be deliberately used to \"lie with statistics\" (i.e., a report might be technically correct but at the same time very misleading).\u00a0<span style=\"text-indent: 18.6667px;font-size: 14pt\">Thus, <strong>generally reporting all three central tendency measures is the way to go<\/strong> and you, as a beginner researcher, should do that.<\/span>\r\n\r\n&nbsp;\r\n\r\nFinally, you can observe a skew in the data even visually by looking at an interval\/ratio variable's graphical representation, i.e., its histogram. Extremely high values tend to \"pull\" the mean to the right of the \"centre\", i.e., with the majority of cases being relatively smaller, the few high values will produce a \"tail\" on the right side of the distribution (a.k.a. <em>positive skew<\/em>). On the other hand, extremely low values tend to \"pull\" the mean to the left of the \"centre\", i.e., with the majority of cases being relatively larger, the few low values will produce a \"tail\" on the left side of the distribution (a.k.a. <em>negative skew<\/em>).\r\n\r\n&nbsp;\r\n\r\n<span style=\"text-indent: 1em;font-size: 14pt\">As well, since the median indicates the \"centre\" of the data better, a mean smaller than the median would typically indicate a negative\/left skew, while a mean larger than the median would typically indicate a positive\/right skew. 
When you observe a skew in the data, the median would typically be the preferred measure of central tendency.<\/span>\r\n\r\n&nbsp;\r\n\r\nObserve the positive skew in Fig. 3.1 below.\r\n\r\n&nbsp;\r\n\r\n<em>Figure 3.1 Number of Cigarettes Smoked Per Day by Occasional Smokers (CCHS 15\/16)<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs1.png\" alt=\"\" width=\"462\" height=\"370\" class=\"alignnone wp-image-1610 size-full\" \/>\r\n\r\nThe numbers on the horizontal axis reach as high as 100, even though there appears to be nothing there, because there is at least one outlier case -- a respondent who said they were an occasional smoker but reported smoking 99 cigarettes per day.[footnote]Whether this is to be believed is not important here, just the fact that such a value exists in the data. You will learn what is to be done about outliers in statistical analysis in Chapter 4.[\/footnote] Thus the distribution has a long right-side \"tail\", as it were, which you can see better in Fig. 3.2, the \"zoomed-in\" version of the histogram above. (The \"tail\" is what you will have if you trace an imaginary line through the tops of all the bars in the histogram down to the single case of 99 cigarettes per day.)\r\n\r\n&nbsp;\r\n\r\n<em>Figure 3.2<\/em>\u00a0<em>Number of Cigarettes Smoked Per Day by Occasional Smokers (CCHS 15\/16), Zoomed<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs-zoomed.png\" alt=\"\" width=\"462\" height=\"370\" class=\"alignnone wp-image-1609 size-full\" \/>\r\n\r\nIn this case the median is 3 cigarettes smoked per day by an occasional smoker. 
The mean is 4.33, and as expected, it is larger than the median.\r\n\r\n&nbsp;\r\n\r\nSimilarly, an exceptionally small value compared to the bulk of the cases will produce a negatively-skewed histogram where the distribution has a \"tail\" but on the left of where most cases are. In that case the mean will be smaller than the median.","rendered":"<p>Out of the three measures of central tendency, the mean is the only one that takes into account the actual numerical values of the cases. As such, it is easily affected by the size of the values: a sequence of numbers such as &#8220;1, 5, 7, 10, 15&#8221; will produce a smaller mean than a sequence of numbers like &#8220;100, 80, 75, 130, 90&#8221;.<\/p>\n<p>&nbsp;<\/p>\n<p>When all values to be averaged are of relatively comparable magnitude, the mean does a good job of reflecting the central tendency of a variable &#8212; that is why it is the most familiar and widely used measure. However, <strong>when a variable contains an extremely small or an extremely large value (or several values) compared to the rest of the values, the mean is easily distorted<\/strong> and stops reflecting the central tendency &#8220;truthfully&#8221;, as it were. <strong>Extremely small and extremely large values are called statistical\u00a0<em>outliers<\/em>.<\/strong><\/p>\n<p>&nbsp;<\/p>\n<p>While there is a convenient method for identifying outliers (using a concept\u00a0<span style=\"font-size: 14pt;text-indent: 18.6667px\">called\u00a0<\/span><em style=\"font-size: 14pt;text-indent: 18.6667px\">interquartile range\u00a0<\/em>which\u00a0<span style=\"text-indent: 1em;font-size: 14pt\">we will discuss in the next chapter<\/span><span style=\"text-indent: 1em;font-size: 14pt\">), at this stage it is not necessary that you be so technical. You can visually identify outliers, albeit less precisely, by the &#8220;disturbance&#8221; in the general pattern of the data you observe. 
For example, if you have values like &#8220;1, 5, 7, 10, 15&#8221;, a value of 130 in that sequence would be considered an outlier. Similarly, if you have values like &#8220;100, 80, 75, 130, 90&#8221;, a value of 5 would be an outlier.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>Let&#8217;s calculate the means of the two sequences, first without and then with the so-called outliers, and see what happens.<\/p>\n<p>&nbsp;<\/p>\n<p>The first sequence is 1, 5, 7, 10, 15 and we want to see what happens when we add 130.<\/p>\n<p>&nbsp;<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-226c4c0f742079d284a4f5304bc561d1_l3.png\" height=\"38\" width=\"256\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#43;&#53;&#43;&#55;&#43;&#49;&#48;&#43;&#49;&#53;&#41;&#125;&#123;&#53;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#51;&#56;&#125;&#123;&#53;&#125;&#61;&#55;&#46;&#54;&#92;&#093;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>We add 130 to the sequence:<\/p>\n<p>&nbsp;<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-b4dbba500f0c060d45027a11a6078d10_l3.png\" height=\"38\" width=\"308\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#43;&#53;&#43;&#55;&#43;&#49;&#48;&#43;&#49;&#53;&#43;&#49;&#51;&#48;&#41;&#125;&#123;&#54;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#54;&#56;&#125;&#123;&#54;&#125;&#61;&#50;&#56;&#92;&#093;\" 
title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Both means, 7.6 and 28, are the true averages of the sequences of values as listed. However, the addition of an uncommonly large number &#8220;pulled&#8221; the mean away from the &#8220;centre&#8221; of the original data.<\/p>\n<p>&nbsp;<\/p>\n<p>How truthfully does 28 represent the &#8220;centre&#8221; of a sequence where the majority of the cases&#8217; values (in fact, five out of the six values) are 15 and below? Not very well.<a class=\"footnote\" title=\"If you believe it's not the magnitude of the value but just its addition that causes the &quot;pulling&quot; of the mean, consider redoing the example with adding 18, instead of 130. Then we have . The &quot;pull&quot; from 7.6 to 9.3 is much smaller than from 7.6 to 28. The value 9.3 reflects the central tendency of the data more truthfully than 28 does.\" id=\"return-footnote-1601-1\" href=\"#footnote-1601-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>To demonstrate the effect of an extremely small value, we continue with the second sequence:<\/p>\n<p>&nbsp;<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-443516d1b781dca6651080fb3273a696_l3.png\" height=\"38\" width=\"303\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#48;&#48;&#43;&#56;&#48;&#43;&#55;&#53;&#43;&#49;&#51;&#48;&#43;&#57;&#48;&#41;&#125;&#123;&#53;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#52;&#55;&#53;&#125;&#123;&#53;&#125;&#61;&#57;&#53;&#92;&#093;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Adding a value of 5 to the sequence produces the following:<\/p>\n<p>&nbsp;<\/p>\n<p 
class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-755dc260527019358c957e64e00a8936_l3.png\" height=\"38\" width=\"335\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#48;&#48;&#43;&#56;&#48;&#43;&#55;&#53;&#43;&#49;&#51;&#48;&#43;&#57;&#48;&#43;&#53;&#41;&#125;&#123;&#54;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#52;&#56;&#48;&#125;&#123;&#54;&#125;&#61;&#56;&#48;&#92;&#093;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>As with the first sequence, the mean here gets &#8220;pulled&#8221;, but in the opposite direction, from 95 to 80. Both means are technically true averages of their respective values but the latter one is &#8220;artificially&#8221; low: after all, four out of the six values are the same or higher.<a class=\"footnote\" title=\"Again, if we added a value of a comparable size to this sequence instead of 5, the mean would not be impacted as much:  Consider the &quot;pull&quot; from 95 to 80 vs. from 95 to 90.8.\" id=\"return-footnote-1601-2\" href=\"#footnote-1601-2\" aria-label=\"Footnote 2\"><sup class=\"footnote\">[2]<\/sup><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>What this tells you is that <strong>the mean is an unstable measure of central tendency, prone to being affected by outliers.<\/strong> Contrast this with what you know about the median: the median does not take the magnitude of the values into consideration, beyond their order. 
Thus, as explained in the previous Section 3.3 (https:\/\/pressbooks.bccampus.ca\/simplestats\/chapter\/3-3-the-median-with-frequency-tables\/), adding a value (be it extremely small or extremely large) to a sequence does not affect the median much &#8212; unlike the mean. The median of 1, 5, 7, 10, 15 is 7 (there are two values above and two below it), and whether we add 130 or 18, it doesn&#8217;t matter: it&#8217;s just an additional value in the sequence.<a class=\"footnote\" title=\"The median of 1, 5, 7, 10, 15, 18 is between 7 and 10, i.e., 8.5 (since we need the half-way distance between 7 and 10, we use the average of 7 and 10, that is 7+10=17 and divide it by 2 to get 8.5).\u00a0 The median of 1, 5, 7, 10, 15, 130 is exactly the same -- it is still half-way between the two middle values, 7 and 10, or again 8.5.\u00a0\" id=\"return-footnote-1601-3\" href=\"#footnote-1601-3\" aria-label=\"Footnote 3\"><sup class=\"footnote\">[3]<\/sup><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Since the mean is prone to being affected by outliers, while the median is not,<strong> in some situations it is advisable to report the median as a more &#8220;valid&#8221; measure of the typical cases\/&#8221;centre&#8221; of the data rather than the mean.<\/strong>\u00a0Specifically, watch out for reports on average income, average age, average weight, etc. where a few outliers can <em>skew<\/em> a variable&#8217;s distribution.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--learning-objectives\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch Out!! #8<\/strong><\/span> &#8230; for Reports on Averages of Variables Prone to Skewing by Outliers<\/em><\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>&nbsp;<\/p>\n<p>Imagine a small company advertising an open position by claiming that the average salary of their employees is 100 thousand dollars per year. 
For simplicity&#8217;s sake, let&#8217;s assume the company has ten employees and these are their salaries:<\/p>\n<\/div>\n<p><em>Table 3.8 Employee Salaries (Hypothetical Data)\u00a0<\/em><\/p>\n<div class=\"textbox__content\">\n<table style=\"border-collapse: collapse;width: 100%\">\n<tbody>\n<tr>\n<td style=\"width: 50%;text-align: left\"><strong>Value (in thousands)<\/strong><\/td>\n<td style=\"width: 50%;text-align: left\"><strong>Frequency<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%\">70<\/td>\n<td style=\"width: 50%\">5<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%\">87.5<\/td>\n<td style=\"width: 50%\">4<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%\">300<\/td>\n<td style=\"width: 50%\">1<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%\"><strong>TOTAL<\/strong><\/td>\n<td style=\"width: 50%\"><strong>10<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>You can check for yourself what the average annual salary is:<\/p>\n<p>&nbsp;<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 63px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-f2f27c58b9bb76f63af8334702ea65ee_l3.png\" height=\"63\" width=\"522\" class=\"ql-img-displayed-equation quicklatex-auto-format\" 
alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#92;&#115;&#117;&#109;&#92;&#108;&#105;&#109;&#105;&#116;&#115;&#95;&#123;&#105;&#61;&#49;&#125;&#94;&#123;&#78;&#125;&#123;&#120;&#95;&#105;&#125;&#125;&#123;&#78;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#55;&#48;&#40;&#53;&#41;&#43;&#56;&#55;&#46;&#53;&#40;&#52;&#41;&#43;&#51;&#48;&#48;&#40;&#49;&#41;&#125;&#123;&#49;&#48;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#51;&#53;&#48;&#43;&#51;&#53;&#48;&#43;&#51;&#48;&#48;&#125;&#123;&#49;&#48;&#125;&#61;&#32;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#48;&#48;&#48;&#125;&#123;&#49;&#48;&#125;&#61;&#49;&#48;&#48;&#92;&#093;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>or, indeed, 100 thousand dollars. However, how representative is this annual salary of the regular employee? After all, nine out of ten employees of the company get less than that. The average annual salary reported is inflated by the very high salary of one employee (perhaps the manager), a clear outlier.<\/p>\n<p>&nbsp;<\/p>\n<p>Let&#8217;s instead look at the median. 
We start by arranging the values in order:<\/p>\n<p>&nbsp;<\/p>\n<p>70, 70, 70, 70, 70, 87.5, 87.5, 87.5, 87.5, 300<\/p>\n<p>&nbsp;<\/p>\n<p>Using the formula for finding the position of the median, we have<\/p>\n<p>&nbsp;<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-8181f1b05a93520f17b4fb14b621d567_l3.png\" height=\"38\" width=\"243\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#78;&#43;&#49;&#41;&#125;&#123;&#50;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#48;&#43;&#49;&#41;&#125;&#123;&#50;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#49;&#125;&#123;&#50;&#125;&#61;&#53;&#46;&#53;&#92;&#093;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>That is, the median falls between the fifth and the sixth values in the order, i.e., between 70 and 87.5. 
The halfway point between these two values is found by averaging them:<\/p>\n<p>&nbsp;<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-3e1b7097bb8cf3561d514b91c3ea55de_l3.png\" height=\"38\" width=\"218\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#55;&#48;&#43;&#56;&#55;&#46;&#53;&#41;&#125;&#123;&#50;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#53;&#55;&#46;&#53;&#125;&#123;&#50;&#125;&#61;&#55;&#56;&#46;&#55;&#53;&#92;&#093;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>which shows us that the median annual salary of the employees in that company is &#36;78,750. This is a lot less than the touted average of &#36;100,000 and a lot more reflective of what nine out of ten employees receive.<\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Examples like the <em>Watch Out!! #8<\/em> above show that relying on the mean can be tricky, and in some cases can be deliberately used to &#8220;lie with statistics&#8221; (i.e., a report might be technically correct but at the same time very misleading).\u00a0<span style=\"text-indent: 18.6667px;font-size: 14pt\">Thus, <strong>generally reporting all three central tendency measures is the way to go<\/strong> and you, as a beginner researcher, should do that.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>Finally, you can observe a skew in the data even visually by looking at an interval\/ratio variable&#8217;s graphical representation, i.e., its histogram. 
Extremely high values tend to &#8220;pull&#8221; the mean to the right of the &#8220;centre&#8221;, i.e., with the majority of cases being relatively smaller, the few high values will produce a &#8220;tail&#8221; on the right side of the distribution (a.k.a. <em>positive skew<\/em>). On the other hand, extremely low values tend to &#8220;pull&#8221; the mean to the left of the &#8220;centre&#8221;, i.e., with the majority of cases being relatively larger, the few low values will produce a &#8220;tail&#8221; on the left side of the distribution (a.k.a. <em>negative skew<\/em>).<\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"text-indent: 1em;font-size: 14pt\">As well, since the median indicates the &#8220;centre&#8221; of the data better, a mean smaller than the median would typically indicate a negative\/left skew, while a mean larger than the median would typically indicate a positive\/right skew. When you observe a skew in the data, the median would typically be the preferred measure of central tendency.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>Observe the positive skew in Fig. 
3.1 below.<\/p>\n<p>&nbsp;<\/p>\n<p><em>Figure 3.1 Number of Cigarettes Smoked Per Day by Occasional Smokers (CCHS 15\/16)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs1.png\" alt=\"\" width=\"462\" height=\"370\" class=\"alignnone wp-image-1610 size-full\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs1.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs1-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs1-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs1-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs1-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>The numbers on the horizontal axis reach as high as 100, even though there appears to be nothing there, because there is at least one outlier case &#8212; a respondent who said they were an occasional smoker but reported smoking 99 cigarettes per day.<a class=\"footnote\" title=\"Whether this is to be believed is not important here, just the fact that such a value exists in the data. You will learn what is to be done about outliers in statistical analysis in Chapter 4.\" id=\"return-footnote-1601-4\" href=\"#footnote-1601-4\" aria-label=\"Footnote 4\"><sup class=\"footnote\">[4]<\/sup><\/a> Thus the distribution has a long right-side &#8220;tail&#8221;, as it were, which you can see better in Fig. 3.2, the &#8220;zoomed-in&#8221; version of the histogram above. 
(The &#8220;tail&#8221; is what you will have if you trace an imaginary line through the tops of all the bars in the histogram down to the single case of 99 cigarettes per day.)<\/p>\n<p>&nbsp;<\/p>\n<p><em>Figure 3.2<\/em>\u00a0<em>Number of Cigarettes Smoked Per Day by Occasional Smokers (CCHS 15\/16), Zoomed<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs-zoomed.png\" alt=\"\" width=\"462\" height=\"370\" class=\"alignnone wp-image-1609 size-full\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs-zoomed.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs-zoomed-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs-zoomed-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs-zoomed-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/08\/right-skew-number-cigarettes-cchs-zoomed-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>In this case the median is 3 cigarettes smoked per day by an occasional smoker. The mean is 4.33, and as expected, it is larger than the median.<\/p>\n<p>&nbsp;<\/p>\n<p>Similarly, an exceptionally small value compared to the bulk of the cases will produce a negatively-skewed histogram where the distribution has a &#8220;tail&#8221; but on the left of where most cases are. 
In that case the mean will be smaller than the median.<\/p>\n<hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-1601-1\">If you believe it's not the magnitude of the value but just its addition that causes the \"pulling\" of the mean, consider redoing the example with adding 18, instead of 130. Then we have <img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-e851ca93b6d8a696680c1335b40abc67_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#43;&#53;&#43;&#55;&#43;&#49;&#48;&#43;&#49;&#53;&#43;&#49;&#56;&#41;&#125;&#123;&#54;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#53;&#54;&#125;&#123;&#54;&#125;&#61;&#57;&#46;&#51;\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"216\" style=\"vertical-align: -6px;\" \/>. The \"pull\" from 7.6 to 9.3 is much smaller than from 7.6 to 28. The value 9.3 reflects the central tendency of the data more truthfully than 28 does. <a href=\"#return-footnote-1601-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><li id=\"footnote-1601-2\">Again, if we added a value of a comparable size to this sequence instead of 5, the mean would not be impacted as much: <img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/ql-cache\/quicklatex.com-868077d1a641aa3af1714aab65943231_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#49;&#48;&#48;&#43;&#56;&#48;&#43;&#55;&#53;&#43;&#49;&#51;&#48;&#43;&#57;&#48;&#43;&#55;&#48;&#41;&#125;&#123;&#54;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#53;&#52;&#53;&#125;&#123;&#54;&#125;&#61;&#57;&#48;&#46;&#56;&#46;\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"270\" style=\"vertical-align: -6px;\" \/> Consider the \"pull\" from 95 to 80 vs. from 95 to 90.8. 
<a href=\"#return-footnote-1601-2\" class=\"return-footnote\" aria-label=\"Return to footnote 2\">&crarr;<\/a><\/li><li id=\"footnote-1601-3\">The median of 1, 5, 7, 10, 15, 18 is between 7 and 10, i.e., 8.5 (since we need the half-way distance between 7 and 10, we use the average of 7 and 10, that is 7+10=17 and divide it by 2 to get 8.5).\u00a0 The median of 1, 5, 7, 10, 15, 130 is exactly the same -- it is still half-way between the two middle values, 7 and 10, or again 8.5.\u00a0 <a href=\"#return-footnote-1601-3\" class=\"return-footnote\" aria-label=\"Return to footnote 3\">&crarr;<\/a><\/li><li id=\"footnote-1601-4\">Whether this is to be believed is not important here, just the fact that such a value exists in the data. You will learn what is to be done about outliers in statistical analysis in Chapter 4. <a href=\"#return-footnote-1601-4\" class=\"return-footnote\" aria-label=\"Return to footnote 4\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":533,"menu_order":6,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-1601","chapter","type-chapter","status-publish","hentry"],"part":24,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/1601","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/users\/533"}],"version-history":[{"count":10,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/1601\/revisions"}],"predecessor-version":[{"id":1616,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/1601\
/revisions\/1616"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/parts\/24"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/1601\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/media?parent=1601"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapter-type?post=1601"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/contributor?post=1601"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/license?post=1601"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}