8 Hypothesis Testing

8.3 Hypothesis Testing

The first thing you should know about testing hypotheses is their relationship to statistical inference: We formulate hypotheses about the population of interest, and only about the population of interest. We test them through sample data.

 

Like so: imagine I have read enough on the topic of the gender gap that I hypothesize that women and men receive different incomes on average. I explore my sample data and find that, in the sample with which I’m working, men have a higher average income than women. It seems like there is an association between gender and income; however, I do not know if there is an association between gender and income in the population in general. To that effect, I want to estimate (with a given level of certainty) whether such an association exists in the population. My hypothesis is about the gender/income association in the population. (After all, I can see the different average income levels in the sample; there is no need to hypothesize about the sample.)

 

You may be getting tired of my italicizing “the population” but it really is that important: hypotheses are stated about the population. This is key for testing, so keep it in mind.

 

If the test provides us with evidence in support of our alternative hypothesis, we call the association being tested statistically significant[1].

 

Before we get to the nitty-gritty details of hypothesis testing, here’s an overview to show you the underlying logic of how it all works (a short code sketch follows the list):

  1. State the null and the alternative hypotheses;
  2. Assuming the null hypothesis is “true”, calculate the related score (e.g., a z-value or t-value);
  3. Find the probability associated with that score, i.e., the probability of obtaining a score at least that extreme if the null hypothesis were indeed “true” (called the p-value);
  4. If that probability is low enough (i.e., below the level of significance, explained below), reject the null hypothesis; if the probability is too high (above the level of significance), fail to reject the null hypothesis;
  5. If the null hypothesis has been rejected, you have found support for the alternative hypothesis: your bivariate association is statistically significant, and therefore generalizable to the population;
  6. If the null hypothesis has not been rejected, you have found no support for the alternative hypothesis: your bivariate association is not statistically significant, and is perhaps due to expected sampling variability (i.e., to random error) appearing in this one particular sample.
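To make these steps concrete, here is a minimal sketch in Python (my own illustration, not part of the text; the numbers and the significance level are placeholder assumptions) of how the procedure plays out for a one-sample z-test:

```python
# A minimal sketch of the six steps above for a one-sample z-test.
# All numbers below are placeholder assumptions, not data from the text.
from scipy.stats import norm

# Step 1: H0: the population mean equals mu; Ha: it does not.
mu, sigma = 50, 10        # hypothesized population mean and standard deviation
x_bar, n = 52, 64         # observed sample mean and sample size
alpha = 0.05              # level of significance (explained later in the chapter)

# Step 2: assuming H0 is "true", calculate the z-value.
se = sigma / n ** 0.5     # standard error of the mean: 10 / 8 = 1.25
z = (x_bar - mu) / se     # here: (52 - 50) / 1.25 = 1.6

# Step 3: find the probability associated with that score (two-tailed p-value).
p_value = 2 * norm.sf(abs(z))   # here: about 0.11

# Steps 4-6: compare the p-value to the level of significance and decide.
if p_value < alpha:
    print("Reject H0: the association is statistically significant.")
else:
    print("Fail to reject H0: no support for Ha.")
```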

 

Example 8.2 below illustrates the whole process in detail. As with applying the Central Limit Theorem to confidence intervals, it is easier to start with an example where we assume we know the population parameters. Once you grasp the underlying logic, we can move on to properly testing bivariate associations. (The example below is for heuristic purposes only.)

 

Example 8.2 (A) Employee Productivity (Finding Statistically Significant Results, N=100)

 

Imagine a large company has created a productivity index to measure its employees’ productivity. The (interval scale) index is constructed to be normally distributed, with a mean of 600 points and a standard deviation of 100 points.

 

Imagine further that a hundred of the company’s employees were randomly selected to attend a new specialized training course, after which their average productivity score was measured as 650 points. (To simplify things, we’ll also assume that their standard deviation is the same as the general group of employees.) Can we conclude that the training course indeed increased productivity? Or is the gain of 50 points due to regular sampling variability? That is, is this 50-point gain statistically significant?

 

Here’s what we have, formally stated:

 

\mu=600

\sigma=100

\overline{x}=650

N=100

 

What we want to know is the probability of a score of 650 if the training course didn’t contribute to the gain, i.e., the probability of a score of 650 under the condition of the null hypothesis.

  • H0: The training course did not affect productivity (the 650 score was due to random chance); \mu_\overline{x} =\mu. The true/population mean of the trained employees is the same as that of the untrained employees.
  • Ha: The training course affected productivity (the 650 score was a true gain); \mu_\overline{x} \neq\mu. The true/population mean of the trained employees is not the same as the population mean of the untrained employees.

Recall from Chapter 5 and Chapter 6 that to obtain the probability of a score we need to express it in terms of standard deviations (i.e., here in standard errors, as we are working with a sampling distribution).

 

The standard error is:

 

\sigma_\overline{x} =\frac{\sigma}{\sqrt{N}}=\frac{100}{\sqrt{100}}=\frac{100}{10}=10

 

Then the z-value of 650 is:

 

z=\frac{\overline{x}-\mu}{\sigma_\overline{x}} =\frac{650-600}{10}=\frac{50}{10}=5

 

That is, the trained group’s mean of 650 is five standard errors above the ‘general’ (untrained employees’) mean of 600. Considering that we know a sample mean will fall within 3 standard errors of the population mean more than 99 percent of the time, the probability for the trained group’s sample mean to be 5 standard errors above the mean of everyone else is extremely small (smaller than 0.5 percent, as explained below). Given the properties of the normal curve, we know that 68 percent of all means in infinite sampling will fall within ±1 standard error (i.e., between 590 and 610), 95 percent will fall within ±1.96 standard errors (i.e., approximately between 580 and 620), and 99 percent will fall within ±2.58 standard errors (i.e., approximately between 570 and 630). A score of 650, which is 5 standard errors above the mean, would indeed fall very, very far in the right tail.

 

In terms of probabilities, consider the following: if a sample mean has a 99 percent probability of being approximately between 570 and 630, and the remaining 1 percent is distributed equally between the two tails, the probability beyond 630 is 0.5 percent. Assuming the null hypothesis were true (i.e., training had no effect, the true/population mean of the trained is 600, and we see the 650 by chance rather than as a real increase), our calculations show that the 650 score appears with a probability of p<0.005[2], a probability so small that a score of 650 seems highly unusual.
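If you want the exact figure rather than the 0.5 percent bound, a quick check (my own sketch, not part of the text, assuming Python with scipy is available) shows just how far in the tail a z of 5 lies:

```python
# A quick check of Example 8.2 (A)'s numbers with scipy.
from scipy.stats import norm

mu, sigma, n, x_bar = 600, 100, 100, 650

se = sigma / n ** 0.5    # standard error: 100 / 10 = 10.0
z = (x_bar - mu) / se    # (650 - 600) / 10 = 5.0

p_beyond = norm.sf(z)    # P(Z > 5), about 0.00000029
print(se, z, p_beyond)   # far smaller than the 0.005 bound used in the text
```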

 

And this is where the crux of the logic of hypothesis testing lies: the chance of the 100 employees getting an average productivity score of 650 after a training course if the course had no effect (i.e., if their population mean is indistinguishable from the general/untrained mean) is so small that it is highly unlikely to be the case. It is much likelier that the course had an effect, so that the trained employees’ population mean is not the same as the untrained ones’: \mu_\overline{x} \neq\mu (and in fact \mu_\overline{x} >\mu). The null hypothesis is thus not supported.

 

We therefore reject the null hypothesis and conclude that the score of 650 does not appear to be due to random variability alone (otherwise it would be within 3 standard errors of the untrained employees’ mean, whereas under the null hypothesis it stands at 5). Rather, it is statistically significantly different from 600. In other words, our evidence suggests that the training course may have affected the productivity score of the employees who took it. (Again, causality aside, note that we have not proven beyond a shadow of a doubt that it did; rather, given our evidence at this point in time, we have reason to believe it did.)

 

 

In the example above we ended up rejecting the null hypothesis. I will also show how it can turn out that we cannot reject the null hypothesis, but first I will use the opportunity to 1) make a connection to a concept with which you are already familiar, confidence intervals (below); and 2) introduce two interrelated and important theoretical concepts, the level of significance and the p-value (in the next section).

 

Believe it or not, hypothesis testing and confidence intervals are complementary: testing a hypothesis and constructing a confidence interval lead us to the same conclusion. To see this, we just need to construct a, say, 95% confidence interval for \mu_\overline{x} from Example 8.2 (A) above:

  • 95% CI: \overline{x}\pm1.96\times\sigma_\overline{x} = 650\pm1.96\times10=650\pm19.6=(630.4; 669.6)

 

That is, we can be 95% certain that the average score for the population of employees who take the training course would be between approximately 630 points and 670 points. The average general score of 600 points is not part of the plausible values for \mu_\overline{x}, which is consistent with our decision to reject the null hypothesis.
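For completeness, the same interval can be checked in a short sketch (again my own illustration, assuming scipy is available):

```python
# A minimal sketch reproducing the 95% CI from Example 8.2 (A).
from scipy.stats import norm

x_bar, sigma, n = 650, 100, 100
se = sigma / n ** 0.5          # standard error: 10.0
z_crit = norm.ppf(0.975)       # ~1.96 for a 95% confidence level

lower, upper = x_bar - z_crit * se, x_bar + z_crit * se
print(round(lower, 1), round(upper, 1))   # ~ (630.4, 669.6)
# 600 lies outside the interval, consistent with rejecting H0.
```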

 

 


  1. Statistical significance has a very narrow, very specific meaning, as you will learn further in this section. On the difference between statistical significance and significance in general, see the warning in the Watch Out!! #15 box in the next section.
  2. The p here stands for "probability".
