Chapter 9 Testing Associations I: Difference of Means, F-test, and χ2 Test

9.3 Between Two Discrete Variables: The χ2, Part 1

As in the previous section, here you need to recall how we examine potential association between two variables both treated as discrete (Section 7.2.2, https://pressbooks.bccampus.ca/simplestats/chapter/7-2-2-between-two-discrete-variables/). We described such associations through contingency tables, reporting differences of proportions as appropriate.

 

We can start with the simplest, binary case: when the discrete variables have two groups each. Then we compare the groups of interest (categories of one variable) on one of the categories of the other variable. (The example we used in Chapter 7 compared the percentage of first-year students who like the campus cafeteria to the percentage of second-year students who do.)

 

The t-test for testing difference of two proportions. When we have only two proportions (or percentages) to compare, we can actually use the same t-test we used for testing differences of means, again treating the difference as a single, normally distributed statistic. Since we have categorical variables, however, and no standard deviations/variances, we resort to measuring population variability by π(1-π) and sample variability by p(1-p)[1]. (See Section 6.7.2, https://pressbooks.bccampus.ca/simplestats/chapter/6-7-2-confidence-intervals-for-proportions/.) We can thus simply substitute that into the formula for z:

 

z=\frac{(p_1 -p_2)-(\pi_1 -\pi_2 )}{\sqrt{\frac{\pi_1(1-\pi_1)}{N_1}+\frac{\pi_2(1-\pi_2)}{N_2}}}

 

where, of course, under the null hypothesis (\pi_1 -\pi_2 )=0. Then, using the sample proportions leaves us with t:

 

t=\frac{(p_1 -p_2)}{\sqrt{\frac{p_1(1-p_1)}{N_1}+\frac{p_2(1-p_2)}{N_2}}}

 

Again, under the null hypothesis the two groups’ proportions are assumed to be the same, so we can pool them into a single overall sample proportion p, which effectively leaves us with:

 

t=\frac{(p_1 -p_2)}{\sqrt{p(1-p)(\frac{1}{N_1}+\frac{1}{N_2})}}
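Under these assumptions, the pooled test statistic is easy to compute directly. Here is a minimal Python sketch (the function name and signature are our own, not from any particular library):

```python
import math

def pooled_two_prop_t(p1, p2, n1, n2):
    """t for a difference of two proportions, using the pooled
    sample proportion p under the null hypothesis pi_1 = pi_2."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)              # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error under H0
    return (p1 - p2) / se
```

Note that when the two sample proportions are equal, the numerator (and hence the statistic) is exactly zero, as we would expect under the null hypothesis.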

 

Let’s revisit the cafeteria-preferences example from Section 7.2.2 to see how the t-test for testing difference of proportions works.

 

Example 9.3 Do You Like the Campus Cafeteria? (A t-Test)

 

In Chapter 7 we imagined that you asked 35 students in your class[2] whether they liked the campus cafeteria: 12 of your classmates said yes (i.e., 34.3 percent), 7 (out of 15) first-years and 5 (out of 20) second-years (46.7 percent of all first-years and 25 percent of all second-years, respectively).

 

We want to know whether the difference in proportions observed in the sample (0.467-0.25=0.217) is statistically significant: can it be generalized to a larger student population, or is it due to regular sampling variability?

  • H0: The proportion of first year students who like the cafeteria is the same as the proportion of second year students who do; \pi_1=\pi_2.
  • Ha: The proportion of first year students who like the cafeteria is different than the proportion of second year students who do; \pi_1\neq\pi_2.

Substituting these numbers in the formula we have:

 

t=\frac{(p_1 -p_2)}{\sqrt{p(1-p)(\frac{1}{N_1}+\frac{1}{N_2})}}=\frac{0.467-0.25}{\sqrt{0.343(1-0.343)(\frac{1}{15}+\frac{1}{20})}}=\frac{0.217}{0.162}=1.34

 

With a t=1.34, df=33 (i.e., N1+N2-2), and p=0.189 (i.e., p>0.05), we fail to reject the null hypothesis: at this point we do not have enough evidence to conclude there is a difference between the proportions of first- and second-year students who like the campus cafeteria. The 21.7-percentage-point difference is not statistically significant; it has a high enough probability of being due to random chance.
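To check the arithmetic, the example can be reproduced in a few self-contained lines of Python. For the p-value we use the large-sample normal approximation available in the standard library, which gives roughly 0.18 here, close to the exact t-based value of 0.189 reported above (a convenience, not the textbook’s method):

```python
import math
from statistics import NormalDist

p1, n1 = 7 / 15, 15   # first-years who like the cafeteria
p2, n2 = 5 / 20, 20   # second-years who like the cafeteria
p = (7 + 5) / (15 + 20)                          # pooled proportion, 12/35
se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error under H0
t = (p1 - p2) / se                               # about 1.34
p_value = 2 * (1 - NormalDist().cdf(t))          # two-tailed, normal approx.
```

Since the (approximate) p-value exceeds 0.05, the code reaches the same conclusion as the hand calculation: fail to reject the null hypothesis.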

 

We can check this with a confidence interval too:

  • 95% CI: (p_1 -p_2)\pm1.96\times\sqrt{\frac{p_1(1-p_1)}{N_1}+\frac{p_2(1-p_2)}{N_2}}=0.217\pm1.96\times\sqrt{\frac{0.467(0.533)}{15}+\frac{0.25(0.75)}{20}}=0.217\pm0.316=(-0.099; 0.533)

In other words, the difference between the proportion of first years and the proportion of second years who like the cafeteria could be anywhere between -9.9 percentage points and 53.3 percentage points with 95% confidence (or 19 out of 20 such samples will have a difference within this pretty large interval). The difference can be in favour of second years or in favour of the first years (notice the negative lower bound); it can even be 0. Thus, since a difference of 0 (i.e., no difference) is a plausible value, we cannot reject the null hypothesis. We conclude that we do not have enough evidence of an association between year of study and opinion on the campus cafeteria.
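The interval itself is just as easy to reproduce. The following Python sketch mirrors the calculation above (note that the confidence interval uses the unpooled standard error, since it does not assume the null hypothesis):

```python
import math

p1, n1 = 7 / 15, 15
p2, n2 = 5 / 20, 20
# unpooled standard error of the difference, for the CI
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se  # roughly (-0.099, 0.533)
```

Because the interval straddles zero, it leads to the same substantive conclusion as the t-test.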

 

Admittedly, the formulas look scary but if you have followed through the example above, you have seen by now the actual calculation is quite simple. You can try it out and see for yourself.

 

Do It! 9.2 Vegetarianism/Veganism among Canadian and International Students

 

Imagine you are interested in exploring whether there is a difference between Canadian and international students in your university when it comes to dietary preferences like vegetarianism and veganism. With your institution’s registrar’s assistance, you take a random sample of 100 students and poll them on 1) whether they are a Canadian or an international student, and 2) whether they are vegetarian/vegan or not.

 

You find that you have 70 Canadian and 30 international students in your sample. Out of the Canadian students, 15 (or 21.4 percent) are vegetarian or vegan; out of the international students 5 (or 16.7 percent) have such dietary restrictions.

 

Check whether the difference in proportions observed in the sample is generalizable to the larger student population by testing the hypothesis that dietary preferences are associated with country of origin. Create a 95% confidence interval for that difference, and substantively interpret what you have found with both the t-test and the confidence interval.

 

Useful hint 1: Among the 100, there are 20 vegan/vegetarian students in total.

Useful hint 2: You can find the p-value of your t-statistic here: https://www.socscistatistics.com/pvalues/tdistribution.aspx.

 

Of course, discrete variables do not have to be binary: they can have more than two categories each. Just as the association between a continuous and a discrete variable discussed in the previous section required an F-test once the discrete variable was non-binary, there is a different test for the association between any two discrete variables, regardless of their respective number of categories (i.e., not just binary ones).

 

The χ2-test for testing associations between discrete variables. The χ2-test[3] (or Pearson’s χ2-test) is based on a comparison between the observed and the expected cell values in a contingency table.

 

The observed values are the cell counts you see in a contingency table given a specific dataset. The expected values, on the other hand, are the counts we would expect to see if there were no pattern/association in the data. In other words, the test effectively compares the sample to a null-hypothesis-like hypothetical distribution of the observations across the cells. Thus, logically, if there is a relatively large difference between the observed and the expected values, we can take that as evidence against the null hypothesis and reject it. If, however, the difference between observed and expected values is relatively small, the evidence against the null hypothesis will be insufficient and we would fail to reject it.

 

The actual way the χ2 is calculated is this:

 

    \[\chi^2=\Sigma\frac{(f_o -f_e)^2}{f_e}\]

 

where fo is the observed frequency (count) and fe is the expected frequency count of a given cell.

 

The formula looks more complicated than it is (don’t they always?) — it only asks us to calculate the difference between the observed and the expected count for each cell, square it, and divide it by the expected count. Once we have done this for all cells, we need only add the resulting numbers together to get the χ2.
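That recipe translates almost word for word into code. A minimal Python sketch (the function name is our own, for illustration):

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
```

The observed and expected counts are passed cell by cell, in the same order; when the two lists agree exactly, the statistic is zero, reflecting no departure at all from the expected pattern.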

 

Considering that the χ2 is then a sum of as many numbers as there are cells, the larger the table (i.e., the more rows and columns there are), the bigger the resulting χ2 will tend to be. To account for that, the χ2 too has degrees of freedom, where df=(rows-1)(columns-1). The χ2-statistic follows a χ2 distribution, which likewise provides a p-value given specific df.

 

The hypothesis testing then follows the same steps as the t-test and the F-test: obtain χ2-value with specific df, find its associated p-value, and finally compare the p-value to the pre-selected significance level. If p<α, reject the null hypothesis.

 

To demonstrate, we will first do a one-way χ2 calculation, i.e., based on the frequency distribution of just one variable. (Of course, if tabulated, this would not be considered a contingency table but a frequency table.)

 

Example 9.4 Do You Like The Campus Cafeteria? (Univariate χ2-Test)

 

To use the imaginary data from before, we had 12 people who admitted liking the campus cafeteria food out of the 35 polled. (Since we are interested only in one of the variables, here we ignore whether the students who like the cafeteria are first- or second-years.) As such, we have the following table:

 

Table 9.1  Approval of the Campus Cafeteria, Observed Count (Univariate)  

Yes 12
No 23
Total 35

 

If you did not know anything about the campus cafeteria and had no observations about it whatsoever — i.e., had you been an impartial observer, as it were — wouldn’t you expect to see an approximately 50/50 split of the 35 students into the two categories? After all, there are only two groups, and an unbiased (random) distribution would be exactly like everyone flipping a coin as a manner of deciding in which group they end up. Thus, the expected count here is simply N divided by the number of groups/categories (denoted by k):

 

f_e=\frac{N}{k}=\frac{35}{2}=17.5

 

Table 9.2 adds the expected count in brackets next to the observed count.

 

 

Table 9.2 Approval of the Campus Cafeteria, Observed and Expected Count (Univariate)

Yes 12    (17.5)
No 23    (17.5)
Total 35

 

Then, according to the formula, this is what we have for each of the two groups:

  • Yes-group: \frac{(f_o-f_e)^2}{f_e}=\frac{(12-17.5)^2}{17.5}=\frac{30.25}{17.5}=1.73
  • No-group: \frac{(f_o-f_e)^2}{f_e}=\frac{(23-17.5)^2}{17.5}=\frac{30.25}{17.5}=1.73

Finally, to get the χ2 we only need to add these two numbers together:

 

\chi^2=\Sigma\frac{(f_o -f_e)^2}{f_e}= \frac{(12-17.5)^2}{17.5}+\frac{(23-17.5)^2}{17.5}=1.73+1.73=3.46

 

The degrees of freedom in a one-way χ2-test is k-1, where k is the number of categories/groups. In this case we have k=2, so df=1.

 

With a χ2=3.46, df=1, and a p=0.06[4] (i.e., p>0.05), we fail to reject the null hypothesis. At this time, we do not have enough evidence to conclude that the observed distribution of the students is unusual enough to suggest a pattern different from random variation around a 50/50 split. As such, this distribution is not statistically significant — we cannot conclude that the students lean one way or the other in their opinion about the campus cafeteria.
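As a check, this univariate example can be reproduced with the Python standard library. For df=1 (and only for df=1), the χ2 p-value equals the two-tailed normal p-value of √χ2, which lets us avoid a χ2 table; this is a convenient special case, not a general method:

```python
import math
from statistics import NormalDist

observed = [12, 23]                        # yes / no counts
k, n = len(observed), sum(observed)
expected = [n / k] * k                     # 17.5 in each cell
chi2 = sum((fo - fe) ** 2 / fe
           for fo, fe in zip(observed, expected))
# valid for df = 1 only: chi2 with 1 df is the square of a standard normal
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
```

The code confirms χ2 of about 3.46 and a p-value of about 0.06, matching the conclusion above.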

 

 

Calculating a two-way χ2 — by far the more often used version, as it tests associations between two variables — is just as easy, even if it involves calculating more numbers (since in the bivariate case we have more cells; four at the minimum, given a 2×2 cross-tabulation). The next section is devoted to that.

 


  1. Do not forget that p here stands for proportion, not probability/p-value.
  2. Note that this of course is not a random sample; we are using it here only for illustrating how hypothesis testing works so we are effectively pretending it is random. In a real-life study, you should not use non-probability samples for statistical inference.
  3. This is the lower-case Greek letter chi, χ. It is pronounced [KHAI], but since it is transliterated as chi, many people incorrectly pronounce it as [CHAI] or even [CHEE]. The test itself is called the chi-squared test (again, pronounced as [KHAI-squared], not [CHAI- or CHEE-squared]).
  4. You can check the significance of any χ2 with a convenient online calculator, like this one here: https://www.socscistatistics.com/pvalues/chidistribution.aspx.

License

Simple Stats Tools Copyright © by Mariana Gatzeva. All Rights Reserved.
