Chi-Squared Test of Independence

Steps for Chi-Squared Test of Independence

Learning Objectives

Define the steps and formulas required to perform a Chi-Squared Test for Independence

Let us now present the steps and formulas we will need to perform a Chi-Squared Test. See the section 'Another Explanation for the [latex]\chi^2[/latex] Test' below for more insight into why these formulas work and what they mean.

Null and Alternate Hypotheses

We are, again, performing a hypothesis test so we need to define our null and alternate hypotheses:

H0: The two categorical variables are independent ([latex]\chi^2 = 0[/latex])

HA: The two categorical variables are dependent ([latex]\chi^2 \neq 0[/latex])

Expected Value Formula

We calculate an expected value for each category for both populations/groups. This is the frequency we would expect if the two categorical variables are independent.

\[\text{Expected Value}= \frac{\text{Row Total}\times \text{Column Total}}{\text{Grand Total}} \]

We read down the table to determine the column total and read across to determine the row total (the number of people/events in that category total). We then divide by the total number of people/events overall (‘Grand Total’).
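As a sketch of this calculation, the snippet below applies the expected value formula to a hypothetical 2×2 contingency table (the counts are made up purely for illustration):

```python
# Hypothetical 2x2 contingency table of observed counts; rows are
# groups/populations, columns are categories (made-up numbers)
observed = [
    [20, 30],  # Group 1
    [25, 25],  # Group 2
]

row_totals = [sum(row) for row in observed]          # read across
col_totals = [sum(col) for col in zip(*observed)]    # read down
grand_total = sum(row_totals)                        # everyone overall

# Expected Value = (Row Total x Column Total) / Grand Total, per cell
expected = [
    [rt * ct / grand_total for ct in col_totals]
    for rt in row_totals
]
print(expected)  # [[22.5, 27.5], [22.5, 27.5]]
```

Note that the expected counts need not be whole numbers: they are the frequencies we would expect *on average* if the variables were independent.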

[latex]\chi^2_{test}[/latex] Formula

For each category, we now take the squared difference between the observed (actual) value and the expected value, divide by the expected value, and sum over all categories:

\[ \chi^2_{test} = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} \]

[latex]\chi^2[/latex] is, essentially, a weighted sum of the squared differences between the actual and expected frequencies. If it is much larger than zero, then the actual values are very different from the values we would expect if the two categorical variables were independent.
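To illustrate, the snippet below computes the test statistic cell by cell for a hypothetical 2×2 table (the observed counts are made up; the expected counts come from the expected value formula applied to that same table):

```python
# Hypothetical observed counts and their matching expected counts
# (expected = row total x column total / grand total, per cell)
observed = [[20, 30], [25, 25]]
expected = [[22.5, 27.5], [22.5, 27.5]]

# chi-squared test statistic: sum of (obs - exp)^2 / exp over every cell
chi2_test = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(chi2_test, 4))  # 1.0101
```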

Degrees of Freedom and p-value Formula

Once we have determined the test statistic ([latex]\chi^2_{test}[/latex]), next, we should determine the associated p-value. Before we do that, we need to calculate the degrees of freedom for the problem:

\[ \text{Degrees of Freedom} = df = (\#\text{ rows} - 1)\times (\#\text{ columns} - 1) \]

We now plug the test statistic and the degrees of freedom into Excel's CHISQ.DIST.RT function:

\[\text{p-value} = \text{CHISQ.DIST.RT}(\chi^2_{test}, df) \]

If the p-value returned is much less than the level of significance, then the deviations between the observed and the expected counts are too large to be attributed to chance alone; that is, there is evidence of a dependence between the categorical variables.
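Outside of Excel, the same right-tail probability can be computed with SciPy's `chi2.sf` (the survival function, i.e. the right tail, which is the same quantity CHISQ.DIST.RT returns). The test statistic and table dimensions below are made-up values for illustration:

```python
from scipy.stats import chi2

# Hypothetical inputs: a 2x2 table gives df = (2 - 1) x (2 - 1) = 1,
# and we use a made-up test statistic of about 1.0101
chi2_test = 1.0101
n_rows, n_cols = 2, 2
df = (n_rows - 1) * (n_cols - 1)

# chi2.sf(x, df) is the right-tail area, matching CHISQ.DIST.RT(x, df)
p_value = chi2.sf(chi2_test, df)
print(round(p_value, 3))
```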

Decision

Just like all the other hypothesis tests we have performed, if the p-value returned is less than the level of significance, then we reject H0. If not, we fail to reject H0. I.e.:

  • if p-value < L.O.S (Level of Significance): Reject H0
  • if p-value ≥ L.O.S (Level of Significance): Do not reject H0

Conclusion

Again, like all of the other hypothesis tests in previous sections, if we reject H0, there is sufficient evidence to conclude (whatever it is we are trying to conclude). In this case, there will be sufficient evidence to conclude that the two categorical variables are dependent if we reject H0. I.e.:

  • Reject H0: There is sufficient evidence to conclude that the two categorical variables are dependent.
  • Do not reject H0: There is not sufficient evidence to conclude that the two categorical variables are dependent.
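The whole procedure, from contingency table to decision, can be sketched in a few lines using SciPy's `chi2_contingency`, which computes the test statistic, p-value, degrees of freedom, and expected counts in one call. The table and significance level below are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts and a 5%
# level of significance (both made up for illustration)
observed = [[20, 30], [25, 25]]
alpha = 0.05

# correction=False matches the hand formula above
# (no Yates continuity correction for 2x2 tables)
chi2_test, p_value, df, expected = chi2_contingency(observed, correction=False)

if p_value < alpha:
    print("Reject H0: sufficient evidence the variables are dependent.")
else:
    print("Do not reject H0: not sufficient evidence the variables are dependent.")
```

For this particular made-up table the p-value is well above 0.05, so we do not reject H0.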

Another Explanation for the [latex]\chi^2[/latex] Test

Another explanation for the [latex]\chi^2[/latex] test is:

  • We calculate the frequencies for each category that we would expect to get if there were no difference between the proportions for each group.
  • I.e.: The expected values are calculated by assuming that the categories are independent of which population they belong to.
  • We then calculate a weighted squared difference between the expected frequencies we calculated and the actual frequencies.
  • This difference is called the [latex]\chi^2_{test}[/latex].
  • If the value of [latex]\chi^2_{test}[/latex] is large, this means that the actual values are very different from the values we should get if the categories were independent of the population they belong to.
  • If that’s the case, we conclude that the categories cannot be independent of the populations they belong to.
  • Another conclusion we can draw is that the proportions are different between populations.

License


An Introduction to Business Statistics for Analytics (1st Edition) Copyright © 2024 by Amy Goldlist; Charles Chan; Leslie Major; Michael Johnson is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
