Chapter 10 Testing Associations II: Correlation and Regression

10.1. Correlation

You will recall from Section 7.2.3 that we use Pearson’s coefficient of correlation r to examine associations between two continuous variables. The correlation coefficient r varies between -1 and 1. The closer it is to either -1 or 1, the stronger the correlation; the closer it is to 0, the weaker the correlation[1].

 

Where does r come from though? What does it actually measure? I doubt you have lost sleep wondering about these questions which I left unanswered in Chapter 7, but here is your chance to learn this anyway (think of it as closure of sorts).

 

The correlation coefficient is, essentially, a ratio of the variabilities of the two variables[2].

The easiest way to calculate r between a variable x and a variable y is through the distances of the observations from the means of the two variables, or, more precisely, through their sums of squares and products[3]:

 

    \[r=\frac{\Sigma{(x-\overline{x})(y-\overline{y})}}{\sqrt{\Sigma{(x-\overline{x})^2}\Sigma{(y-\overline{y})^2}}}\]

 

From Section 4.3, we know that \Sigma{(x-\overline{x})}^2 is the sum of squares of the variable x (so, SSx); by analogy, \Sigma{(y-\overline{y})}^2 will be the sum of squares of the variable y (so, SSy). When, for each observation, the two distances from the means are “cross-multiplied” and then summed (as in the numerator), the result is called the sum of products (SPxy).

 

Thus we can restate the formula above in the following simplified (and easier to remember) way[4]:

 

    \[r=\frac{SP_{xy}}{\sqrt{SS_x SS_y}}\]
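If you happen to work with software other than SPSS and would like to see this recipe spelled out step by step, here is a minimal sketch in Python (the function and variable names are mine, chosen purely for illustration):

    def pearson_r(x, y):
        """Pearson's r from two equal-length lists, via sums of squares and products."""
        n = len(x)
        x_bar = sum(x) / n                             # mean of x
        y_bar = sum(y) / n                             # mean of y
        ss_x = sum((xi - x_bar) ** 2 for xi in x)      # SSx
        ss_y = sum((yi - y_bar) ** 2 for yi in y)      # SSy
        sp_xy = sum((xi - x_bar) * (yi - y_bar)        # SPxy
                    for xi, yi in zip(x, y))
        return sp_xy / (ss_x * ss_y) ** 0.5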

 

Example 10.1(A) provides an empirical application of r’s calculation.

 

Example 10.1(A) Education and Parental Education, GSS 2018

 

Table 10.1 lists the years of schooling (our variable y) of seven respondents in the GSS 2018 (NORC, 2019) and the years of schooling of their respective fathers (our variable x)[5]. While inference with N=7 is not a serious proposition, the small observation count allows for a quick calculation for demonstration purposes only. (After all, we already know the correlation coefficient of these exact same two variables from Section 7.2.3; there the SPSS-calculated r was equal to 0.413.)

 

The rest of the columns in Table 10.1 list the necessary computations (obtaining distances from the mean, squaring distances, summing distances, etc.) to produce SSx, SSy, and SPxy.

 

Table 10.1 Calculating Pearson’s r

x | y | (x-\overline{x}) | (x-\overline{x})^2 | (y-\overline{y}) | (y-\overline{y})^2 | (x-\overline{x})(y-\overline{y})
12 | 8 | (12-12.4) = -0.4 | (-0.4)^2 = 0.2 | (8-13.6) = -5.6 | (-5.6)^2 = 31.4 | (-0.4)(-5.6) = 2.2
6 | 12 | (6-12.4) = -6.4 | (-6.4)^2 = 41 | (12-13.6) = -1.6 | (-1.6)^2 = 2.6 | (-6.4)(-1.6) = 10.2
12 | 19 | (12-12.4) = -0.4 | (-0.4)^2 = 0.2 | (19-13.6) = 5.4 | (5.4)^2 = 29.2 | (-0.4)(5.4) = -2.2
16 | 16 | (16-12.4) = 3.6 | (3.6)^2 = 13 | (16-13.6) = 2.4 | (2.4)^2 = 5.8 | (3.6)(2.4) = 8.6
15 | 12 | (15-12.4) = 2.6 | (2.6)^2 = 6.8 | (12-13.6) = -1.6 | (-1.6)^2 = 2.6 | (2.6)(-1.6) = -4.2
12 | 12 | (12-12.4) = -0.4 | (-0.4)^2 = 0.2 | (12-13.6) = -1.6 | (-1.6)^2 = 2.6 | (-0.4)(-1.6) = 0.6
14 | 16 | (14-12.4) = 1.6 | (1.6)^2 = 2.6 | (16-13.6) = 2.4 | (2.4)^2 = 5.8 | (1.6)(2.4) = 3.8
\overline{x} = 12.4 | \overline{y} = 13.6 | | SSx = 63.7 | | SSy = 79.7 | SPxy = 19.3

(Note: the means and the totals SSx, SSy, and SPxy are computed from unrounded values, so they may differ slightly from what you would get by adding up the rounded entries shown in each column.)

Then, according to the formula for r we have:

 

    \[r=\frac{SP_{xy}}{\sqrt{SS_x SS_y}}=\frac{19.3}{\sqrt{63.7\times79.7}}=\frac{19.3}{71.3}=0.271\]

 

Obviously, this r=0.271 is not the same as the SPSS-produced r=0.413 we had from Section 7.2.3; in fact, it would be very surprising if they were the same, considering the former is based on N=7 while the latter is based on N=1,687. The exact value of r in the above calculation (r=0.271) doesn’t matter and shouldn’t be interpreted; it exists only as the end result of our demonstration.
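If you prefer to verify such hand calculations with software, the same quantities can be reproduced from the raw data of Table 10.1. Below is a quick sketch in Python, assuming the NumPy library is available (the rounded values from the table appear as comments):

    import numpy as np

    # The seven (x, y) pairs from Table 10.1: father's and respondent's years of schooling.
    x = np.array([12, 6, 12, 16, 15, 12, 14])
    y = np.array([8, 12, 19, 16, 12, 12, 16])

    ss_x = np.sum((x - x.mean()) ** 2)               # SSx  ≈ 63.7
    ss_y = np.sum((y - y.mean()) ** 2)               # SSy  ≈ 79.7
    sp_xy = np.sum((x - x.mean()) * (y - y.mean()))  # SPxy ≈ 19.3
    r = sp_xy / np.sqrt(ss_x * ss_y)                 # r    ≈ 0.271

    # Cross-check with NumPy's built-in correlation matrix (same result).
    r_check = np.corrcoef(x, y)[0, 1]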

 

Fancy trying it out on your own?

 

Do It! 10.1 Calculating Pearson’s r

 

Here are 7 more cases from the same GSS 2018 dataset. Fill out the table fully and produce r. (Once you are done, you can check your answer against the short code sketch below the table.)

x | y | (x-\overline{x}) | (x-\overline{x})^2 | (y-\overline{y}) | (y-\overline{y})^2 | (x-\overline{x})(y-\overline{y})
12 | 12 | | | | |
12 | 14 | | | | |
13 | 13 | | | | |
13 | 16 | | | | |
14 | 20 | | | | |
20 | 16 | | | | |
21 | 18 | | | | |
\overline{x} = | \overline{y} = | | SSx = | | SSy = | SPxy =
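Once you have filled in the table by hand, you can check the r you obtained against a software calculation. Here is one quick way to do it in Python, assuming NumPy is available (it only gives you the final answer, not the intermediate columns):

    import numpy as np

    # The seven (x, y) pairs from Do It! 10.1.
    x = np.array([12, 12, 13, 13, 14, 20, 21])
    y = np.array([12, 14, 13, 16, 20, 16, 18])

    # Compare this to the r you calculated by hand.
    print(round(np.corrcoef(x, y)[0, 1], 3))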

 

Even if we dismiss the value of the N=7 coefficients and go back to r=0.413 based on N=1,687, we still want to know if this correlation as observed in the sample is statistically significant (i.e., generalizable to the population). Thus, we need to test r, and we do that through a t-test.

 

The t-test for Pearson’s r is given by the following formula:

 

    \[t=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}\]

 

with df=N-2.
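If you want to let software do the work, the same test can be written in a few lines of Python. The sketch below assumes the SciPy library is installed; the function name is mine, for illustration only:

    import math
    from scipy import stats

    def t_test_for_r(r, n):
        """t statistic and two-tailed p-value for a sample correlation r based on n cases."""
        t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
        p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p from the t distribution with n-2 df
        return t, p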

 

Example 10.1(B) Testing the Education and Parental Education Correlation, GSS 2018

 

As usual, it helps to know what we are testing exactly:

  • H0: There is no correlation between parental and offspring education; ρ=0.
  • Ha: There is a correlation between parental and offspring education; ρ≠0.

Then, for N=1,687 and r=0.413, we have:

 

    \[t=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}=\frac{0.413\sqrt{1687-2}}{\sqrt{1-0.413^2}}=\frac{0.413(41.1)}{0.911}=18.633\]

 

With t=18.633, df=1,685, and p=0.00001 (i.e., p=0.00001<0.05), we can reject the null hypothesis that parental and offspring education are not correlated. At this time, we have enough evidence to conclude that there is a moderately weak (r=0.413), statistically significant correlation between parental education and offspring education in the US population[6].
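For those following along in code, plugging the example’s numbers into the formula (here directly in Python, assuming SciPy is installed) reproduces the result:

    import math
    from scipy import stats

    r, n = 0.413, 1687
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # ≈ 18.6, matching the hand calculation
    p = 2 * stats.t.sf(abs(t), df=n - 2)               # effectively zero, far below 0.05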

 

 

With this, we have established (with 99% certainty) that parental education and offspring education are correlated. Considering that, on average, parents tend to complete their schooling before their children complete theirs, it is also reasonable to assume that parental education affects offspring education (and not vice versa)[7].

 

Wouldn’t it then be good to know exactly how much of an effect parental education has on offspring education? That is, if one father had one more year of schooling than another, how much more schooling would the child of the former be expected to have compared to the child of the latter? One type of regression, called linear regression, can tell us just that.


  1. The sign of r only indicates the direction of the association, positive or negative, nothing else. This is a reminder not to treat r's sign as a measure of the magnitude or strength of the association. For example, -0.9 indicates a stronger association than 0.2 because -0.9 is closer to -1 than 0.2 is to 1. (In fact, 0.2 is much closer to 0, or no association.) That is, a strong negative correlation is stronger than a weak positive one, despite the fact that -0.9<0.2.
  2. To be precise, the ratio is between the covariance of x and y (i.e., their joint variability, sxy) and the product of their standard deviations sx and sy:  

        \[r=\frac{s_{xy}}{s_x s_y}\]

    or  

        \[\rho=\frac{\sigma_{xy}}{\sigma_x \sigma_y}\]

    if we apply it to a population instead of a sample. (Here ρ is the lower-case Greek letter rho, the Greek counterpart of r, pronounced [ROH].)
  3. Recall that the sum of squares was the numerator in the formulas for the variance and the standard deviation. We take the distances of the observations from the mean, square them, and then add them all together. (We square them before adding to turn them all positive; otherwise they would cancel each other out upon summation. See Section 4.3 (https://pressbooks.bccampus.ca/simplestats/chapter/4-3-variance/) for details.)
  4. Note that other "versions" of the formula for r exist. All of them calculate the same r; they are just restated in different terms. The two "versions" presented in the text above are the simplest. For example, one of the most common ways to express r that you may find elsewhere (but which is rather hard on the eyes and for purposes of calculation by hand) is this:       \[r=\frac{N\Sigma{xy} -\Sigma{x}\Sigma{y}}{\sqrt{(N\Sigma{x^2}-(\Sigma{x})^2)(N\Sigma{y^2}-(\Sigma{y})^2)}}\]
  5. Here parental education is the independent variable and respondent's education is the dependent variable, so they are denoted as x and y, respectively, according to convention.
  6. Purely for demonstration purposes, we could also calculate the t for the 7 respondents whose responses we used to calculate r=0.271:  

        \[t=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}=\frac{0.271\sqrt{7-2}}{\sqrt{1-0.271^2}}=\frac{0.271(2.236)}{0.963}=0.629\]

      In this case, we could interpret the results like this: "With t=0.629, df=5, and p≈0.56 (i.e., p>0.05), we cannot reject the null hypothesis that parental and offspring education are not correlated. At this time, we do not have enough evidence to conclude that there is a statistically significant correlation between parental education and offspring education in the US population." However, we cannot trust this "inference" as it is only based on N=7.
  7. In terms of establishing causality, we are limited by the bivariate case we have: it is entirely possible (and expected) that other things affect offspring education too, not just their parents' education. As well, it is possible that something else (for example, wealth, income, socioeconomic class, etc.) might be affecting both parental and offspring education, rendering the effect of parental education on offspring education spurious. These types of considerations are exactly the purpose of multivariate analysis, but since we are dealing with bivariate analysis here, we have to leave them aside. I bring them up here to remind you not to forget them in the discussion that follows, which will focus on the two variables at hand.
