{"id":976,"date":"2019-03-21T23:20:26","date_gmt":"2019-03-22T03:20:26","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/simplestats\/?post_type=chapter&#038;p=976"},"modified":"2019-11-02T23:53:46","modified_gmt":"2019-11-03T03:53:46","slug":"7-2-3-between-two-continuous-variables","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/simplestats\/chapter\/7-2-3-between-two-continuous-variables\/","title":{"raw":"7.2.3 Between Two Continuous Variables","rendered":"7.2.3 Between Two Continuous Variables"},"content":{"raw":"&nbsp;\r\n\r\nThe distinctive feature of continuous variables is their large number of values. As discussed previously, typically we treat most interval\/ratio variables as continuous. However, sometimes ordinal variables too can have a number of categories large enough to justify their treatment as continuous for the purposes of statistical analysis. (Think back to the previous Section 7.2.2 and imagine crosstabulating a variable with, say, 10+ categories on another; the resulting table will be too unwieldy for meaningful examination.)\r\n\r\n&nbsp;\r\n\r\nAs well, continuous variables have values of different magnitudes, which can be ordered from low to high. Thus, what we will be looking for when examining two such variables for a possible association is whether a pattern exists between their values, or, alternatively, whether their values do not exhibit any predictable combination. While many types of patterns can exist, for the purposes of this introductory text we'll focus on the two simplest ones: a <em>positive linear <\/em>association and a <em>negative linear<\/em> association. The way we describe and examine such associations is visually through a graph called a\u00a0<em>scatterplot<\/em> and numerically through a special indicator called <em>Pearson's correlation coefficient r<\/em> (or <em>Pearson's r<\/em>, or just <em>r<\/em>). 
I explain both below.\r\n\r\n&nbsp;\r\n\r\n<strong>A positive linear association is a pattern in which low values of one variable go with low values of the other variable, while high values of the former go with high values of the latter.<\/strong> That is, in a positive linear association, when the values of Variable 1 increase, the values of Variable 2 tend to increase too, and when the values of Variable 1 decrease, so do the values of Variable 2. As its name suggests, <strong>a negative linear association is the exact opposite: low values of one variable go with high values of the other variable and vice versa.<\/strong> Then, as the values of Variable 1 <em>increase<\/em>, the values of Variable 2 will tend to <em>decrease<\/em>, or vice versa.\r\n\r\n&nbsp;\r\n\r\nBoth the positive and the negative version of this pattern are called <em>linear<\/em> because plotting the values of the two variables on a coordinate system shows the data points \"congregating\" in an approximately \"straight\" fashion, as if along an imaginary straight line with an upward (i.e., positive) or downward (i.e., negative) slope[footnote]Associations other than linear ones also exist, e.g., <em>curvilinear<\/em> (imagine U-shaped or inverted U-shaped <em>curves<\/em> in the data, instead of a straight line). Analyzing these is more complicated and beyond the scope of this book. The discussion hereafter will consider only bivariate linear associations, whether or not I mention it explicitly.[\/footnote].\r\n\r\n&nbsp;\r\n\r\nConsider the following two example figures.\r\n\r\n&nbsp;\r\n\r\n<em>Figure 7.3(A)<\/em>\u00a0<em>Positive Association: Test Scores by Class Attendance (Simulated Data<\/em>[footnote]The simulated data used here for illustration purposes only is provided by DataBake (www.databake.io). 
[see terms of use 3.6, 3.7: (free) datasets can be copied, modified, stored or otherwise used for your own personal, academic, or internal business purposes][\/footnote]<em>)<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1011 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\nIn the <strong>scatterplot<\/strong> in Figure 7.3(A) above, I have plotted data from 35 imaginary students on their class attendance and subsequent final test scores[footnote]The data is called <em>simulated<\/em> as it's computer-generated for the purposes of the exercise.[\/footnote]. Both <em>class attendance<\/em> and <em>test scores<\/em> are continuous variables. (Attendance is a ratio variable measuring the proportion of class time attended, while test scores is an interval variable measured in percentages.) Each data point represents <em>simultaneously<\/em> a student's attendance (on the horizontal axis) and that student's test score (on the vertical axis); e.g., the lowest\/left-most data point stands for a student who attended about 20% of class time and scored less than 20% on the final exam. The data points look \"scattered\" all over the graph, hence the name <em>scatterplot<\/em>.\r\n\r\n&nbsp;\r\n\r\nYou can easily see the pattern in the data in Figure 7.3(A): lower attendance seems to go with lower test scores, and higher attendance with higher scores. The bottom right side (high attendance\/low scores) and the top left side (low attendance\/high scores) of the graph are empty: there seem to be no students who attended classes a lot but scored low on the test, nor students who didn't attend much but scored high on the test. 
Had there been no pattern, the data points would be spread all over the graph, showing no clear \"congregation\" of values based on their magnitude.\r\n\r\n&nbsp;\r\n\r\nSince class attendance and test scores seem to go <em>concordantly<\/em> \"together\" (i.e., low\/low and high\/high), we have an indication of a <em>positive<\/em> association.\r\n\r\n&nbsp;\r\n\r\n<em>Figure 7.4(A)<\/em>\u00a0<em>Negative Association: Test Scores by Time Spent On Social Media (Simulated Data)<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1012 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\nAgain, both <em>time on social media<\/em> and <em>test scores<\/em> are continuous variables, with time on social media measured in average hours per day.\r\n\r\n&nbsp;\r\n\r\nThe pattern in Figure 7.4(A) is the opposite of the one we had before: fewer hours spent on social media seem to go with higher test scores, and higher social media usage with lower scores. This time, the bottom left side (low on social media\/low scores) and the top right side (high on social media\/high scores) of the graph are empty: there seem to be no students who spent very little time on social media but scored low on the test, nor students who had high usage of social media but scored high on the test.\r\n\r\n&nbsp;\r\n\r\nSince social media usage and test scores seem to go <em>discordantly<\/em> \"together\" (i.e., low\/high and high\/low), here we have an indication of a <em>negative<\/em> association.\r\n\r\n&nbsp;\r\n\r\nFigure 7.3(B) and Figure 7.4(B) below make the point about linearity clearer by adding something called a <em>line of best fit\u00a0<\/em>to the original graphs[footnote]We discuss the line of best fit (aka regression line) in Chapter 10 later.[\/footnote]. 
<strong>The slope of the line indicates the nature of the supposed association: upward\/positive or downward\/negative.<\/strong>\r\n\r\n&nbsp;\r\n\r\nFigure 7.3(B)\u00a0<em>Positive Association: Test Scores by Class Attendance With Line of Best Fit<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-line.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1016 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\nFigure 7.4(B)\u00a0<em>Negative Association: Test Scores by Time Spent On Social Media With Line of Best Fit<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-lineA.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1019 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\nCompare the slopes of the lines in the figures above to the one in Figure 7.5 below.\r\n\r\n&nbsp;\r\n\r\n<em>Figure 7.5 No Association: Test Scores By Student Number in Class (Selected Scores)<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-flat-lineA.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1022 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\nThe graph in Figure 7.5 above plots the non-existent association between a student's number in the class and their final test score. Of course, this is a bogus \"association\" which I'm showing here only as an example of <strong>a <em>flat<\/em> line of best fit, an indication that the two variables have nothing to do with each other<\/strong>. The line in Figure 7.5\u00a0 is not perfectly flat, however, so it helps to have a numerical indication of association in addition to the visual ones the scatterplots give us.\r\n\r\n&nbsp;\r\n\r\nBefore we get to that, a word of warning. 
The presumption of linearity for this type of analysis is <em>very<\/em> important and <strong>you should make sure not to impose linearity where it doesn't exist<\/strong>. The caveat below explains.\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--learning-objectives\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch Out!! #14\u00a0<\/strong><\/span>. . .\u00a0 For Non-Linear Associations<\/em><\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n&nbsp;\r\n\r\nData points without a pattern produce a flat\u00a0(i.e., with no linear slope) line of best fit, as shown in Figure 7.5 above. However, <strong>data points in a non-linear pattern can also result in a flat (i.e., with no linear slope) line of best fit<\/strong>, if we insist on seeing the variables as linearly associated. This can lead to dismissing a potential association only because it's non-linear, which would be a mistake. While this textbook doesn't go into non-linear associations, this doesn't mean they do not exist or are not important: on the contrary, they do and they are, but they require you to use different methods to investigate them.\r\n\r\n&nbsp;\r\n\r\nMy warning here is simple: <strong>When working with given data, keep an eye on potential non-linearity. 
Otherwise you may incorrectly assume no association when in fact a non-linear association exists.\u00a0<\/strong>Figure 7.6 below illustrates.\r\n\r\n&nbsp;\r\n\r\n<em>Figure 7.6\u00a0Curvilinear Association: Test Scores By Student Number in Class (All Scores)<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-CUBIC-flat-line.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1023 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\nSurprisingly enough, Figure 7.6\u00a0 shows that students at the beginning and at the end of the class list scored lower on their final test than their peers for whatever reason, or simply by chance (my bet would be on the latter).\r\n\r\n&nbsp;\r\n\r\nRegardless of the reason or lack thereof, my goal here is to show you that imposing linearity by drawing a <em>linear<\/em> line of best fit produces a flat line, which one may hastily take as an indication of no association (see the straight blue line on the graph). A closer and more careful look, however, reveals the inverted-U shape pattern of the data points in the scatterplot: As the student numbers increase initially, so do the test scores. Then, as the student numbers continue to increase, the test scores start decreasing (see the curved red line following the data points much more closely than the flat blue one). This is clearly a pattern that should not be ignored in any serious, real-life study.\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\n<\/div>\r\n&nbsp;\r\n\r\nA visual summary of the data and any potential bivariate associations like the scatterplot is thus very useful. Scatterplots are in fact rather indispensable if one is to base their analysis on the assumption of a linear association between two continuous variables. 
Still, as in the previous two cases of two discrete variables and a discrete and a continuous variable, a numerical summary of the potential association can be of great help.\r\n\r\n&nbsp;\r\n\r\nFor discrete variables we could examine and report differences of proportions, while for a discrete and a continuous variable we could use differences of means (or medians). In both cases we could compare groups (on proportions, or means). In the case of continuous variables, we have neither groups, nor a convenient number to compare them on. Instead, here we have a correlation coefficient,\u00a0<em>Pearson's r<\/em>. The correlation coefficient takes all data points simultaneously and summarizes to what extent certain values of one of the variables go with certain values of the other variable (i.e., if they form a pattern or they vary independently of each other).\r\n\r\n&nbsp;\r\n\r\nWhile we will examine the exact definition and calculation of Pearson's <em>r<\/em> in Chapter 10 later, for now we'll focus on its interpretation.\r\n\r\n&nbsp;\r\n\r\n<strong>The correlation coefficient <em>r<\/em> is a number between -1 and +1, indicating the strength of any possible (linear) association between two continuous variables.<\/strong> However, there is a catch: <strong>the strength of the association is calculated in absolute terms while the \u00b1 sign is there to indicate whether the association is positive or negative<\/strong>. 
Thus, both <em>r<\/em>=-1 and<em> r<\/em>=1 stand for the strongest possible (i.e., perfect) correlation, the former a\u00a0perfect <em>negative association<\/em>, the latter a\u00a0perfect <em>positive association.<\/em> Between them is <em>r<\/em>=0, or no association.\r\n\r\n&nbsp;\r\n\r\nWhile perfect correlations (<em>r<\/em>=\u00b11) are very rare (if not non-existent)[footnote]The obvious exception here is the correlation of a variable with itself, which will produce <em>r<\/em>=1.[\/footnote], most variables' associations are somewhere between 0 and\u00a0\u00b11.\u00a0 <strong>The closer a correlation is to 0, the weaker it is; the closer the correlation is to -1 or +1, the stronger it is.<\/strong> Typically, in the social sciences a correlation of about <em>r<\/em>=\u00b10.7 would be considered strong, a correlation of about <em>r<\/em>=\u00b10.5 would be considered moderate, and a correlation of about <em>r<\/em>=\u00b10.3 would be considered weak. Correlations around \u00b10.8 or\u00a0\u00b10.9 would therefore be very strong, while associations around\u00a0\u00b10.2 and\u00a0\u00b10.1 would be quite weak.\r\n\r\n&nbsp;\r\n\r\nNow that you are well-equipped with knowledge about interpreting correlations, let's see what the correlations of the associations discussed above were.\r\n\r\n&nbsp;\r\n\r\nFirst we looked at class attendance and test scores (Figures 7.3(A) and 7.3(B)); the correlation between the two variables was a very strong\u00a0<em>r<\/em>=0.881. Then, we looked at social media usage and test scores (Figures 7.4(A) and 7.4(B)), where the correlation was an equally strong\u00a0<em>r<\/em>=-0.882[footnote]If you're wondering why the correlations appear to be of the same strength, the reason lies in the way I created the synthetic variable <em>social media usage<\/em> -- as an inversion of the simulated variable <em>class attendance<\/em>. I did warn you the data is made up as a heuristic. 
(Do not take this to mean that such associations -- between attendance and class performance and social media usage and test scores -- do not exist in real life.)[\/footnote]. Finally, we discussed the practically non-existent linear association between student number and test scores (of selected students, Figure 7.5) whose <em>r<\/em>=0.049, while the improperly imposed linearity in Figure 7.6 from the caveat had a similar so-weak-almost-zero linear correlation of <em>r<\/em>=-0.051.\r\n\r\n&nbsp;\r\n\r\nTired of fake data? Ready to return to the real world of sociological research? Then let's take a real example with existing data and see how it all works out.\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em>Example 7.5<\/em>\u00a0<em>Intergenerational Reproduction of Privilege in Education in the USA (GSS 2018)<\/em><\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n&nbsp;\r\n\r\nFor this example I used data from the <em>General Social Survey (GSS) 2018<\/em>, conducted by the National Opinion Research Center (NORC) at the University of Chicago. I'm interested in exploring whether <em>father's education<\/em> and the <em>education<\/em> of the respondent are potentially correlated. Both father's education and education of the respondent are measured in years of schooling, ranging from 0 (no education) to 20 years. As such they are discrete ratio variables which we can treat as continuous because their number of values is quite large (twenty-one to be precise). 
Figure 7.7 shows the relevant scatterplot.\r\n\r\n&nbsp;\r\n\r\n<em>Figure 7.7\u00a0Respondent's Years of Schooling<\/em><em style=\"text-indent: 1em;font-size: 1rem\">\u00a0by Father's Years of Schooling (GSS, 2018)<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-paeduc-educ-gss-line.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1034 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\nThere are several things to note in the graph above. One is that the data points look less \"scattered\" and more orderly arranged in neat rows and columns than would be the case had we variables with a much larger number of values. Furthermore, while <em>N<\/em>=1,687, there are far fewer data points on the scatterplot: the reason, of course, is that there are many observations \"on top\" of each other, i.e., most data points represent more than one person's combination of their own years of education and their respective father's years of education. (After all, most such combinations are unlikely to be unique; we can arguably expect there to be more than one respondent and their father both having, say, 12 years of education in the dataset.)\r\n\r\n&nbsp;\r\n\r\nSubstantively, however, what do we see in the scatterplot above? To the extent that there are respondents with low levels of education, they seem to have fathers with low levels of education too. As well, while respondents with higher levels of education seem to have fathers with all levels of education, those with higher parental education appear to be more numerous than those with lower parental education. (That is, both the left and the right side of the upper half of the scatterplot have many observations, but the top right area does seem to contain more observations than the top left area.) 
Finally, and most importantly, there seem to be almost no respondents with low levels of education whose fathers had high levels of education (note the empty bottom right area of the graph).\r\n\r\n&nbsp;\r\n\r\nAll in all, it seems like more years of father's education \"go\" with more years of respondent's education, and fewer years of father's education \"go\" with fewer years of respondent's education -- though not completely so, or the top left area of the graph (the less educated fathers with more educated offspring) would be empty too. This is reflected in the line of best fit whose slope, while positive, is not very steep.\r\n\r\n&nbsp;\r\n\r\n<strong>Ultimately, the scatterplot indicates that <em>father's education<\/em> and<em> respondent's education<\/em> seem positively associated in the dataset but also that this association is not very strong.<\/strong> That is, there appears to be intergenerational reproduction of privilege in education; fortunately, however, one's father's lower levels of education don't seem to completely preclude one's own educational attainment.\r\n\r\n&nbsp;\r\n\r\nThe correlation coefficient provides a numerical summary of the potential association described above.\r\n\r\n&nbsp;\r\n\r\n<em>Table 7.4<\/em>\u00a0<em>Correlation between Father's Years of Schooling and Respondent's Years of Schooling (GSS 2018)<\/em>\r\n\r\n<img src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/correlation-paeduc-educ-gss.png\" alt=\"\" width=\"492\" height=\"262\" class=\"wp-image-1035 size-full aligncenter\" \/>\r\n\r\n&nbsp;\r\n\r\n<strong>SPSS's output provides <em>r<\/em>\u00a0as \"Pearson Correlation\", and here <em>r<\/em>=0.413. As suspected, this reflects a positive, moderate to moderately weak association.<\/strong>[footnote]Note that SPSS's bivariate correlation tables are 2x2 tables, with the information repeated twice. 
Thus, while four coefficients are provided in the central cells of the table, they are actually two pairs of the same two correlations. (That is, correlations are symmetric: correlating <em>Variable 1<\/em> with <em>Variable 2<\/em> is the same as correlating <em>Variable 2<\/em> with <em>Variable 1<\/em>.) As well, one of these two pairs is always equal to 1, as a variable correlated with itself is a perfect correlation. This is shown in the table as <em>corr<\/em>(<em>Highest year school completed, Highest year school completed, father<\/em>)=0.413=<em>corr<\/em>(<em>Highest year school completed, father, Highest year school completed<\/em>) and <em>corr<\/em>(<em>Highest year school completed, Highest year school completed<\/em>)=1=<em>corr<\/em>(<em>Highest year school completed, father, Highest year school completed, father<\/em>).[\/footnote]\r\n\r\n&nbsp;\r\n\r\n<\/div>\r\n<\/div>\r\n<header><\/header><header><\/header><header class=\"textbox__header\">To summarize, you can describe and examine potential associations between continuous variables through scatterplots with lines of best fit (looking for a concordant or discordant pattern in the data points) and the coefficient of correlation <em>r<\/em> (ranging from 0 to\u00a0\u00b11 in strength, with 0 standing for\u00a0 no correlation and\u00a0\u00b11 constituting a perfect negative or a perfect positive correlation).<\/header><header><\/header><header><\/header><header>Before we move on, the tip below shows how to get the visual and the numerical summary of continuous bivariate associations in SPSS.<\/header><header><\/header><header><\/header><header>\r\n<div class=\"textbox textbox--key-takeaways\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\"><em>SPSS Tip 7.3\u00a0Scatterplot and Correlation Coefficient<\/em><\/p>\r\n\r\n<\/header>&nbsp;\r\n\r\n<strong>For Scatterplots:<\/strong>\r\n<ul>\r\n \t<li><span style=\"font-size: 1rem;text-indent: 0px\">From the <em>Main Menu<\/em> select <em>Graphs<\/em> and, from the 
pull-down menu, <em>Legacy Dialogs<\/em>; click on <em>Scatter\/Dot<\/em>;<\/span><\/li>\r\n \t<li>Keep the pre-selected <em>Simple Scatter<\/em> option and click <em>Define<\/em>;<\/li>\r\n \t<li>In the new window, select one by one your variables of interest from the list on the left and, using the arrow buttons, move them to the <em>X-Axis<\/em> and <em>Y-Axis<\/em>[footnote]At this point, it doesn't really matter which one you put in the X- or Y-Axis though I would suggest placing the variable that precedes the other in time (like father's education generally precedes offspring's education) in the X-Axis. The reasons for this will be explained in Chapter 10.[\/footnote] empty spaces on the right; click <em>OK<\/em>.<\/li>\r\n \t<li>The <em>Output<\/em> window will show the resulting scatterplot; double-clicking on it will open a <em>Chart Editor<\/em> window from where you can change the text, colours, size, etc. of the graph to suit your needs.<\/li>\r\n<\/ul>\r\n<strong>For the correlation coefficient (Pearson's<em> r<\/em>):<\/strong>\r\n<ul>\r\n \t<li>From the <em>Main Menu<\/em>, select <em>Analyze<\/em>;<\/li>\r\n \t<li>From the pull-down menu, select <em>Correlate<\/em> and then <em>Bivariate<\/em>;<\/li>\r\n \t<li>In the resulting window, select one at a time your two variables of interest from the list on the left and, using the arrow button, move them to the <em>Variables<\/em> space on the right (the order is not important); click <em>OK<\/em>.<\/li>\r\n \t<li>The <em>Output<\/em> window will display a symmetric 2x2 table with your requested correlation coefficient.<\/li>\r\n<\/ul>\r\n&nbsp;\r\n\r\n<\/div>\r\n&nbsp;\r\n\r\n<\/header>","rendered":"<p>&nbsp;<\/p>\n<p>The distinctive feature of continuous variables is their large number of values. As discussed previously, typically we treat most interval\/ratio variables as continuous. 
However, sometimes ordinal variables too can have a number of categories large enough to justify their treatment as continuous for the purposes of statistical analysis. (Think back to the previous Section 7.2.2 and imagine crosstabulating a variable with, say, 10+ categories on another; the resulting table will be too unwieldy for meaningful examination.)<\/p>\n<p>&nbsp;<\/p>\n<p>As well, continuous variables have values of different magnitudes, which can be ordered from low to high. Thus, what we will be looking for when examining two such variables for a possible association is whether a pattern exists between their values, or, alternatively, whether their values do not exhibit any predictable combination. While many types of patterns can exist, for the purposes of this introductory text we&#8217;ll focus on the two simplest ones: a <em>positive linear <\/em>association and a <em>negative linear<\/em> association. The way we describe and examine such associations is visually through a graph called a\u00a0<em>scatterplot<\/em> and numerically through a special indicator called <em>Pearson&#8217;s correlation coefficient r<\/em> (or <em>Pearson&#8217;s r<\/em>, or just <em>r<\/em>). I explain both below.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>A positive linear association is a pattern in which low values of one variable go with low values of the other variable, while high values of the former go with high values of the latter.<\/strong> That is, in a positive linear association, when the values of Variable 1 increase, the values of Variable 2 tend to increase too, and when the values of Variable 1 decrease, so do the values of Variable 2. 
As its name suggests, <strong>a negative linear association is the exact opposite: low values of one variable go with high values of the other variable and vice versa.<\/strong> Then, as the values of Variable 1 <em>increase<\/em>, the values of Variable 2 will tend to <em>decrease<\/em>, or vice versa.<\/p>\n<p>&nbsp;<\/p>\n<p>Both the positive and the negative version of this pattern are called <em>linear<\/em> because plotting the values of the two variables on a coordinate system shows the data points &#8220;congregating&#8221; in an approximately &#8220;straight&#8221; fashion, as if along an imaginary straight line with an upward (i.e., positive) or downward (i.e., negative) slope<a class=\"footnote\" title=\"Associations other than linear ones also exist, e.g., curvilinear (imagine U-shaped or inverted U-shaped curves in the data, instead of a straight line). Analyzing these is more complicated and beyond the scope of this book. The discussion hereafter will consider only bivariate linear associations, whether or not I mention it explicitly.\" id=\"return-footnote-976-1\" href=\"#footnote-976-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>Consider the following two example figures.<\/p>\n<p>&nbsp;<\/p>\n<p><em>Figure 7.3(A)<\/em>\u00a0<em>Positive Association: Test Scores by Class Attendance (Simulated Data<\/em><a class=\"footnote\" title=\"The simulated data used here for illustration purposes only is provided by DataBake (www.databake.io). 
[see terms of use 3.6, 3.7: (free) datasets can be copied, modified, stored or otherwise used for your own personal, academic, or internal business purposes&quot;]\" id=\"return-footnote-976-2\" href=\"#footnote-976-2\" aria-label=\"Footnote 2\"><sup class=\"footnote\">[2]<\/sup><\/a><em>)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1011 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>In the <strong>scatterplot<\/strong> in Figure 7.3(A) above, I have plotted data from 35 imaginary students on their class attendance and subsequent final test scores<a class=\"footnote\" title=\"The data is called simulated as it's computer-generated for the purposes of the exercise.\" id=\"return-footnote-976-3\" href=\"#footnote-976-3\" aria-label=\"Footnote 3\"><sup class=\"footnote\">[3]<\/sup><\/a>. Both <em>class attendance<\/em> and <em>test scores<\/em> are continuous variables. 
(Attendance is a ratio variable measuring the proportion of class time attended, while test scores is an interval variable measured in percentages.) Each data point represents <em>simultaneously<\/em> a student&#8217;s attendance (on the horizontal axis) and that student&#8217;s test score (on the vertical axis); e.g., the lowest\/left-most data point stands for a student who attended about 20% of class time and scored less than 20% on the final exam. The data points look &#8220;scattered&#8221; all over the graph, hence the name <em>scatterplot<\/em>.<\/p>\n<p>&nbsp;<\/p>\n<p>You can easily see the pattern in the data in Figure 7.3(A): lower attendance seems to go with lower test scores, and higher attendance with higher scores. The bottom right side (high attendance\/low scores) and the top left side (low attendance\/high scores) of the graph are empty: there seem to be no students who attended classes a lot but scored low on the test, nor students who didn&#8217;t attend much but scored high on the test. 
Had there been no pattern, the data points would be spread all over the graph, showing no clear &#8220;congregation&#8221; of values based on their magnitude.<\/p>\n<p>&nbsp;<\/p>\n<p>Since class attendance and test scores seem to go <em>concordantly<\/em> &#8220;together&#8221; (i.e., low\/low and high\/high), we have an indication of a <em>positive<\/em> association.<\/p>\n<p>&nbsp;<\/p>\n<p><em>Figure 7.4(A)<\/em>\u00a0<em>Negative Association: Test Scores by Time Spent On Social Media (Simulated Data)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1012 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Again, both <em>time on social media<\/em> and <em>test scores<\/em> are continuous variables, with time on social media measured in average hours per day.<\/p>\n<p>&nbsp;<\/p>\n<p>The pattern in Figure 7.4(A) is the opposite of the one we had before: fewer hours spent on social media seem to go with higher test scores, and higher social media usage 
with lower scores. This time, the bottom left side (low on social media\/low scores) and the top right side (high on social media\/high scores) of the graph are empty: there seem to be no students who spent very little time on social media but scored low on the test nor students who had high usage of social media but scored high on the test.<\/p>\n<p>&nbsp;<\/p>\n<p>Since social media usage and test scores seem to go <em>discordantly<\/em> &#8220;together&#8221; (i.e., low\/high and high\/low), here we have an indication of a <em>negative<\/em> association.<\/p>\n<p>&nbsp;<\/p>\n<p>Figure 7.3(B) and Figure 7.4(B) below make the point about linearity clearer by adding something called a <em>line of best fit\u00a0<\/em>to the original graphs<a class=\"footnote\" title=\"We discuss the line of best fit (aka regression line) in Chapter 10 later.\" id=\"return-footnote-976-4\" href=\"#footnote-976-4\" aria-label=\"Footnote 4\"><sup class=\"footnote\">[4]<\/sup><\/a>. <strong>The slope of the line indicates the nature of the supposed association: upward\/positive or downward\/negative.<\/strong><\/p>\n<p>&nbsp;<\/p>\n<p>Figure 7.3(B)\u00a0<em>Positive Association: Test Scores by Class Attendance With Line of Best Fit<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-line.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1016 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-line.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-line-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-line-65x52.png 65w, 
https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-line-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-students-attendance-testscore-line-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Figure 7.4(B)\u00a0<em>Negative Association: Test Scores by Time Spent On Social Media With Line of Best Fit<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-lineA.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1019 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-lineA.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-lineA-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-lineA-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-lineA-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterdplotstudents-attendance-social-media-lineA-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Compare the slopes of the lines in the figures above to the one in Figure 7.5 below.<\/p>\n<p>&nbsp;<\/p>\n<p><em>Figure 7.5 No Association: Test Scores By Student Number in Class (Selected Scores)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-flat-lineA.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1022 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-flat-lineA.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-flat-lineA-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-flat-lineA-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-flat-lineA-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-flat-lineA-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>The graph in Figure 7.5 above plots the non-existent association between a student&#8217;s number in the class and their final test score. Of course, this is a bogus &#8220;association&#8221; which I&#8217;m showing here only as an example of <strong>a <em>flat<\/em> line of best fit, an indication that the two variables have nothing to do with each other<\/strong>. The line in Figure 7.5 is not perfectly flat, however, so it helps to have a numerical indication of association in addition to the visual ones the scatterplots give us.<\/p>\n<p>&nbsp;<\/p>\n<p>Before we get to that, a word of warning. The presumption of linearity for this type of analysis is <em>very<\/em> important and <strong>you should make sure not to impose linearity where it doesn&#8217;t exist<\/strong>. 
The caveat below explains.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--learning-objectives\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\"><em><span style=\"color: #ff0000\"><strong>Watch Out!! #14\u00a0<\/strong><\/span>. . .\u00a0 For Non-Linear Associations<\/em><\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>&nbsp;<\/p>\n<p>Data points without a pattern produce a flat\u00a0(i.e., with no linear slope) line of best fit, as shown in Figure 7.5 above. However, <strong>data points in a non-linear pattern can also result in a flat (i.e., with no linear slope) line of best fit<\/strong>, if we insist on seeing the variables as linearly associated. This can lead to dismissing a potential association only because it&#8217;s non-linear, which would be a mistake. While this textbook doesn&#8217;t go into non-linear associations, that doesn&#8217;t mean they do not exist or are unimportant. On the contrary: they simply require different methods of investigation.<\/p>\n<p>&nbsp;<\/p>\n<p>My warning here is simple: <strong>When working with given data, keep an eye on potential non-linearity. 
Otherwise you may incorrectly assume no association when in fact a non-linear association exists.\u00a0<\/strong>Figure 7.6 below illustrates.<\/p>\n<p>&nbsp;<\/p>\n<p><em>Figure 7.6\u00a0Curvilinear Association: Test Scores By Student Number in Class (All Scores)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-CUBIC-flat-line.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1023 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-CUBIC-flat-line.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-CUBIC-flat-line-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-CUBIC-flat-line-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-CUBIC-flat-line-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-student-number-test-score-CUBIC-flat-line-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Surprisingly enough, Figure 7.6 shows that students at the beginning and at the end of the class list scored lower on their final test than their peers, whether for some reason or simply by chance (my bet would be on the latter).<\/p>\n<p>&nbsp;<\/p>\n<p>Regardless of the reason or lack thereof, my goal here is to show you that imposing linearity by drawing a <em>linear<\/em> line of best fit produces a flat line, which one may hastily take as an indication of no association (see the straight blue line on the graph). 
A closer and more careful look, however, reveals the inverted-U-shaped pattern of the data points in the scatterplot: As the student numbers increase initially, so do the test scores. Then, as the student numbers continue to increase, the test scores start decreasing (see the curved red line following the data points much more closely than the flat blue one). This is clearly a pattern that should not be ignored in any serious, real-life study.<\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>A visual summary of the data and of any potential bivariate association, such as the scatterplot, is thus very useful. Scatterplots are in fact indispensable if one is to base their analysis on the assumption of a linear association between two continuous variables. Still, as in the previous two cases of two discrete variables and a discrete and a continuous variable, a numerical summary of the potential association can be of great help.<\/p>\n<p>&nbsp;<\/p>\n<p>For discrete variables we could examine and report differences of proportions, while for a discrete and a continuous variable we used differences of means (or medians). In both cases we could compare groups (on proportions, or means). In the case of continuous variables, we have neither groups, nor a convenient number to compare them on. Instead, here we have a correlation coefficient,\u00a0<em>Pearson&#8217;s r<\/em>. 
The correlation coefficient takes all data points simultaneously and summarizes to what extent certain values of one of the variables go with certain values of the other variable (i.e., whether they form a pattern or vary independently of each other).<\/p>\n<p>&nbsp;<\/p>\n<p>While we will examine the exact definition and calculation of Pearson&#8217;s <em>r<\/em> in Chapter 10 later, for now we&#8217;ll focus on its interpretation.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>The correlation coefficient <em>r<\/em> is a number between -1 and +1, indicating the strength of any possible (linear) association between two continuous variables.<\/strong> However, there is a catch: <strong>the strength of the association is given by the absolute value of <em>r<\/em>, while the \u00b1 sign is there to indicate whether the association is positive or negative<\/strong>. Thus, both <em>r<\/em>=-1 and <em>r<\/em>=1 stand for the strongest possible (i.e., perfect) correlation, the former a perfect <em>negative association<\/em>, the latter a perfect <em>positive association<\/em>. Between them is <em>r<\/em>=0, or no linear association.<\/p>\n<p>&nbsp;<\/p>\n<p>Perfect correlations (<em>r<\/em>=\u00b11) are very rare (if not non-existent)<a class=\"footnote\" title=\"The obvious exception here is the correlation of a variable on itself, which will produce r=1.\" id=\"return-footnote-976-5\" href=\"#footnote-976-5\" aria-label=\"Footnote 5\"><sup class=\"footnote\">[5]<\/sup><\/a>; most variables&#8217; associations fall somewhere between 0 and\u00a0\u00b11. <strong>The closer a correlation is to 0, the weaker it is; the closer the correlation is to -1 or +1, the stronger it is.<\/strong> Typically, in the social sciences a correlation of about <em>r<\/em>=\u00b10.7 would be considered strong, a correlation of about <em>r<\/em>=\u00b10.5 would be considered moderate, and a correlation of about <em>r<\/em>=\u00b10.3 would be considered weak. 
Correlations around \u00b10.8 or\u00a0\u00b10.9 would therefore be very strong, while associations around\u00a0\u00b10.2 and\u00a0\u00b10.1 would be quite weak.<\/p>\n<p>&nbsp;<\/p>\n<p>Now that you are well-equipped with knowledge about interpreting correlations, let&#8217;s see what the correlations of the associations discussed above were.<\/p>\n<p>&nbsp;<\/p>\n<p>First we looked at class attendance and test scores (Figures 7.3(A) and 7.3(B)); the correlation between the two variables was a very strong\u00a0<em>r<\/em>=0.881. Then, we looked at social media usage and test scores (Figures 7.4(A) and 7.4(B)), where the correlation was equally strong\u00a0<em>r<\/em>=-0.882<a class=\"footnote\" title=\"If you're wondering why the correlations appear to be of the same strength, the reason lies in the way I created the synthetic variable social media usage -- as an inversion of the simulated variable class attendance. I did warn you the data is made up as a heuristic. (Do not take this to mean that such associations -- between attendance and class performance and social media usage and test scores -- do not exist in real life.)\" id=\"return-footnote-976-6\" href=\"#footnote-976-6\" aria-label=\"Footnote 6\"><sup class=\"footnote\">[6]<\/sup><\/a>. Finally, we discussed the practically non-existent linear association between student number and test scores (of selected students, Figure 7.5) whose <em>r<\/em>=0.049, while the improperly imposed linearity in Figure 7.6 from the caveat had a similar so-weak-almost-zero linear correlation of <em>r<\/em>=-0.051.<\/p>\n<p>&nbsp;<\/p>\n<p>Tired of fake data? Ready to return to the real world of sociological research? 
Then let&#8217;s take a real example with existing data and see how it all works out.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\"><em>Example 7.5<\/em>\u00a0<em>Intergenerational Reproduction of Privilege in Education in the USA (GSS 2018)<\/em><\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>&nbsp;<\/p>\n<p>For this example I used data from the <em>General Social Survey (GSS) 2018<\/em>, conducted by the National Opinion Research Center (NORC) at the University of Chicago. I&#8217;m interested in exploring whether <em>father&#8217;s education<\/em> and the <em>education<\/em> of the respondent are potentially correlated. Both father&#8217;s education and education of the respondent are measured in years of schooling, ranging from 0 (no education) to 20 years. As such they are discrete ratio variables which we can treat as continuous because their number of values is quite large (twenty-one, to be precise). 
Figure 7.7 shows the relevant scatterplot.<\/p>\n<p>&nbsp;<\/p>\n<p><em>Figure 7.7\u00a0Respondent&#8217;s Years of Schooling<\/em><em style=\"text-indent: 1em;font-size: 1rem\">\u00a0by Father&#8217;s Years of Schooling (GSS, 2018)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-paeduc-educ-gss-line.png\" alt=\"\" width=\"462\" height=\"370\" class=\"wp-image-1034 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-paeduc-educ-gss-line.png 462w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-paeduc-educ-gss-line-300x240.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-paeduc-educ-gss-line-65x52.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-paeduc-educ-gss-line-225x180.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/scatterplot-paeduc-educ-gss-line-350x280.png 350w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>There are several things to note in the graph above. One is that the data points look less &#8220;scattered&#8221; and more orderly, arranged in neat rows and columns, than would be the case had we variables with a much larger number of values. Furthermore, while <em>N<\/em>=1,687, there are far fewer data points on the scatterplot: the reason, of course, is that there are many observations &#8220;on top&#8221; of each other, i.e., most data points represent more than one person&#8217;s combination of their own years of education and their respective father&#8217;s years of education. 
(After all, most such combinations are unlikely to be unique; we can arguably expect there to be more than one respondent and their father both having, say, 12 years of education in the dataset.)<\/p>\n<p>&nbsp;<\/p>\n<p>Substantively, however, what do we see in the scatterplot above? To the extent that there are respondents with low levels of education, they seem to have fathers with low levels of education too. As well, while respondents with higher levels of education seem to have fathers with all levels of education, those with higher parental education appear to be more numerous than those with lower parental education. (That is, both the left and the right side of the upper half of the scatterplot have many observations, but the top right area does seem to contain more observations than the top left area.) Finally, and most importantly, there seem to be almost no respondents with low levels of education whose fathers had high levels of education (note the empty bottom right area of the graph).<\/p>\n<p>&nbsp;<\/p>\n<p>All in all, it seems like more years of father&#8217;s education &#8220;go&#8221; with more years of respondent&#8217;s education, and fewer years of father&#8217;s education &#8220;go&#8221; with fewer years of respondent&#8217;s education, though not completely so: otherwise the top left area of the graph (the less educated fathers with more educated offspring) would be empty too. 
This is reflected in the line of best fit, whose slope, while positive, is not very steep.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Ultimately, the scatterplot indicates that <em>father&#8217;s education<\/em> and <em>respondent&#8217;s education<\/em> seem positively associated in the dataset but also that this association is not very strong.<\/strong> That is, there appears to be intergenerational reproduction of privilege in education; fortunately, however, a father&#8217;s lower level of education doesn&#8217;t seem to completely preclude one&#8217;s own educational attainment.<\/p>\n<p>&nbsp;<\/p>\n<p>The correlation coefficient provides a numerical summary of the potential association described above.<\/p>\n<p>&nbsp;<\/p>\n<p><em>Table 7.4<\/em>\u00a0<em>Correlation between Father&#8217;s Years of Schooling and Respondent&#8217;s Years of Schooling (GSS 2018)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/correlation-paeduc-educ-gss.png\" alt=\"\" width=\"492\" height=\"262\" class=\"wp-image-1035 size-full aligncenter\" srcset=\"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/correlation-paeduc-educ-gss.png 492w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/correlation-paeduc-educ-gss-300x160.png 300w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/correlation-paeduc-educ-gss-65x35.png 65w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/correlation-paeduc-educ-gss-225x120.png 225w, https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-content\/uploads\/sites\/564\/2019\/03\/correlation-paeduc-educ-gss-350x186.png 350w\" sizes=\"auto, (max-width: 492px) 100vw, 492px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>SPSS&#8217;s output provides <em>r<\/em>\u00a0as &#8220;Pearson Correlation&#8221;, and 
here <em>r<\/em>=0.413. As suspected, this reflects a positive, moderate to moderately weak association.<\/strong><a class=\"footnote\" title=\"Note that SPSS's bivariate correlation tables are 2x2 tables, with the information repeated twice. Thus, while four coefficients are provided in the central cells of the table, they are actually two pairs of the same two correlations. (That is, correlations are symmetric: correlating Variable 1 on Variable 2 is the same as correlating Variable 2 on Variable 1.) As well, one of these two pairs is always equal to 1, as a variable correlated on itself is a perfect correlation. This is shown in the table as corr(Highest year school completed, Highest years school completed, father)=0.413=(Highest year school completed, father, Highest years school completed) and corr(Highest year school completed, Highest year school completed)=1=(Highest year school completed, father, Highest year school completed, father).\" id=\"return-footnote-976-7\" href=\"#footnote-976-7\" aria-label=\"Footnote 7\"><sup class=\"footnote\">[7]<\/sup><\/a><\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<header><\/header>\n<header><\/header>\n<header class=\"textbox__header\">To summarize, you can describe and examine potential associations between continuous variables through scatterplots with lines of best fit (looking for a concordant or discordant pattern in the data points) and the coefficient of correlation <em>r<\/em> (ranging from 0 to\u00a0\u00b11 in strength, with 0 standing for no correlation and\u00a0\u00b11 constituting a perfect negative or a perfect positive correlation).<\/header>\n<header><\/header>\n<header><\/header>\n<header>Before we move on, the tip below shows how to get the visual and the numerical summary of continuous bivariate associations in SPSS.<\/header>\n<header><\/header>\n<header><\/header>\n<header>\n<div class=\"textbox textbox--key-takeaways\"><\/div>\n<\/header>\n<header class=\"textbox__header\">\n<p 
class=\"textbox__title\"><em>SPSS Tip 7.3\u00a0Scatterplot and Correlation Coefficient<\/em><\/p>\n<\/header>\n<p>&nbsp;<\/p>\n<p><strong>For Scatterplots:<\/strong><\/p>\n<ul>\n<li><span style=\"font-size: 1rem;text-indent: 0px\">From the <em>Main Menu<\/em> select <em>Graphs<\/em> and, from the pull-down menu, <em>Legacy Dialogs<\/em>; click on <em>Scatter\/Dot<\/em>;<\/span><\/li>\n<li>Keep the pre-selected <em>Simple Scatter<\/em> option and click <em>Define<\/em>;<\/li>\n<li>In the new window, select your variables of interest one by one from the list on the left and, using the arrow buttons, move them to the <em>X-Axis<\/em> and <em>Y-Axis<\/em><a class=\"footnote\" title=\"At this point, it doesn't really matter which one you put in the X- or Y-Axis though I would suggest placing the variable that precedes the other in time (like father's education generally precedes offspring's education) in the X-Axis. The reasons for this will be explained in Chapter 10.\" id=\"return-footnote-976-8\" href=\"#footnote-976-8\" aria-label=\"Footnote 8\"><sup class=\"footnote\">[8]<\/sup><\/a> empty spaces on the right; click <em>OK<\/em>.<\/li>\n<li>The <em>Output<\/em> window will show the resulting scatterplot; double-clicking on it will open a <em>Chart Editor<\/em> window from where you can change the text, colours, size, etc. 
of the graph to suit your needs.<\/li>\n<\/ul>\n<p><strong>For the correlation coefficient (Pearson&#8217;s<em> r<\/em>):<\/strong><\/p>\n<ul>\n<li>From the <em>Main Menu<\/em>, select <em>Analyze<\/em>;<\/li>\n<li>From the pull-down menu, select <em>Correlate<\/em> and then <em>Bivariate<\/em>;<\/li>\n<li>In the resulting window, select your two variables of interest one at a time from the list on the left and, using the arrow button, move them to the <em>Variables<\/em> space on the right (the order is not important); click <em>OK<\/em>.<\/li>\n<li>The <em>Output<\/em> window will display a symmetric 2&#215;2 table with your requested correlation coefficient.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-976-1\">Associations other than linear ones exist, e.g., <em>curvilinear<\/em> (imagine U-shaped or inverted U-shaped <em>curves<\/em> in the data, instead of a straight line). Analyzing these is more complicated and beyond the scope of this book. The discussion hereafter will consider only bivariate linear associations, regardless of whether I mention it explicitly or not. <a href=\"#return-footnote-976-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><li id=\"footnote-976-2\">The simulated data used here for illustration purposes only is provided by DataBake (www.databake.io). [see terms of use 3.6, 3.7: \"(free) datasets can be copied, modified, stored or otherwise used for your own personal, academic, or internal business purposes\"] <a href=\"#return-footnote-976-2\" class=\"return-footnote\" aria-label=\"Return to footnote 2\">&crarr;<\/a><\/li><li id=\"footnote-976-3\">The data is called <em>simulated<\/em> as it's computer-generated for the purposes of the exercise. 
<a href=\"#return-footnote-976-3\" class=\"return-footnote\" aria-label=\"Return to footnote 3\">&crarr;<\/a><\/li><li id=\"footnote-976-4\">We discuss the line of best fit (aka regression line) in Chapter 10 later. <a href=\"#return-footnote-976-4\" class=\"return-footnote\" aria-label=\"Return to footnote 4\">&crarr;<\/a><\/li><li id=\"footnote-976-5\">The obvious exception here is the correlation of a variable on itself, which will produce <em>r<\/em>=1. <a href=\"#return-footnote-976-5\" class=\"return-footnote\" aria-label=\"Return to footnote 5\">&crarr;<\/a><\/li><li id=\"footnote-976-6\">If you're wondering why the correlations appear to be of the same strength, the reason lies in the way I created the synthetic variable <em>social media usage<\/em> -- as an inversion of the simulated variable <em>class attendance<\/em>. I did warn you the data is made up as a heuristic. (Do not take this to mean that such associations -- between attendance and class performance and social media usage and test scores -- do not exist in real life.)  <a href=\"#return-footnote-976-6\" class=\"return-footnote\" aria-label=\"Return to footnote 6\">&crarr;<\/a><\/li><li id=\"footnote-976-7\">Note that SPSS's bivariate correlation tables are 2x2 tables, with the information repeated twice. Thus, while four coefficients are provided in the central cells of the table, they are actually two pairs of the same two correlations. (That is, correlations are symmetric: correlating <em>Variable 1<\/em> on <em>Variable 2<\/em> is the same as correlating <em>Variable 2<\/em> on <em>Variable 1<\/em>.) As well, one of these two pairs is always equal to 1, as a variable correlated on itself is a perfect correlation. 
This is shown in the table as <em>corr<\/em>(<em>Highest year school completed, Highest years school completed, father<\/em>)=0.413=(<em>Highest year school completed, father, Highest years school completed<\/em>) and <em>corr<\/em>(<em>Highest year school completed, Highest year school completed<\/em>)=1=(<em>Highest year school completed, father, Highest year school completed, father<\/em>). <a href=\"#return-footnote-976-7\" class=\"return-footnote\" aria-label=\"Return to footnote 7\">&crarr;<\/a><\/li><li id=\"footnote-976-8\">At this point, it doesn't really matter which one you put in the X- or Y-Axis though I would suggest placing the variable that precedes the other in time (like father's education generally precedes offspring's education) in the X-Axis. The reasons for this will be explained in Chapter 10. <a href=\"#return-footnote-976-8\" class=\"return-footnote\" aria-label=\"Return to footnote 8\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":533,"menu_order":5,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-976","chapter","type-chapter","status-publish","hentry"],"part":34,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/976","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/users\/533"}],"version-history":[{"count":25,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/976\/revisions"}],"predecessor-version":[{"id":2140,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/976\/r
evisions\/2140"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/parts\/34"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapters\/976\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/media?parent=976"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/pressbooks\/v2\/chapter-type?post=976"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/contributor?post=976"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/simplestats\/wp-json\/wp\/v2\/license?post=976"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}