{"id":504,"date":"2021-08-13T00:35:37","date_gmt":"2021-08-13T04:35:37","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/statspsych\/?post_type=chapter&#038;p=504"},"modified":"2022-06-04T03:30:10","modified_gmt":"2022-06-04T07:30:10","slug":"chapter-10","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/statspsych\/chapter\/chapter-10\/","title":{"raw":"10. Correlation and Regression","rendered":"10. Correlation and Regression"},"content":{"raw":"<h1>10a. Correlation<\/h1>\r\nThis chapter marks a big shift from the inferential techniques we have learned to date. Here we will be looking at relationships between two numeric variables, rather than analyzing the differences between the means of two or more experimental groups.\r\n\r\n<strong>[pb_glossary id=\"505\"]Correlation[\/pb_glossary]<\/strong> is used to test the direction and strength of the relationships between two numerical variables. We will see how scatterplots can be used to plot variable X against variable Y to detect linear relationships. The slope of the linear relationship can be positive or negative, which reveals systematic patterns in how the two variables co-relate. We will also look at the theory of <strong>correlational<\/strong> analysis, including some cautions around interpreting the results of <strong>correlational<\/strong> analyses. Thanks to the third variable problem, <strong>correlation<\/strong> does NOT equal causation, a mantra that should be familiar from your introductory psychology courses. And finally, we will try calculating correlation by partitioning <strong>[pb_glossary id=\"507\"]covariance[\/pb_glossary]<\/strong>, and put it all into practice in a hypothesis test. 
Later in the chapter, we will build in <strong>[pb_glossary id=\"509\"]regression[\/pb_glossary]<\/strong>, which allows us to predict the future from the past.\r\n\r\nJust as a bar graph helps us visually examine the differences among means, a scatterplot allows us to visualize the pattern that represents the relationship between two numeric variables, X vs. Y.\r\n\r\nIf the trend line that best indicates the linear pattern in the scatter plot has an upward slope, we consider that a positive directionality.<img class=\"wp-image-512 alignright\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-1024x856.jpg\" alt=\"\" width=\"347\" height=\"290\" \/>\r\n\r\nTo find out if there appears to be a positive correlation, you can ask yourself \u201care those that score high on one variable likely to score high on the other?\u201d Here we see an example: what is the relationship between feline friendliness and number of scritches received? As you can see, when cat friendliness is high, the number of scritches received is also high. There is a clear positive trend line. This makes sense \u2013 people may be more likely to offer cuddles to a cat that solicits them.\r\n\r\n<img class=\" wp-image-513 alignleft\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-1024x878.jpg\" alt=\"\" width=\"327\" height=\"280\" \/>\r\n\r\nA downward slope indicates a negative directionality. To find out if there appears to be a negative correlation, you can ask yourself \u201care those that score high on one variable likely to score low on the other?\u201d Here you can see an example: what is the relationship between feline aloofness and the number of scritches received? There is a clear negative trend line. 
This makes logical sense, because people may be less likely to offer cuddles to a cat that keeps to itself.\r\n\r\nWhen we look at a scatter plot, we want to ask ourselves two questions: one about the apparent strength of the relationship between the variables, and the other about the direction of the relationship. Let us take a look at a few examples.\r\n\r\n<img class=\"aligncenter wp-image-515\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-1024x879.jpg\" alt=\"\" width=\"473\" height=\"406\" \/>\r\n\r\nIn graph A), if we ask \u201care variables X and Y strongly or weakly related?\u201d We would say strongly related. This is because the points on the scatter plot are in a perfect line. There is no distance between the points and the trend line. It is a perfect <strong>correlation<\/strong>. If we ask \u201cis the trend line positive or negative in slope?\u201d We would say that it is negative in slope. As scores on variable X increase, scores on variable Y do the opposite \u2013 they decrease. We might expect such a relationship if we plotted speed against time. The faster something is, the less time it takes. In the next example, graph B), if we ask \u201care variables X and Y strongly or weakly related?\u201d We would say weakly related. There is no clear linear trend that can be visually discerned \u2013 it just looks like a random scatter of dots. With no trend line, the question about directionality is irrelevant. This <strong>correlation<\/strong> is close to zero, so it is neither positive nor negative as a directional relationship. If we look at example C), the strength is not quite as perfect as in the first example, but the dots would not be very distant from a trend line through them, so this would be a fairly strong <strong>correlation<\/strong>. As scores on variable X go up, so do those on variable Y, making this a positive <strong>correlation<\/strong>. Now it is your turn. 
In example D), would this be a strong relationship or weak? Or somewhere in between? And do you see a positive or a negative slope to a trendline that runs through the cloud?\r\n\r\n[h5p id=\"79\"]\r\n\r\n<strong>Correlational<\/strong> analysis seeks to answer the question \u201chow closely related are two variables?\u201d This is a very useful analytical approach when we have two numeric variables and we wish to analyze the patterns in how they co-vary. However, <strong>correlational<\/strong> analyses have limitations that it is vital to be aware of.\r\n\r\nFirst, the <strong>correlational<\/strong> method we will cover in this course is only capable of detecting linear relationships. Patterns that have a curve to them will not be captured by the <strong>correlation<\/strong> formula we will use.\r\n\r\nSecondly, <strong>correlation<\/strong> does <em>not<\/em> equal causation. <strong>Correlational<\/strong> research designs do not allow for causal interpretations, because the third variable problem renders <strong>correlational<\/strong> analyses vulnerable to spurious results. When we measure two variables at the same time and plot them against each other, what we can do is <em>describe<\/em> their relationship. We can even test whether the strength of their relationship is significantly different from zero. However, we cannot determine whether X <em>causes<\/em> Y.\r\n\r\n<img class=\"wp-image-517 alignright\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-1024x869.jpg\" alt=\"\" width=\"358\" height=\"304\" \/>For example, if we measured the consumption of ice cream as well as drowning deaths on a sample of days throughout the year, we might determine that there is a strong positive relationship between the two variables. Consumption of ice cream and drowning deaths are apparently closely related phenomena. But does consumption of ice cream cause drowning deaths? That seems a little far-fetched. 
Could there be another explanation for the pattern? Is there a third variable that could in fact explain the trends in each of the two variables measured here? What might cause people to consume more ice cream as well as put themselves at greater risk for drowning? Warm weather perhaps? If we were to plot temperature against ice cream consumption and drowning deaths, would we see positive correlations with each? Very likely. With this third variable connecting the two, it would be a logical fallacy to interpret the apparent <strong>correlation<\/strong> shown here as meaningful. But then again, is it possible that consuming ice cream could be a risk factor for drowning? Did an elder ever tell you that you should not eat right before swimming, because you might cramp up and drown? Maybe there is some truth to that.\r\n\r\nSo how could we find out whether there is a true causal relationship between two variables? In order to make cause-effect conclusions, we must use an experimental design. Two major features of experimental research designs eliminate the logical fallacies associated with <strong>correlations<\/strong>. First, an experiment makes use of random assignment of participants to conditions, because that controls for extraneous variables like the third variable of temperature in this example. 
And secondly, an experiment manipulates the independent variable, to establish a cause, and then measures effects.\r\n<div class=\"textbox textbox--learning-objectives\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\">Requirements for cause-effect conclusions<\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\nA true experiment requires the following elements in order to control for extraneous variables and establish cause-effect directionality:\r\n<ul>\r\n \t<li>random assignment of participants to conditions (or randomization of order of conditions in repeated measures designs)<\/li>\r\n \t<li>manipulation of the independent variable<\/li>\r\n<\/ul>\r\n<\/div>\r\n<\/div>\r\nIn our ice cream and drowning study here, how could we make it into an experiment, to allow for causal conclusions? First, we would have to assign our participants randomly into the experimental and control groups. There must be no systematic bias in who is given ice cream and who is not. Second, we would have to manipulate the independent variable \u2013 we would have to have those participants in the experimental group eat ice cream. Then we would put all participants in water, at the same temperature, and see how many of them drown. We would calculate the average number of drowning events in the ice-cream-eating vs. the control group, and run a t-test or ANOVA to find out if they are significantly different from each other.\r\n\r\nOf course, you might be thinking, \u201cwould this be ethical?\u201d At least I hope you are thinking that. Of course not! It would not make sense to allow people to drown, just to answer this empirical question. 
In fact, that is exactly why <strong>correlation<\/strong> exists.\r\n\r\n[caption id=\"attachment_521\" align=\"alignleft\" width=\"300\"]<img class=\"wp-image-521 size-medium\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-300x261.png\" alt=\"\" width=\"300\" height=\"261\" \/> <em>\"Herp Derp :D\"\u00a0by\u00a0O hai :3\u00a0is licensed under\u00a0CC BY 2.0<\/em>[\/caption]\r\n\r\nOften, practical or ethical limitations make an experiment prohibitively difficult or impossible. If we are limited to <strong>correlational<\/strong> techniques in a particular research study, then we simply cannot draw cause-effect inferences.\r\n\r\nSo, a major take-home point of this lesson is\u2026 don\u2019t be like this guy.\r\n\r\n&nbsp;\r\n\r\n[h5p id=\"80\"]\r\n\r\nOkay, so how do we go about calculating <strong>correlation<\/strong>? Well, similar to ANOVA, we can think of the process conceptually as the partitioning of variance. But this time, what counts as good variance is <strong>covariance<\/strong>. This is the systematic variance that both variables X and Y have in common. Because it is variance that is explained by the co-relation of the two variables, we will put <strong>covariance<\/strong> in the \u201cgood\u201d bucket. The random variance that is unexplained by the relationship between X and Y, the distance between the dots and the trend line, that is the variance that we will put in the \u201cbad\u201d bucket. 
A conceptual formula for the correlation coefficient <strong>[pb_glossary id=\"523\"]r[\/pb_glossary]<\/strong> would be covariability of X and Y divided by the variability of X and Y separately.\r\n\r\n<img class=\"aligncenter wp-image-527\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.6.png\" alt=\"Conceptual formula for r\" width=\"438\" height=\"81\" \/>\r\n\r\nOnce we find <strong>r<\/strong>, another statistic that provides helpful information is <strong>[pb_glossary id=\"524\"]r squared[\/pb_glossary]<\/strong>. <strong>r<sup>2<\/sup><\/strong> is the proportion of variability in one variable that can be explained by the relationship with the other variable. Make note of this fact, because the proportion of variability explained by a <strong>correlation<\/strong> is a very helpful metric.\r\n\r\n&nbsp;\r\n\r\n[h5p id=\"82\"]\r\n\r\nNow we can examine what form a hypothesis test would take in the context of a <strong>correlational<\/strong> research design. Such a hypothesis test asks the question, \u201chow unlikely is it that the <strong>correlation<\/strong> coefficient is actually zero?\u201d\r\n\r\nIn step 1, in order to keep the hypothesis in a form similar to what we did before, we can identify the populations in a particular way. Population 1 will be \"people like those in the sample,\" and population 2 will be \"people who show no relationship between the variables.\" That way, the research hypothesis can be set up as \"The correlation for population 1 is [greater than\/less than\/different from] the correlation for population 2.\" The null hypothesis can be \"The correlation for population 1 is the same as the correlation for population 2.\"\r\n\r\nIn step 2, we need to find the characteristics of the comparison distribution, and in this case we need the correlation coefficient <strong>r<\/strong>, which can range from -1 to 1. 
An <strong>r<\/strong> value of 0 indicates there is no <strong>correlation<\/strong> whatsoever between the two measured variables. An <strong>r<\/strong> of 1 is a perfect positive <strong>correlation<\/strong>, and an <strong>r<\/strong> of -1 is a perfect negative correlation. Most <strong>correlations<\/strong> in real life fall closer to 0 than to 1 or -1.\r\n\r\n[latex] \\[r=\\frac{\\sum (Z_{X}\\times Z_{Y})}{N}\\] [\/latex]\r\n\r\nThis correlation coefficient formula makes use of Z-scores, which is a great way to review these standardized scores covered in an earlier chapter.\u00a0 Recall that\r\n\r\n[latex] \\[Z=\\frac{X-M}{SD}\\] [\/latex] , where [latex] \\[SD=\\sqrt{\\frac{\\sum (X-M)^{2}}{N}}\\] [\/latex]\r\n\r\nFor each variable, X and Y, we must calculate the mean and standard deviation of the variable, so each score can be translated to a Z-score. Only then can they be cross-multiplied and then summed in the <strong>r<\/strong> formula.\r\n\r\nOnce we calculate the <strong>r<\/strong> value for a <strong>correlation<\/strong>, we can test the statistical significance of this value, based on how extreme it is on the t distribution. An <strong>r<\/strong> of 0 is placed in the centre of the t distribution, as the comparison distribution mean, and positive one and negative one are placed at either tail of the distribution.\r\n\r\n<img class=\"aligncenter wp-image-530\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-1024x523.jpg\" alt=\"\" width=\"489\" height=\"250\" \/>\r\n\r\nThe further out we get into the appropriate tail, the better our chance of rejecting the null hypothesis of a zero correlation. The bad news is, we are back to the t-test, which means we have to think about directionality. 
The good news is, this is a great opportunity to refresh ourselves on how the t-test works.\r\n\r\nIn step 3, we find the cutoff score using the <a href=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/back-matter\/t-distribution-tables\/\" target=\"_blank\" rel=\"noopener\">t tables<\/a>. For correlation, the degrees of freedom will be <em>N<\/em>-2, where <em>N<\/em> is the number of people in the sample. This is because we have two measured (numeric) variables, each of which has <em>N<\/em>-1 scores that are free to vary.\r\n\r\nIn step 4, the t-test is calculated as <strong>r<\/strong> divided by <em>S<sub>r<\/sub><\/em>, where <em>S<sub>r<\/sub><\/em>\u00a0quantifies the unexplained variability.\r\n\r\n[latex] \\[t=\\frac{r}{S_{r}}\\] [\/latex]\r\n\r\nStep 5 is the decision: we reject the null hypothesis if the t-test result falls in the shaded tail beyond the cutoff.\r\n\r\n<img class=\"aligncenter wp-image-400\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-1024x631.jpg\" alt=\"\" width=\"403\" height=\"249\" \/>\r\n\r\nWe could express our hypothesis test results on the relationship between income and grades in the following manner:\r\n\r\n<span class=\"pullquote-left\">\u201cWe found that there was a significant positive correlation between family income and student grade average (r = 0.65, t<sub>11<\/sub> = 2.97, p &lt; 0.05).\u201d<\/span>\r\n\r\nNotice that our interpretation is <em>not<\/em> that higher family income results in a higher grade average. Why not? Well, as we said before, causal conclusions require experimental design. To draw such a conclusion regarding the relationship between family income and student grade average, we would need to randomly assign students into family income conditions, wealthy or poor, then measure the effects of that manipulation on their grades. Just like our drowning example, this seems not only logistically challenging, but also rather unethical. 
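To make the five steps above concrete, here is a minimal sketch of the whole calculation in Python, using a small hypothetical dataset (not the textbook's actual income and grade numbers). Note one assumption: the chapter writes t = r \/ S_r without spelling out S_r, so the standard formulation S_r = sqrt((1 - r^2) \/ (N - 2)) is used here.

```python
import math

# Hypothetical data: family income (X) and average grade (Y) for N = 13 people.
X = [40, 55, 60, 70, 80, 90, 100, 110, 125, 140, 150, 170, 190]
Y = [72, 70, 75, 74, 78, 80, 79, 83, 82, 86, 85, 90, 92]
N = len(X)

def mean(v):
    return sum(v) / len(v)

def sd(v):
    # Population SD, as in the chapter: sqrt(sum((X - M)^2) / N)
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))

def z_scores(v):
    # Z = (X - M) / SD
    m, s = mean(v), sd(v)
    return [(x - m) / s for x in v]

# r = sum(Zx * Zy) / N, then r squared (proportion of variability explained)
Zx, Zy = z_scores(X), z_scores(Y)
r = sum(zx * zy for zx, zy in zip(Zx, Zy)) / N
r2 = r ** 2

# Step 3: degrees of freedom for correlation
df = N - 2

# Step 4: t = r / S_r (S_r formula assumed, see note above)
S_r = math.sqrt((1 - r2) / df)
t = r / S_r

# Step 5: compare t against the two-tailed 0.05 cutoff from the t table (df = 11)
cutoff = 2.201
print(f"r = {r:.3f}, r^2 = {r2:.3f}, t({df}) = {t:.2f}, reject H0: {abs(t) > cutoff}")
```

In practice, a library routine such as scipy.stats.pearsonr performs this calculation (and returns a p value) in a single call; the hand-rolled version above simply mirrors the chapter's formulas.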
So, we are limited to <strong>correlation<\/strong> here for a reason, and thus we simply need to characterize our findings as a relationship or pattern, rather than a statement of cause and effect.\r\n\r\nAs we add the final branch to our decision tree, we now have a decision flow for the situation of no independent variables. If both variables are numerical, you must use correlation to test their relationship.\r\n\r\n<img class=\"aligncenter size-large wp-image-534\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-1024x376.jpg\" alt=\"\" width=\"1024\" height=\"376\" \/>\r\n<h1>10b. Regression<\/h1>\r\nIn the next part of the chapter, we will examine the statistical technique of <strong>regression<\/strong>. <strong>Regression<\/strong> allows us to extend the findings of a <strong>correlation<\/strong> to predict the future from the past.\r\n\r\nOnce we have calculated a <strong>correlation<\/strong>, a <strong>regression<\/strong> allows us to predict how an individual would perform on one variable based on their performance on another variable. In our example of a correlation between income and grades, the <strong>regression<\/strong> would allow us to see what grade level would be achieved by an individual with a family income level that was not actually collected in our dataset. We could also identify the income level based on a given grade level.\r\n\r\nThe <strong>regression<\/strong> line is a line through our scatter plot that can be described with an equation. The equation has two components: slope and intercept. The slope says how many units up (or down) the line goes for each unit over.\u00a0 The intercept says where the line hits the y axis.\r\n\r\nThe <strong>regression<\/strong> line is a line that \u201cbest fits\u201d the data points that we have collected. Mathematically, it is the line that minimizes the squared deviations (i.e. 
error) of the individual points from the line.\r\n\r\n<img class=\"aligncenter wp-image-536\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.9-.jpg\" alt=\"\" width=\"342\" height=\"292\" \/>\r\n\r\nTo find the equation for the <strong>regression<\/strong> line, you can calculate the slope <em>b<\/em> and then the intercept <em>a<\/em> using the formulas shown.\r\n\r\n[latex] \\[b=\\frac{\\sum (X-M_{X})(Y-M_{Y})}{SS_{X}}\\] [\/latex]\r\n\r\n&nbsp;\r\n<div class=\"textbox textbox--exercises\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\">Steps to find b, slope of regression line<\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<ol>\r\n \t<li>\r\n<div>For each individual, find the deviation of the X score from the mean.*<\/div><\/li>\r\n \t<li>\r\n<div>For each individual, find the deviation of the Y score from the mean.*<\/div><\/li>\r\n \t<li>\r\n<div>For each individual, multiply the deviation of X by the corresponding deviation of Y.<\/div><\/li>\r\n \t<li>\r\n<div>Add together the products from step 3 for all individuals.<\/div><\/li>\r\n \t<li>\r\n<div>Divide this sum by SS<sub>x<\/sub>.*<\/div>\r\n<span style=\"font-size: 1em\">*These calculations should already be completed for correlation.<\/span><\/li>\r\n<\/ol>\r\n<\/div>\r\n<\/div>\r\n[latex] \\[a=M_{Y}-(b)M_{X}\\] [\/latex]\r\n\r\nOnce <em>a<\/em> and <em>b<\/em> are calculated, we can plug these numbers into the <strong>regression<\/strong> line equation.\r\n\r\n[latex] \\[\\widehat{Y}=a+b(X)\\] [\/latex]\r\n\r\nHere I will show you the regression line equation for our family income vs. 
grade example.\r\n\r\n<img class=\"aligncenter wp-image-537\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-1024x877.jpg\" alt=\"\" width=\"489\" height=\"419\" \/>\r\n\r\n<em>b<\/em> is 0.11, which means that for every one unit of Family Income, the line goes up 0.11 units of Average Grade. <em>a<\/em> is 77.96, which means that the line meets the y axis at a height of 77.96.\r\n\r\n&nbsp;\r\n\r\n[h5p id=\"83\"]\r\n\r\nThe line equation allows us to plot the precise <strong>regression<\/strong> line on the scatter plot. To plot a regression line, pick two X values that are on the low and the high end of the scale. Plug those into the line equation to find the corresponding Y values that are on the line.\r\n\r\nUsing the regression line, you can predict X values from Y values and Y values from X values. This means that even if you did not have someone in your dataset with a family income of 105, you can figure out what a student\u2019s average grade would have been if they had that family income. Likewise, if you had no one in your dataset with an average grade of 75, you can figure out what their family income would have been if they had that grade. Note that these are just predictions. They are imperfect, and do not take into account other factors or individual variability.\r\n\r\nHere we will try predicting the average grade (Y) for a student who has a family income of 200. To do this, we will plug 200 in for X in the regression line equation (as shown here).\r\n\r\n[latex] \\[\\widehat{Y}=77.96 +0.11(X) \\] [\/latex]\r\n\r\n[latex] \\[\\widehat{Y}=77.96 +0.11(200) \\] [\/latex]\r\n\r\n[latex] \\[\\widehat{Y}=100.55 \\] [\/latex]\r\n\r\n<img class=\"aligncenter wp-image-538\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-1024x863.jpg\" alt=\"\" width=\"491\" height=\"414\" \/>\r\n\r\nThe result is a grade of 100.55 (the slope is shown rounded to 0.11; the unrounded slope, roughly 0.113, is what produces this value). 
Of course getting a grade average above 100% is impossible (at least at many institutions). In this case, our prediction shows a \u201cceiling effect\u201d. This means that there is a maximum average grade that we hit before we hit a maximum family income. Therefore, the <strong>regression<\/strong> line equation becomes useless above a family income of around 190.\r\n\r\nNow, we can try predicting family income (X) for a student with an average grade of 60 (Y). To do this, you must plug in 60 for Y in the equation, then solve for X.\r\n\r\n[latex] \\[\\widehat{Y}=77.96 +0.11(X) \\] [\/latex]\r\n\r\n[latex] \\[60=77.96 +0.11(X) \\] [\/latex]\r\n\r\nNotice that to rearrange the equation to solve for X, you first have to move the intercept (a) over:\r\n\r\n[latex] \\[(60-77.96)=0.11(X) \\] [\/latex]\r\n\r\nThen you have to divide by the slope:\r\n\r\n[latex]\\[\\frac{(60-77.96)}{0.11}=X\\] [\/latex]\r\n\r\nNow you are ready to solve for X: approximately -159.\u00a0The result of finding X for the Y of 60 is a negative income!\u00a0This is, of course, impossible (or very unlikely). Here we can see the floor effect.\r\n\r\n<img class=\"aligncenter wp-image-535\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-1024x586.jpg\" alt=\"\" width=\"665\" height=\"381\" \/>\r\n\r\nThis means that there is a minimum family income that we reach before reaching the minimum grade. So the <strong>regression<\/strong> line becomes useless below an average grade of 77.96 (the Y intercept). Floor and ceiling effects are common problems for <strong>regression<\/strong>, and you should watch out for these problems when you use this technique. We can see that the <strong>regression<\/strong> line for this particular dataset is useful for making predictions within the range of about 80-100 for average grade and 0-190 for income level.\r\n\r\n[h5p id=\"81\"]\r\n\r\nNow of course, predictions are not perfect. 
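The slope, intercept, and both directions of prediction walked through above can be sketched in Python. This is a minimal illustration with hypothetical numbers chosen for clean arithmetic, not the textbook's actual income and grade dataset:

```python
# Hypothetical data: family income (X) and average grade (Y).
X = [40, 60, 80, 100, 120, 140, 160, 180]
Y = [78, 81, 83, 88, 90, 94, 95, 99]

Mx = sum(X) / len(X)  # 110.0
My = sum(Y) / len(Y)  # 88.5

# Slope: b = sum((X - Mx)(Y - My)) / SS_x
SP = sum((x - Mx) * (y - My) for x, y in zip(X, Y))
SSx = sum((x - Mx) ** 2 for x in X)
b = SP / SSx          # 0.15

# Intercept: a = My - b * Mx
a = My - b * Mx       # 72.0

def predict_y(x):
    # Y-hat = a + b * X
    return a + b * x

def predict_x(y):
    # Solve Y = a + b * X for X: move the intercept over, divide by the slope
    return (y - a) / b

print(f"Y-hat = {a:.2f} + {b:.2f}(X)")
print(f"Predicted grade for income 105: {predict_y(105):.2f}")  # 87.75
print(f"Predicted income for grade 85: {predict_x(85):.2f}")
```

As in the chapter's own example, such predictions are only trustworthy within the range where the model makes sense; floor and ceiling effects bound where the line is usable.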
<strong>Regression<\/strong> allows for a prediction of one variable from another variable. As we can see in our scatterplots, not every real data point falls exactly on the regression line. Why is that? Because, unless it\u2019s a perfect <strong>correlation<\/strong>, some variability in the real data is not accounted for by the regression equation. We can estimate just how accurate our predictions are by looking at <strong>r squared<\/strong>. <strong>r<sup>2<\/sup><\/strong> is the proportion of variance in one variable explained by its relationship with the other variable. The rest is the amount that is not accounted for.\r\n\r\nJust as we can include multiple factors in ANOVA, we can also include multiple predictive variables in a <strong>regression<\/strong>. We will not attempt that in this course, but if you take a more advanced statistics course, you will see that the more variables you include, each explaining a piece of the variability in the criterion variable, the more precise your <strong>regression<\/strong> model will become. Here, we are using just one predictive variable, and our <strong>r<sup>2<\/sup><\/strong> is likely to be well shy of 100% explained variance. So in that case, we can expect our <strong>regression<\/strong> to be only modestly accurate.\r\n<h1>Chapter Summary<\/h1>\r\nThis chapter introduced you to the statistical techniques of <strong>correlation<\/strong> and <strong>regression<\/strong>. We saw how we can detect and describe the strength and direction of the relationship between two numeric variables, and how to run a hypothesis test to find out if the <strong>correlation<\/strong> is significantly different from zero. Finally, we saw that <strong>regression<\/strong> can generate a linear model allowing for the prediction of one variable from the other. A key reminder: <strong>correlation<\/strong> does <em>not<\/em> equal causation. 
These techniques suit research designs that do not meet the requirements of experimental design, and as such, our conclusions regarding the statistical findings must avoid cause-effect language.\r\n\r\nKey terms:\r\n<table class=\"no-lines\" style=\"border-collapse: collapse;width: 100%\" border=\"0\">\r\n<tbody>\r\n<tr>\r\n<td style=\"width: 33.3333%\"><strong>correlation<\/strong><\/td>\r\n<td style=\"width: 33.3333%\"><strong>regression<\/strong><\/td>\r\n<td style=\"width: 33.3333%\"><strong>r squared<\/strong><\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 33.3333%\"><strong>covariance<\/strong><\/td>\r\n<td style=\"width: 33.3333%\"><strong>r<\/strong><\/td>\r\n<td style=\"width: 33.3333%\"><\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>","rendered":"<h1>10a. Correlation<\/h1>\n<p>This chapter marks a big shift from the inferential techniques we have learned to date. Here we will be looking at relationships between two numeric variables, rather than analyzing the differences between the means of two or more experimental groups.<\/p>\n<p><strong><a class=\"glossary-term\" aria-haspopup=\"dialog\" aria-describedby=\"definition\" href=\"#term_504_505\">Correlation<\/a><\/strong> is used to test the direction and strength of the relationships between two numerical variables. We will see how scatterplots can be used to plot variable X against variable Y to detect linear relationships. The slop of the linear relationship can be positive or negative, which reveals systematic patterns in how the two variables co-relate. We will also look at the theory of <strong>correlational<\/strong> analysis, including some cautions around interpreting the results of <strong>correlational<\/strong> analyses. Thanks to the third variable problem, <strong>correlation<\/strong> does NOT equal causation, a mantra that should be familiar from your introductory psychology courses. 
And finally, we will try calculating correlation by partitioning <strong><a class=\"glossary-term\" aria-haspopup=\"dialog\" aria-describedby=\"definition\" href=\"#term_504_507\">covariance<\/a><\/strong>, and put it all into practice in a hypothesis test. Later in the chapter, we will build in <strong><a class=\"glossary-term\" aria-haspopup=\"dialog\" aria-describedby=\"definition\" href=\"#term_504_509\">regression<\/a><\/strong>, which allows us to predict the future from the past.<\/p>\n<p>Just like a bar graph is helpful to examine visually the differences among means, a scatterplot allows us to visualize the pattern that represents the relationship between two numeric variables, X vs. Y.<\/p>\n<p>If the trend line that best indicates the linear pattern in the scatter plot has an upward slope, we consider that a positive directionality.<img loading=\"lazy\" decoding=\"async\" class=\"wp-image-512 alignright\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-1024x856.jpg\" alt=\"\" width=\"347\" height=\"290\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-1024x856.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-300x251.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-768x642.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-65x54.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-225x188.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1-350x293.jpg 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.1.jpg 1072w\" sizes=\"auto, (max-width: 347px) 100vw, 347px\" \/><\/p>\n<p>To find out if there appears 
to be a positive correlation, you can ask yourself \u201care those that score high on one variable likely to score high on the other?\u201d Here we see an example: what is the relationship between feline friendliness and number of scritches received? As you can see, when cat friendliness is high, the cuddles received is also high. There is a clear positive trend line. This make sense \u2013 people may be more likely to offer cuddles to a cat that solicits them.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-513 alignleft\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-1024x878.jpg\" alt=\"\" width=\"327\" height=\"280\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-1024x878.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-300x257.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-768x658.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-65x56.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-225x193.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2-350x300.jpg 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.2.jpg 1058w\" sizes=\"auto, (max-width: 327px) 100vw, 327px\" \/><\/p>\n<p>A downward slope indicates a negative directionality. To find out if there appears to be a negative correlation, you can ask yourself \u201care those that score high on one variable likely to score low on the other?\u201d Here you can see an example: what is the relationship between feline aloofness and the number of scritches received? There is a clear negative trend line. 
This makes logical sense, because people may be less likely to offer cuddles to a cat that keeps to itself.<\/p>\n<p>When we look at a scatter plot, we want to ask ourselves two questions: one about the apparent strength of the relationship between the variables, and the other about the direction of the relationship. Let us take a look at a few examples.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-515\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-1024x879.jpg\" alt=\"\" width=\"473\" height=\"406\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-1024x879.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-300x257.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-768x659.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-65x56.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-225x193.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3-350x300.jpg 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.3.jpg 1528w\" sizes=\"auto, (max-width: 473px) 100vw, 473px\" \/><\/p>\n<p>In graph A), if we ask \u201care variables X and Y strongly or weakly related?\u201d we would say strongly related. This is because the points on the scatter plot are in a perfect line. There is no distance between the points and the trend line. It is a perfect <strong>correlation<\/strong>. If we ask \u201cis the trend line positive or negative in slope?\u201d we would say that it is negative in slope. As scores on variable X increase, scores on variable Y do the opposite \u2013 they decrease. 
We might expect such a relationship if we plotted speed against time. The faster something is, the less time it takes. In the next example, graph B), if we ask \u201care variables X and Y strongly or weakly related?\u201d we would say weakly related. There is no clear linear trend that can be visually discerned \u2013 it just looks like a random scatter of dots. With no trend line, the question about directionality is irrelevant. This <strong>correlation<\/strong> is close to zero, so it is neither positive nor negative as a directional relationship. If we look at example C), the strength is not quite as perfect as in the first example, but the dots would not be very distant from a trend line through them, so this would be a fairly strong <strong>correlation<\/strong>. As scores on variable X go up, so do those on variable Y, making this a positive <strong>correlation<\/strong>. Now it is your turn. In example D), would this be a strong relationship or weak? Or somewhere in between? And do you see a positive or a negative slope to a trend line that runs through the cloud?<\/p>\n<div id=\"h5p-79\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-79\" class=\"h5p-iframe\" data-content-id=\"79\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"Practice 10a.01. Correlation and Slope.\"><\/iframe><\/div>\n<\/div>\n<p><strong>Correlational<\/strong> analysis seeks to answer the question \u201chow closely related are two variables?\u201d This is a very useful analytical approach when we have two numeric variables and we wish to analyze the patterns in how they co-vary. However, <strong>correlational<\/strong> analyses have limitations that it is vital to be aware of.<\/p>\n<p>First, the <strong>correlational<\/strong> method we will cover in this course is only capable of detecting linear relationships. 
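<\/p>
<p>To make this limitation concrete, here is a small Python sketch (the numbers are invented for illustration, not data from this chapter). A perfectly predictable but curved pattern yields an <strong>r<\/strong> of essentially zero, because the formula only registers linear trends:<\/p>

```python
# Toy illustration (invented data): Pearson's r on a perfect but curved
# (quadratic) relationship comes out at essentially zero, because the
# r formula only detects linear co-variation.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sd_x = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sd_y = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    # r = sum(Zx * Zy) / N, using population standard deviations
    return sum(((x - mx) / sd_x) * ((y - my) / sd_y)
               for x, y in zip(xs, ys)) / n

xs = [-3, -2, -1, 0, 1, 2, 3]
print(pearson_r(xs, [2 * x + 1 for x in xs]))  # perfectly linear: prints 1.0
print(pearson_r(xs, [x ** 2 for x in xs]))     # perfect curve: essentially 0
```
<p>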
Patterns that have a curve to them will not be captured by the <strong>correlation<\/strong> formula we will use.<\/p>\n<p>Secondly, <strong>correlation<\/strong> does <em>not<\/em> equal causation. <strong>Correlational<\/strong> research designs do not allow for causal interpretations, because the third variable problem renders <strong>correlational<\/strong> analyses vulnerable to spurious results. When we measure two variables at the same time and plot them against each other, what we can do is <em>describe<\/em> their relationship. We can even test whether the strength of their relationship is significantly different from zero. However, we cannot determine whether X <em>causes<\/em> Y.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-517 alignright\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-1024x869.jpg\" alt=\"\" width=\"358\" height=\"304\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-1024x869.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-300x255.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-768x652.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-65x55.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-225x191.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4-350x297.jpg 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.4.jpg 1063w\" sizes=\"auto, (max-width: 358px) 100vw, 358px\" \/>For example, if we measured the consumption of ice cream as well as drowning deaths on a sample of days throughout the year, we might determine that there is a strong positive 
relationship between the two variables. Consumption of ice cream and drowning deaths are apparently closely related phenomena. But does consumption of ice cream cause drowning deaths? That seems a little far-fetched. Could there be another explanation for the pattern? Is there a third variable that could in fact explain the trends in each of the two variables measured here? What might cause people to consume more ice cream as well as put themselves at greater risk for drowning? Warm weather perhaps? If we were to plot temperature against ice cream consumption and drowning deaths, would we see positive correlations with each? Very likely. With this third variable connecting the two, it would be a logical fallacy to interpret the apparent <strong>correlation<\/strong> shown here as meaningful. But then again, is it possible that consuming ice cream could be a risk factor for drowning? Did an elder ever tell you that you should not eat right before swimming, because you might cramp up and drown? Maybe there is some truth to that.<\/p>\n<p>So how could we find out whether there is a true causal relationship between two variables? In order to make cause-effect conclusions, we must use an experimental design. Two major features of experimental research designs eliminate the logical fallacies associated with <strong>correlations<\/strong>. First, an experiment makes use of random assignment of participants to conditions, because that controls for extraneous variables like the third variable of temperature in this example. 
And secondly, an experiment manipulates the independent variable to establish a cause, and then measures the effects.<\/p>\n<div class=\"textbox textbox--learning-objectives\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\">Requirements for cause-effect conclusions<\/p>\n<\/header>\n<div class=\"textbox__content\">\n<p>A true experiment requires the following elements in order to control for extraneous variables and establish cause-effect directionality:<\/p>\n<ul>\n<li>random assignment of participants to conditions (or randomization of order of conditions in repeated measures designs)<\/li>\n<li>manipulation of the independent variable<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p>In our ice cream and drowning study here, how could we make it into an experiment, to allow for causal conclusions? First, we would have to assign our participants randomly into the experimental and control groups. There must be no systematic bias in who is given ice cream and who is not. Second, we would have to manipulate the independent variable \u2013 we would have to have those participants in the experimental group eat ice cream. Then we would put all participants in water, at the same temperature, and see how many of them drown. We would calculate the average number of drowning events in the ice-cream-eating vs. the control group, and run a t-test or ANOVA to find out if they are significantly different from each other.<\/p>\n<p>Of course, you might be thinking, \u201cwould this be ethical?\u201d At least I hope you are thinking that. Of course not! It would not make sense to allow people to drown, just to answer this empirical question. 
In fact, that is exactly why <strong>correlation<\/strong> exists.<\/p>\n<figure id=\"attachment_521\" aria-describedby=\"caption-attachment-521\" style=\"width: 300px\" class=\"wp-caption alignleft\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-521 size-medium\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-300x261.png\" alt=\"\" width=\"300\" height=\"261\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-300x261.png 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-1024x891.png 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-768x668.png 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-65x57.png 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-225x196.png 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5-350x305.png 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.5.png 1203w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><figcaption id=\"caption-attachment-521\" class=\"wp-caption-text\"><em>&#8220;Herp Derp :D&#8221;\u00a0by\u00a0O hai :3\u00a0is licensed under\u00a0CC BY 2.0<\/em><\/figcaption><\/figure>\n<p>Often, practical or ethical limitations make an experiment prohibitively difficult or impossible. 
If we are limited to <strong>correlational<\/strong> techniques in a particular research study, then we simply cannot draw cause-effect inferences.<\/p>\n<p>So, a major take-home point of this lesson is\u2026 don\u2019t be like this guy.<\/p>\n<p>&nbsp;<\/p>\n<div id=\"h5p-80\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-80\" class=\"h5p-iframe\" data-content-id=\"80\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"Practice 10a.02. Why correlational research?\"><\/iframe><\/div>\n<\/div>\n<p>Okay, so how do we go about calculating <strong>correlation<\/strong>? Well, similar to ANOVA, we can think of the process conceptually as the partitioning of variance. But this time, what counts as good variance is <strong>covariance<\/strong>. This is the systematic variance that both variables X and Y have in common. Because it is variance that is explained by the co-relation of the two variables, we will put <strong>covariance<\/strong> in the \u201cgood\u201d bucket. The random variance that is unexplained by the relationship between X and Y, the distance between the dots and the trend line, that is the variance that we will put in the \u201cbad\u201d bucket. 
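<\/p>
<p>Here is a minimal numeric sketch of that idea in Python (the scores are invented for illustration): the \u201cgood bucket\u201d is the covariance, and dividing it by the separate variability of X and Y yields <strong>r<\/strong>.<\/p>

```python
# Toy sketch (invented scores): r is the covariance of X and Y (their
# shared, systematic variability) divided by the variability of X and Y
# taken separately.

friendliness = [2, 4, 5, 7, 9]   # hypothetical cat friendliness scores
scritches    = [2, 3, 4, 7, 8]   # hypothetical scritches received

n = len(friendliness)
mx = sum(friendliness) / n
my = sum(scritches) / n

# covariance: the average cross-product of deviations (the "good" bucket)
cov = sum((x - mx) * (y - my)
          for x, y in zip(friendliness, scritches)) / n

# each variable's own (population) standard deviation
sd_x = (sum((x - mx) ** 2 for x in friendliness) / n) ** 0.5
sd_y = (sum((y - my) ** 2 for y in scritches) / n) ** 0.5

r = cov / (sd_x * sd_y)
print(round(r, 3))  # prints 0.979, a strong positive correlation
```
<p>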
A conceptual formula for the correlation coefficient <strong><a class=\"glossary-term\" aria-haspopup=\"dialog\" aria-describedby=\"definition\" href=\"#term_504_523\">r<\/a><\/strong> would be the covariability of X and Y divided by the variability of X and Y separately.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-527\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.6.png\" alt=\"Conceptual formula for r\" width=\"438\" height=\"81\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.6.png 703w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.6-300x55.png 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.6-65x12.png 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.6-225x42.png 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.6-350x65.png 350w\" sizes=\"auto, (max-width: 438px) 100vw, 438px\" \/><\/p>\n<p>Once we find <strong>r<\/strong>, another statistic that provides helpful information is <strong><a class=\"glossary-term\" aria-haspopup=\"dialog\" aria-describedby=\"definition\" href=\"#term_504_524\">r squared<\/a><\/strong>. <strong>r<sup>2<\/sup><\/strong> is the proportion of variability in one variable that can be explained by the relationship with the other variable. Make note of this fact, because the proportion of variability explained by a <strong>correlation<\/strong> is a very helpful metric.<\/p>\n<p>&nbsp;<\/p>\n<div id=\"h5p-82\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-82\" class=\"h5p-iframe\" data-content-id=\"82\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"Practice 10a.03. 
Correlation, proportion of explained variability\"><\/iframe><\/div>\n<\/div>\n<p>Now we can examine what form a hypothesis test would take in the context of a <strong>correlational<\/strong> research design. Such a hypothesis test asks the question, \u201chow unlikely is it that the <strong>correlation<\/strong> coefficient is actually zero?\u201d<\/p>\n<p>In step 1, in order to keep the hypothesis in a form similar to what we did before, we can identify the populations in a particular way. Population 1 will be &#8220;people like those in the sample,&#8221; and population 2 will be &#8220;people who show no relationship between the variables.&#8221; That way, the research hypothesis can be set up as &#8220;The correlation for population 1 is [greater than\/less than\/different from] the correlation for population 2.&#8221; The null hypothesis can be &#8220;The correlation for population 1 is the same as the correlation for population 2.&#8221;<\/p>\n<p>In step 2, we need to find the characteristics of the comparison distribution, and in this case we need the correlation coefficient <strong>r<\/strong>, which can range from -1 to 1. An <strong>r<\/strong> value of 0 indicates there is no <strong>correlation<\/strong> whatsoever between the two measured variables. An <strong>r<\/strong> of 1 is a perfect positive <strong>correlation<\/strong>, and an <strong>r<\/strong> of -1 is a perfect negative correlation. 
Most <strong>correlations<\/strong> in real life fall closer to 0 than to 1 or -1.<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-676b2fb961e34f77a42f21b945ea32bc_l3.png\" height=\"38\" width=\"138\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#114;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#92;&#115;&#117;&#109;&#32;&#40;&#90;&#95;&#123;&#88;&#125;&#92;&#116;&#105;&#109;&#101;&#115;&#32;&#90;&#95;&#123;&#89;&#125;&#41;&#125;&#123;&#78;&#125;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>This correlation coefficient formula makes use of Z-scores, which is a great way to review these standardized scores covered in an earlier chapter.\u00a0 Recall that<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 37px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-9ff4a1c7099641ee0b6856dfdc537f34_l3.png\" height=\"37\" width=\"95\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#90;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#88;&#45;&#77;&#125;&#123;&#83;&#68;&#125;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p> , where <\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 39px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-a3f3c5866e26173b6cd1a9ec6c9a7971_l3.png\" height=\"39\" width=\"151\" class=\"ql-img-displayed-equation 
quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#83;&#68;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#92;&#115;&#117;&#109;&#32;&#40;&#88;&#45;&#77;&#41;&#94;&#123;&#50;&#125;&#125;&#123;&#78;&#125;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>For each variable, X and Y, we must calculate the mean and standard deviation of the variable, so each score can be translated to Z-scores. Only then can they be cross-multiplied and then summed in the <strong>r<\/strong> formula.<\/p>\n<p>Once we calculate the <strong>r<\/strong> value for a <strong>correlation<\/strong>, we can test the statistical significance of this value, based on how extreme it is on the t distribution. An <strong>r<\/strong> of 0 is placed in the centre of the t distribution, as the comparison distribution mean, and positive one and negative one are placed at either tail of the distribution.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-530\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-1024x523.jpg\" alt=\"\" width=\"489\" height=\"250\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-1024x523.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-300x153.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-768x392.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-1536x784.jpg 1536w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-65x33.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-225x115.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7-350x179.jpg 350w, 
https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.7.jpg 1602w\" sizes=\"auto, (max-width: 489px) 100vw, 489px\" \/><\/p>\n<p>The further out we get into the appropriate tail, the better our chance of rejecting the null hypothesis of a zero correlation. The bad news is, we are back to the t-test, which means we have to think about directionality. The good news is, this is a great opportunity to refresh ourselves on how the t-test works.<\/p>\n<p>In step 3, we find the cutoff score using the <a href=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/back-matter\/t-distribution-tables\/\" target=\"_blank\" rel=\"noopener\">t tables<\/a>. For correlation, degrees of freedom will be <em>N<\/em>-2, where <em>N<\/em> is the number of people in the sample. This is because we have two measured (numeric) variables, each of which has <em>N<\/em>-1 scores that are free to vary.<\/p>\n<p>In step 4, the t-test is calculated as <strong>r<\/strong> divided by <em>S<sub>r<\/sub><\/em>, where <em>S<sub>r<\/sub><\/em>\u00a0quantifies the unexplained variability.<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 35px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-7b6465ff49e4c2e1bbe5176921054181_l3.png\" height=\"35\" width=\"51\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#116;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#114;&#125;&#123;&#83;&#95;&#123;&#114;&#125;&#125;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>Step 5 is the decision: we reject the null hypothesis if the t-test result falls in the shaded tail beyond the cutoff.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-400\" 
src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-1024x631.jpg\" alt=\"\" width=\"403\" height=\"249\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-1024x631.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-300x185.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-768x473.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-1536x947.jpg 1536w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-65x40.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-225x139.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6-350x216.jpg 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-6.6.jpg 1668w\" sizes=\"auto, (max-width: 403px) 100vw, 403px\" \/><\/p>\n<p>We could express our hypothesis test results on the relationship between income and grades in the following manner:<\/p>\n<p><span class=\"pullquote-left\">\u201cWe found that there was a significant positive correlation between family income and student grade average (r = 0.65, t<sub>11<\/sub> = 2.97, p &lt; 0.05).\u201d<\/span><\/p>\n<p>Notice that our interpretation is <em>not<\/em> that we found higher family income results in a higher grade average. Why not? Well, as we said before, causal conclusions require experimental design. To draw such a conclusion regarding the relationship between family income and student grade average, we would need to randomly assign students into family income conditions, wealthy or poor, then measure the effects of that manipulation on their grades. 
Just like our drowning example, this seems not only logistically challenging, but also rather unethical. So, we are limited to <strong>correlation<\/strong> here for a reason, and thus we simply need to characterize our findings as a relationship or pattern, rather than a statement of cause and effect.<\/p>\n<p>As we add the final branch to our decision tree, we now have a decision flow for the situation of no independent variables. If both variables are numerical, you must use correlation to test their relationship.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-534\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-1024x376.jpg\" alt=\"\" width=\"1024\" height=\"376\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-1024x376.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-300x110.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-768x282.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-1536x563.jpg 1536w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-2048x751.jpg 2048w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-65x24.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-225x83.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.8-350x128.jpg 350w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h1>10b. Regression<\/h1>\n<p>In the next part of the chapter, we will examine the statistical technique of <strong>regression<\/strong>. 
<strong>Regression<\/strong> allows us to extend the findings of a <strong>correlation<\/strong> to predict the future from the past.<\/p>\n<p>Once we have calculated a <strong>correlation<\/strong>, a <strong>regression<\/strong> allows us to predict how an individual would perform on one variable based on their performance on another variable. In an example of a correlation between income and grades, the <strong>regression<\/strong> would allow us to see what grade level would be achieved by an individual with a family income level that was not actually collected in our dataset. We could also identify the income level based on a given grade level.<\/p>\n<p>The <strong>regression<\/strong> line is a line through our scatter plot that can be described with an equation. The equation has two components: slope and intercept. The slope says how many units up (or down) the line goes for each unit over.\u00a0 The intercept says where the line hits the y axis.<\/p>\n<p>The <strong>regression<\/strong> line is a line that \u201cbest fits\u201d the data points that we have collected. Mathematically, it is the line that minimizes the squared deviations (i.e. 
error) of the individual points from the line.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-536\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.9-.jpg\" alt=\"\" width=\"342\" height=\"292\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.9-.jpg 675w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.9--300x256.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.9--65x55.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.9--225x192.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.9--350x299.jpg 350w\" sizes=\"auto, (max-width: 342px) 100vw, 342px\" \/><\/p>\n<p>To find the equation for the <strong>regression<\/strong> line, you can calculate slope <em>b<\/em> and then intercept <em>a<\/em> using the formulas shown.<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 41px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-fad54e7221f432eb7198066835ee491f_l3.png\" height=\"41\" width=\"194\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#98;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#88;&#45;&#77;&#95;&#123;&#88;&#125;&#41;&#40;&#89;&#45;&#77;&#95;&#123;&#89;&#125;&#41;&#125;&#123;&#83;&#83;&#95;&#123;&#88;&#125;&#125;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>&nbsp;<\/p>\n<div class=\"textbox textbox--exercises\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\">Steps to find b, slope of regression line<\/p>\n<\/header>\n<div 
class=\"textbox__content\">\n<ol>\n<li>\n<div>For each individual, find the deviation of the X score from the mean.*<\/div>\n<\/li>\n<li>\n<div>For each individual, find the deviation of the Y score from the mean.*<\/div>\n<\/li>\n<li>\n<div>For each individual, multiply the deviation of X by the corresponding deviation of Y<\/div>\n<\/li>\n<li>\n<div>Add together the products from step 3 for all individuals.<\/div>\n<\/li>\n<li>\n<div>Divide this sum by SS<sub>x<\/sub>.*<\/div>\n<p><span style=\"font-size: 1em\">*These calculations should already be completed for correlation.<\/span><\/li>\n<\/ol>\n<\/div>\n<\/div>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 19px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-f8f6f55d596e9ddb0554b59795c4668d_l3.png\" height=\"19\" width=\"135\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#97;&#61;&#77;&#95;&#123;&#89;&#125;&#45;&#40;&#98;&#41;&#77;&#95;&#123;&#88;&#125;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>Once <em>a<\/em> and <em>b<\/em> are calculated, we can plug these numbers into the <strong>regression<\/strong> line equation.<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 24px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-b9d6cc6a283d4133c43f98841b993e3d_l3.png\" height=\"24\" width=\"105\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#92;&#119;&#105;&#100;&#101;&#104;&#97;&#116;&#123;&#89;&#125;&#61;&#97;&#43;&#98;&#40;&#88;&#41;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>Here I will show you the 
regression line equation for our family income vs. grade example.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-537\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-1024x877.jpg\" alt=\"\" width=\"489\" height=\"419\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-1024x877.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-300x257.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-768x657.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-1536x1315.jpg 1536w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-65x56.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-225x193.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10-350x300.jpg 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.10.jpg 1654w\" sizes=\"auto, (max-width: 489px) 100vw, 489px\" \/><\/p>\n<p><em>b<\/em> is 0.11, which means that for every one unit of Family Income, the line goes up 0.11 unit of Average Grade. <em>a<\/em> is 77.96, which means that the line meets the y axis at a height of 77.96.<\/p>\n<p>&nbsp;<\/p>\n<div id=\"h5p-83\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-83\" class=\"h5p-iframe\" data-content-id=\"83\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"Practice 10b.01 Regression line slope\"><\/iframe><\/div>\n<\/div>\n<p>The line equation allows us to plot the precise <strong>regression<\/strong> line on the scatter plot. 
To plot a regression line, pick two X values that are on the low and the high end of the scale. Plug those into the line equation to find the corresponding Y values that are on the line.<\/p>\n<p>Using the regression line, you can predict Y values from X values, and X values from Y values. This means that even if you did not have someone in your dataset with a family income of 105, you can figure out what a student\u2019s average grade would have been if they had that family income. Likewise, if you had no one in your dataset with an average grade of 75, you can figure out what their family income would have been if they had that grade. Note that these are just predictions. They are imperfect, and do not take into account other factors or individual variability.<\/p>\n<p>Here we will try predicting the average grade (Y) for a student who has a family income of 200. To do this, we will plug 200 in for X in the regression line equation (as shown here).<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 24px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-64c4fb675275e705fe7a5636fb29c8ae_l3.png\" height=\"24\" width=\"161\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#92;&#119;&#105;&#100;&#101;&#104;&#97;&#116;&#123;&#89;&#125;&#61;&#55;&#55;&#46;&#57;&#54;&#32;&#43;&#48;&#46;&#49;&#49;&#40;&#88;&#41;&#32;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 24px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-c2c25c175e908e3bfb940e2445190358_l3.png\" height=\"24\" width=\"172\" 
class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#92;&#119;&#105;&#100;&#101;&#104;&#97;&#116;&#123;&#89;&#125;&#61;&#55;&#55;&#46;&#57;&#54;&#32;&#43;&#48;&#46;&#49;&#49;&#40;&#50;&#48;&#48;&#41;&#32;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 19px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-5478805838c4dba007da6a15222b98fb_l3.png\" height=\"19\" width=\"87\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#92;&#119;&#105;&#100;&#101;&#104;&#97;&#116;&#123;&#89;&#125;&#61;&#49;&#48;&#48;&#46;&#53;&#53;&#32;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-538\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-1024x863.jpg\" alt=\"\" width=\"491\" height=\"414\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-1024x863.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-300x253.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-768x647.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-1536x1294.jpg 1536w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-65x55.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-225x190.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11-350x295.jpg 350w, 
https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.11.jpg 1671w\" sizes=\"auto, (max-width: 491px) 100vw, 491px\" \/><\/p>\n<p>The result is a grade of 100.55 (note: with the rounded values shown, 77.96 + 0.11(200) = 99.96; the small discrepancy comes from rounding the coefficients). Of course, a grade average above 100% is impossible (at least at many institutions). In this case, our prediction shows a \u201cceiling effect\u201d: the maximum possible grade (100%) is reached before the maximum family income in the dataset. Therefore, the <strong>regression<\/strong> line equation becomes useless above a family income of around 190.<\/p>\n<p>Now, we can try predicting family income (X) for a student with an average grade of 60 (Y). To do this, you must plug in 60 for Y in the equation, then solve for X.<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 24px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-64c4fb675275e705fe7a5636fb29c8ae_l3.png\" height=\"24\" width=\"161\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#92;&#119;&#105;&#100;&#101;&#104;&#97;&#116;&#123;&#89;&#125;&#61;&#55;&#55;&#46;&#57;&#54;&#32;&#43;&#48;&#46;&#49;&#49;&#40;&#88;&#41;&#32;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 19px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-4fef94388fd2f75d95be6efca47c809e_l3.png\" height=\"19\" width=\"164\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#54;&#48;&#61;&#55;&#55;&#46;&#57;&#54;&#32;&#43;&#48;&#46;&#49;&#49;&#40;&#88;&#41;&#32;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" 
\/><\/p>\n<p>Notice that to rearrange the equation to solve for X, you first have to move the intercept (a) over:<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 19px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-db378e98cfb097f4c271595350a253e9_l3.png\" height=\"19\" width=\"176\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#32;&#92;&#091;&#40;&#54;&#48;&#45;&#55;&#55;&#46;&#57;&#54;&#41;&#61;&#48;&#46;&#49;&#49;&#40;&#88;&#41;&#32;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>Then you have to divide by the slope:<\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 38px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/ql-cache\/quicklatex.com-2b9344f36249d6767b62e35e67fb99b7_l3.png\" height=\"38\" width=\"135\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#091;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#54;&#48;&#45;&#55;&#55;&#46;&#57;&#54;&#41;&#125;&#123;&#48;&#46;&#49;&#49;&#125;&#61;&#88;&#92;&#093;&#32;\" title=\"Rendered by QuickLaTeX.com\" \/><\/p>\n<p>Now you are ready to solve for X: about -159 (or about -163 if you carry the rounded coefficients through). Either way, the X we find for a Y of 60 is a negative income! This is, of course, impossible (or very unlikely). 
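<\/p>
<p>Both directions of prediction can be reproduced with a short sketch. It uses the rounded coefficients (a = 77.96, b = 0.11), which is why it returns 99.96 and about -163 rather than the 100.55 and -159 obtained before rounding:<\/p>

```python
a, b = 77.96, 0.11   # rounded intercept and slope from the example

def predict_y(x):
    # predicted average grade (Y-hat) for a given family income (X)
    return a + b * x

def predict_x(y):
    # rearranged equation: the family income (X) implied by a grade (Y)
    return (y - a) / b

print(round(predict_y(200), 2))   # 99.96  (above the 100% grade ceiling)
print(round(predict_x(60), 2))    # -163.27 (an impossible negative income)
```

The sketch makes both cautions visible at once: a very high income predicts an impossible grade, and a low grade implies an impossible income.
<p>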
Here we can see a \u201cfloor effect\u201d.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-535\" src=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-1024x586.jpg\" alt=\"\" width=\"665\" height=\"381\" srcset=\"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-1024x586.jpg 1024w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-300x172.jpg 300w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-768x440.jpg 768w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-1536x880.jpg 1536w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-65x37.jpg 65w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-225x129.jpg 225w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12-350x200.jpg 350w, https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-content\/uploads\/sites\/1469\/2021\/08\/Fig-10.12.jpg 1570w\" sizes=\"auto, (max-width: 665px) 100vw, 665px\" \/><\/p>\n<p>This means that the minimum family income (zero) is reached before the minimum possible grade. So the <strong>regression<\/strong> line becomes useless below an average grade of 77.96 (the Y intercept). Floor and ceiling effects are common problems in <strong>regression<\/strong>, and you should watch out for them whenever you use this technique. 
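<\/p>
<p>One way to locate these limits is to ask where the line crosses the natural bounds of each variable. Here is a minimal sketch with the rounded coefficients; note that rounding puts the income cutoff near 200 rather than the roughly 190 quoted earlier:<\/p>

```python
a, b = 77.96, 0.11   # rounded coefficients from the example

# income at which the predicted grade reaches the 100% ceiling
income_at_ceiling = (100 - a) / b
# predicted grade when family income sits at its floor of zero (the intercept)
grade_at_floor = a + b * 0

print(round(income_at_ceiling, 1))   # 200.4 with these rounded values
print(grade_at_floor)                # 77.96
```

Predictions are only sensible between these bounds; outside them the line runs into the ceiling and floor effects described above.
<p>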
We can see that the <strong>regression<\/strong> line for this particular dataset is useful for making predictions only within the range of roughly 80-100 for average grade and 0-190 for family income.<\/p>\n<div id=\"h5p-81\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-81\" class=\"h5p-iframe\" data-content-id=\"81\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"Practice 10b.02. Correlation and regression\"><\/iframe><\/div>\n<\/div>\n<p>Now, of course, predictions are not perfect. <strong>Regression<\/strong> allows us to predict one variable from another, but as we can see in our scatterplots, not every real data point falls exactly on the regression line. Why is that? Because, unless the <strong>correlation<\/strong> is perfect, some variability in the real data is not accounted for by the regression equation. We can estimate just how accurate our predictions are by looking at <strong>r squared<\/strong>. <strong>r<sup>2<\/sup><\/strong> is the proportion of variance in one variable explained by its relationship with the other variable; the rest is unaccounted for.<\/p>\n<p>Just as we can include multiple factors in an ANOVA, we can also include multiple predictor variables in a <strong>regression<\/strong>. We will not attempt that in this course, but if you take a more advanced statistics course, you will see that the more variables you include, each explaining a piece of the variability in the criterion variable, the more precise your <strong>regression<\/strong> model will become. Here, we are using just one predictor variable, and our <strong>r<sup>2<\/sup><\/strong> is likely to be well shy of 100% explained variance. 
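<\/p>
<p>To make <strong>r<sup>2<\/sup><\/strong> concrete, it can be computed from the same deviation sums used for the slope. A minimal sketch with made-up numbers, not the chapter\u2019s dataset:<\/p>

```python
# Hypothetical data (not from the chapter)
x = [2, 4, 6, 8, 10]
y = [65, 70, 68, 78, 80]

m_x = sum(x) / len(x)
m_y = sum(y) / len(y)
sp = sum((xi - m_x) * (yi - m_y) for xi, yi in zip(x, y))   # sum of products
ss_x = sum((xi - m_x) ** 2 for xi in x)                     # SS for X
ss_y = sum((yi - m_y) ** 2 for yi in y)                     # SS for Y

r = sp / (ss_x * ss_y) ** 0.5    # Pearson r, always between -1 and +1
r_squared = r ** 2               # proportion of variance explained
print(round(r, 3), round(r_squared, 3))   # prints 0.925 0.855
```

Here about 86% of the variability in Y is explained by its relationship with X; the remaining 14% is the unaccounted-for variability that keeps individual points off the line.
<p>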
So in that case, we can expect our <strong>regression<\/strong> to be only modestly accurate.<\/p>\n<h1>Chapter Summary<\/h1>\n<p>This chapter introduced you to the statistical techniques of <strong>correlation<\/strong> and <strong>regression<\/strong>. We saw how to detect and describe the strength and direction of the relationship between two numeric variables, and how to run a hypothesis test to find out whether the <strong>correlation<\/strong> is significantly different from zero. Finally, we saw that <strong>regression<\/strong> can generate a linear model that allows for the prediction of one variable from the other. A key reminder: <strong>correlation<\/strong> does <em>not<\/em> equal causation. These techniques suit research designs that do not meet the requirements of experimental design, and as such, our conclusions regarding the statistical findings must avoid cause-and-effect language.<\/p>\n<p>Key terms:<\/p>\n<table class=\"no-lines\" style=\"border-collapse: collapse;width: 100%\">\n<tbody>\n<tr>\n<td style=\"width: 33.3333%\"><strong>correlation<\/strong><\/td>\n<td style=\"width: 33.3333%\"><strong>regression<\/strong><\/td>\n<td style=\"width: 33.3333%\"><strong>r squared<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%\"><strong>covariance<\/strong><\/td>\n<td style=\"width: 33.3333%\"><strong>r<\/strong><\/td>\n<td style=\"width: 33.3333%\"><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"glossary\"><span class=\"screen-reader-text\" id=\"definition\">definition<\/span><template id=\"term_504_505\"><div class=\"glossary__definition\" role=\"dialog\" data-id=\"term_504_505\"><div tabindex=\"-1\"><p>statistical analysis of the direction and strength of the relationships between two numerical variables<\/p>\n<\/div><button><span aria-hidden=\"true\">&times;<\/span><span class=\"screen-reader-text\">Close definition<\/span><\/button><\/div><\/template><template id=\"term_504_507\"><div class=\"glossary__definition\" role=\"dialog\" 
data-id=\"term_504_507\"><div tabindex=\"-1\"><p>the variability that two numeric variables have in common<\/p>\n<\/div><button><span aria-hidden=\"true\">&times;<\/span><span class=\"screen-reader-text\">Close definition<\/span><\/button><\/div><\/template><template id=\"term_504_509\"><div class=\"glossary__definition\" role=\"dialog\" data-id=\"term_504_509\"><div tabindex=\"-1\"><p>a statistical model that allows for prediction based on a trend line that \u201cbest fits\u201d the data points that we have collected. Mathematically, a regression line is one that minimizes the squared deviations (i.e. error) of each point from the line.<\/p>\n<\/div><button><span aria-hidden=\"true\">&times;<\/span><span class=\"screen-reader-text\">Close definition<\/span><\/button><\/div><\/template><template id=\"term_504_523\"><div class=\"glossary__definition\" role=\"dialog\" data-id=\"term_504_523\"><div tabindex=\"-1\"><p>correlation coefficient that describes the strength and direction of the relationship between two numeric variables. Can range from -1 (perfect negative relationship) through 0 (no relationship) to +1 (perfect positive relationship).<\/p>\n<\/div><button><span aria-hidden=\"true\">&times;<\/span><span class=\"screen-reader-text\">Close definition<\/span><\/button><\/div><\/template><template id=\"term_504_524\"><div class=\"glossary__definition\" role=\"dialog\" data-id=\"term_504_524\"><div tabindex=\"-1\"><p>proportion of variability in one variable that can be explained by the relationship with the other variable. 
Can be between 0 and 1.<\/p>\n<\/div><button><span aria-hidden=\"true\">&times;<\/span><span class=\"screen-reader-text\">Close definition<\/span><\/button><\/div><\/template><\/div>","protected":false},"author":1394,"menu_order":10,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[48],"contributor":[],"license":[],"class_list":["post-504","chapter","type-chapter","status-publish","hentry","chapter-type-numberless"],"part":3,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/pressbooks\/v2\/chapters\/504","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/wp\/v2\/users\/1394"}],"version-history":[{"count":25,"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/pressbooks\/v2\/chapters\/504\/revisions"}],"predecessor-version":[{"id":998,"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/pressbooks\/v2\/chapters\/504\/revisions\/998"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/pressbooks\/v2\/parts\/3"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/pressbooks\/v2\/chapters\/504\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/wp\/v2\/media?parent=504"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/pressbooks\/v2\/chapter-type?post=504"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/wp\/v2\/contributor?post=504"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/statspsych\/wp-json\/wp\/
v2\/license?post=504"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}