8 Hypothesis Testing
8.1 Causality
From the start, I need to make one thing clear: whether observed only in sample data or generalizable to populations, so far we have discussed only statistical associations.
Well, what kind of other associations could we discuss, I can imagine you grumbling, it’s a statistics textbook — of course the associations will be statistical!
You are correct, of course, but (you knew there would be a “but”) — “statistical” here has a very narrow meaning, one that most people unfamiliar with statistics seem unaware of, and which they thus take to mean a lot more than it actually does.
You see, statistical association refers only to whether there is a pattern in the data or not; whether certain attributes of one variable tend to go with specific attributes of another variable. In no way does this imply that one variable is what it is because of another, or that a change in one causes another variable to change, or that a variable is dependent on another.
If we can state any of these, we make a much stronger claim — one of causality — and the associations are then called causal[1]. When we have a causal association, we call one variable independent and the other dependent[2].
See if you can differentiate statistical and causal associations. Smoking is associated with lung cancer: people who smoke (or smoke more) have lung cancer at higher rates than those who don’t. Smoking causes lung cancer: smokers are more likely to get lung cancer because of the fact that they smoke. Class attendance and test scores are associated: students who attend more classes have higher test scores. Test scores are dependent on class attendance: coming to class more often is partly responsible for higher test scores. Parental education and offspring education are positively correlated: higher levels of parental schooling are associated with higher levels of schooling for the offspring. Offspring education depends on parental education: individuals have higher levels of schooling because their parents were better educated themselves.
The first sentence in each of the examples in the previous paragraph was a statement of statistical association; the second was one of causality. If they generally sound the same to you, you should start paying more explicit attention to phrasing, specifically to how claims of association are put into words. As one of the most often-quoted sayings in statistics goes, correlation is not causation. Apart from urging caution in interpreting results, it also brings attention to how careful researchers must be when reporting results and conclusions in order not to overstate their claims.
What is the main difference between statistical association[3] and causation? Briefly, the method of establishing either; what is necessary for us to be able to claim one or the other.
Establishing a statistical association between two variables is relatively straightforward and easy: there are tests for that (as we shall shortly see)[4]. Establishing a causal association between two variables (especially in the social sciences), on the other hand, is notoriously hard.
Criteria for establishing causality. There are three basic requirements for establishing causal associations, and an additional, overarching one related to the logic of research as a whole.
1. Does the variable we claim is the cause come before the variable we claim as an effect in time?
This requirement is also known as temporal precedence — that is, whether the potential cause happens before the potential outcome. It is squarely based on logic: after all, an outcome cannot logically precede its cause. You can’t take a test on the first day of class and claim that your test score was due to your attending class or being absent later in the semester: that’s not how time works. Similarly, you cannot claim that the bachelor’s degree you will get in the near future is somehow responsible for your parents’ college degrees from twenty or so years ago.
While in these examples the temporal precedence is crystal clear, keep in mind that this is not always the case. There are plenty of situations in social research when it’s difficult to adjudicate which one of a pair of variables came first, as well as cases of mutual causality and reverse causality. Without getting into too much detail, take, for example, the popular finding [citation: Waite] that married people tend to be happier, on average. One can easily conclude that marriage promotes happiness. But what if happier people tend to have more successful relationships leading to marriage and a related propensity to stay married? Which one, marriage or happiness, is the cause of the association and which one the outcome? Further analysis and investigation of the variables’ association is necessary in such a case (and even that might not lead to a definite conclusion).
2. Are the two variables statistically associated?
This requirement provides further evidence that statistical association is different from causation: the presence of a statistical association appears here as only one of several requirements for establishing causality. In short, the presence of a statistical association between two variables is a necessary but not sufficient condition for claiming causality.
Why it’s necessary should be obvious: we cannot claim that one variable is a cause of a potential outcome variable if we have no evidence whatsoever that the two are statistically associated in the first place. Otherwise, if there is no observable pattern between the values/categories of the two variables, how can we claim that changes in one variable cause changes in the other? Again, logically, the cause and the effect must be related in some way, and for that association we must have enough evidence at a specific desired level of certainty. (The remaining chapters are devoted to finding just that type of evidence.)
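To make the idea of checking for a statistical association concrete, here is a minimal sketch in Python that computes Pearson's correlation coefficient (the measure footnote [3] reserves for continuous variables) by hand. The attendance and score numbers below are invented purely for illustration, not taken from any real data.

```python
# A minimal sketch: measuring the statistical association between two
# continuous variables with Pearson's correlation coefficient.
# All data below are hypothetical, for illustration only.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: classes attended and test score for eight students.
attendance = [4, 6, 7, 9, 10, 11, 12, 13]
scores     = [52, 60, 58, 70, 74, 72, 81, 85]

r = pearson_r(attendance, scores)
print(f"r = {r:.2f}")  # a strong positive association
```

Even a correlation this strong only satisfies the association requirement; on its own it says nothing about temporal precedence or about alternative explanations.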
3. Are there no alternative explanations of the variables’ statistical association?
This condition is the most complicated of the three, as it requires the examination of other variables, not just the two of initial interest. Briefly again, establishing causality is a concern because the social world is vastly complex and social science variables interact with one another in complicated ways in real life. Basically, in the social world there rarely is a single cause of anything.
For example, is the statistical association in question observed because the potential cause variable indeed affects the potential outcome variable — or because both variables are in fact effects of a third variable (sometimes without any real association between the original two)? Can we differentiate between a genuine relationship and a so-called spurious (i.e., fake, bogus) one, like the one just described? As well, perhaps we only observe a statistical association between two variables and claim one as the cause because we haven’t considered other potential causes. How can we be certain that it is (solely) the “cause” we have identified, or that, if we considered alternative causes, the original so-called “cause” would remain one?
Regarding the latter, consider again the association between class attendance and test scores. Would you believe me if I told you that your statistics test scores depended only on your class attendance? What about hours of studying, potential after-class tutoring, doing exercises, pre-existing math knowledge, searching for/reading additional sources online or in the library, asking relevant questions in class and/or office hours, etc., etc.?
There are numerous reasons why anyone would score higher or lower on a test, and I just listed a few of the study-related ones. We don’t need to limit ourselves to these though. How about general health on the date of the exam (maybe you have come to the test sick)? Or romantic relationship or family problems one might be going through? A sick relative at home? Episodes of anxiety and/or depression? Being overworked, working a night shift before the test, and/or not getting enough sleep for another reason?
You certainly can add even more reasons for why a particular test score ends up what it is, and that class attendance is merely one such potential cause. (Are we even certain that, if we somehow accounted for all the other potential causes, we would still observe an association between attendance and scores?)
As to spurious associations, consider that it’s possible for two variables to seem associated (i.e., there is a pattern between their values/categories; changes in one are accompanied by changes in the other) only because a third variable is causing the changes in both. Then, if we ignore the third variable and its two genuine associations and focus instead on its two outcomes, which just happen to change at the same time, we would draw the wrong conclusion, attributing causality to an association that essentially doesn’t exist.
Take, for example, life expectancy and the internet: since the 1990s, as the internet was becoming more and more widespread in Canada, Canadian life expectancy at birth was also increasing. We could therefore conclude that the internet prolongs life. But there is a reason why you’ve never before heard about this particular beneficial effect of the internet on one’s health and life — it’s extremely doubtful it exists. After all, wouldn’t it make more sense to attribute both to general technological progress (not only in communications, IT, and infrastructure but also in healthcare and medicine)?
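A spurious association like this one can be mimicked with a toy simulation (all variable names and numbers here are hypothetical): a common cause z, standing in for "technological progress," drives both x and y, which never influence each other directly, yet the two end up strongly correlated.

```python
# A toy simulation of a spurious association: a confounder z drives both
# x and y, which are otherwise unrelated, yet x and y appear correlated.
# Variable names and parameters are hypothetical, for illustration only.
import random

random.seed(42)

n = 1000
z = [random.gauss(0, 1) for _ in range(n)]    # common cause ("progress")
x = [zi + random.gauss(0, 0.5) for zi in z]   # e.g., internet access
y = [zi + random.gauss(0, 0.5) for zi in z]   # e.g., life expectancy

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = sum((ai - ma) ** 2 for ai in a) ** 0.5
    sb = sum((bi - mb) ** 2 for bi in b) ** 0.5
    return cov / (sa * sb)

print(f"corr(x, y) = {corr(x, y):.2f}")  # strong, despite no direct x-y link
```

Holding z fixed (say, comparing x and y only among cases with similar z values) would make the apparent association largely disappear, which is exactly what distinguishes a spurious association from a genuine one.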
Finally, this is where the additional, overarching general condition for causality comes into play. Assuming the three conditions listed above are met, claiming causality essentially implies providing a logical explanation of the observed association. In and of itself, causality is about having a theory — an idea, if you will, of why there is such an association. Without such an idea, we are left simply with two variables which may or may not be statistically — but definitely not causally — associated, and the statistical association doesn’t mean much on its own [5]. And given that the statistical association you may think exists might not even be there once alternative causes are considered, you should realize by now that making a causal claim is indeed not a walk in the park.
What is to be done then? Obviously, such a brief presentation on the topic leaves a lot to be desired and is not going to be enough to fully prepare you for the task of comprehensively establishing causality in real-life research. What you should be able to do even now, however, is appreciate causality’s complexity, keep in mind the necessary conditions for claiming causality (and apply these when reading about research findings and questioning conclusions), and always, always keep an eye on alternative explanations in particular (by asking yourself “what else could be causing this?”). These should provide enough basis for you not to take statements about statistical association between variables as more than they are, and to not confuse them with claims about causality.
As well, I hope you will be careful in phrasing your own conclusions when communicating statistical research to others by not overstating the findings of any analyses you might end up doing, especially if they involve only two variables, as per our discussion. By now it should be clear that real-life research considers many variables at the same time. Such multivariate analysis lies beyond the scope of this book, so you should take any bivariate associations we discuss to be of a solely indicative (or exploratory) nature — something that additional, multivariate analysis may establish at a later point, but definitely not a finished product. After all, you didn’t expect that you could establish causality by considering only two variables, did you?
With this in mind, we proceed with the question of how to establish statistical associations — and not just the ones observable in sample data, but the associations in which we are truly interested, i.e., those generalizable to populations. You may not be able to make claims about causality at this point but you can certainly learn how to test for evidence of statistical associations between two variables. To that purpose, the next section introduces the logic of using hypotheses in research and how hypotheses get tested.
1. Please make sure you don't confuse causal ['KO-zal] and causality [ko-'ZA-liti] with casual ['KEH-jwal] and casuality (which doesn't exist).
2. You can think of the independent variable (i.e., the cause) as free to vary on its own; with or without the dependent variable, the independent is what it is. On the other hand, the dependent variable (the effect) varies because of the independent one; that's why it's called dependent. (Note that it's dependent variable and not dependant. The latter applies to people who are economically supported by others, like children are dependants of their parents.)
3. While many times the words association and correlation are used interchangeably, I prefer to use correlation only in relation to continuous variables, in the context of the correlation coefficient. Referring to any statistical association as correlation, however, is technically not wrong; the usage is simply a matter of preference.
4. Of course, it's not as easy as I present it further in this text. As an introduction to the topic, however, it will suffice. My point is that, relative to establishing causality, it is easier.
5. You most certainly need to check these associations out: http://www.tylervigen.com/spurious-correlations. (At this point you need any distraction you can get, and this time you can even say it's for a good, pedagogically meaningful cause. Or so I can tell myself.) Among them, you'll learn that the number of doctorates in Sociology awarded in the USA is very strongly correlated over time with worldwide non-commercial space launches, not to mention that the number of drownings by people falling into a pool correlates moderately strongly with the number of movies in which Nicolas Cage appeared for the ten years between 1999 and 2009 (CITATION Spurious Media LLC/Tyler Vigen, http://www.tylervigen.com/spurious-correlations).