Chapter 2 What Data Looks Like and Summarizing Data

2.3.2 Missing Data: Adding Valid Percentages

If you’ve paid attention so far, you must have noticed that three of our 21 respondents provided a “Didn’t answer” response when asked about their educational attainment. Sometimes respondents may refuse to answer a question, or the question may not have been applicable to them and wasn’t asked, or a response might not get recorded due to an error, etc. In short, sometimes we have a case of what is known as missing data.

 

What do we know about the educational attainment of the three individuals who, for whatever reason, didn’t answer this question? Nothing.

 

Can we in some way infer their educational attainment? Not with the data provided in the example.

 

So then what do we do? How do we analyze our educational attainment variable?

 

The most frequent — and strongly recommended (especially for people just starting on their journey to research) — course of action is to simply drop the missing cases[1]. Missing cases have no part in any analysis and using them as they are would inevitably compromise conclusions — after all, we have no information on what we want to know about them, and we cannot make that information up.

 

Statistical software packages differ in how they handle missing data by default. SPSS’s default is to skip missing cases so that analysis is always based on valid cases only.

 

As well, SPSS provides a separate column in Variable View (Missing) indicating which values in the data stand for a missing data point. As discussed in Section 2.1 (https://pressbooks.bccampus.ca/simplestats/chapter/2-1-data/), you can find the coding of the values in the Values column in Variable View. Clicking the specific cell in that column opens up a window with the value codes. There you may find several types of missing data, typically values such as “Valid skip”/“Not applicable” (the respondent was not asked the question on which the variable is based because of a previous answer)[2], “Don’t know” (the respondent did not know the answer to the question), “Refusal” (the respondent refused to answer the question), “Not stated” (the question should have been answered, or an answer should have been recorded, but for whatever reason it wasn’t), etc.

 

Apart from “Not applicable”, the codes listed here are standard Statistics Canada codes used in all their datasets and can be found in any Statistics Canada dataset documentation[3].
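
To make the idea of reserved missing codes concrete, here is a minimal Python sketch (not SPSS, and not part of the original example) that separates valid responses from missing ones. It assumes a hypothetical two-digit variable, so the codes 96 to 99 from note 3 apply; the function name and the list of responses are made up for illustration.

```python
# Assumption: a two-digit variable, so the reserved missing codes are 96-99
# (see note 3). A one- or three-digit variable would use 6-9 or 996-999.
MISSING_CODES = {
    96: "Valid skip",
    97: "Don't know",
    98: "Refused",
    99: "Not stated",
}


def split_valid_and_missing(values, missing_codes=MISSING_CODES):
    """Return (valid_values, missing_counts) for a list of coded responses."""
    valid = [v for v in values if v not in missing_codes]
    missing_counts = {}
    for v in values:
        if v in missing_codes:
            label = missing_codes[v]
            missing_counts[label] = missing_counts.get(label, 0) + 1
    return valid, missing_counts


# Made-up responses for illustration only (not the data from Example 2.2).
responses = [1, 3, 2, 99, 2, 97, 4, 96, 2]
valid, missing = split_valid_and_missing(responses)
print(valid)    # the six valid codes, in their original order
print(missing)  # one count per type of missing response
```

Only the `valid` list would then be carried forward into any analysis; the `missing` counts are kept purely for reporting how many cases were dropped and why.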

 

So given that we had three cases of missing data within our group of 21 respondents, are the percentages reported in the previous sub-section’s Table 2.2 in Example 2.2 (C) valid to use?

 

Watch Out!! #4… for Findings Based on Missing Data

 

This will be a short warning but it deserves its own scary-red Watch Out!! reiteration: do not trust analysis and findings that include missing cases, as they will be distorted and unreliable. Missing data is exactly that: missing. It simply does not exist. As a beginner researcher, always make sure you have dropped (i.e., excluded) any missing cases before analyzing your data and reporting any results.

 

Considering that Table 2.2 did include missing data in the calculation of percentages, let us correct that by modifying it and including another column, valid percentages.

 

Example 2.2 (D) Hypothetical Data on Educational Attainment, Organized and with Relative Frequencies and Valid Percentages Added

Table 2.3 Educational Attainment by Frequency, Percent and Valid Percent

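| | Degree | Frequency | Percent | Valid Percent |
|---|---|---|---|---|
| Valid | No degree | 1 | 4.7 | 5.6 |
| | Secondary/High School | 6 | 28.6 | 33.3 |
| | Associate’s | 3 | 14.3 | 16.7 |
| | Bachelor’s | 5 | 23.8 | 27.8 |
| | Master’s | 2 | 9.5 | 11.1 |
| | PhD | 1 | 4.7 | 5.6 |
| | Total Valid | 18 | 85.6 | 100.0 |
| Missing | Didn’t answer | 3 | 14.3 | |
| | Total Missing | 3 | 14.3 | |
| | TOTAL | 21 | 100.0 | |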

 

As you see in the modified Table 2.3 above, I have separated the missing cases from the valid cases (the cases for which we have educational attainment data). Since we have only 18 valid cases, we should use only those 18 cases for any calculations and analysis — and not the total of 21 cases (which includes the missing). Thus, instead of having just

 

    \[\frac{f}{N}(100)=\frac{1}{21}(100)=0.047(100)=4.7\%\]

 

along with the rest of the categories’ percentages calculated in this way, we should calculate the categories’ valid percentages, discarding the three missing cases, like this:

 

    \[\frac{f}{N}(100)=\frac{1}{18}(100)=0.056(100)=5.6\%\]

 

(As usual, I only show you the calculation for the first category as the rest follow in the same way.)
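
For readers who want to check the remaining categories, the following Python sketch repeats the two calculations for every valid category in Table 2.3. It is only an illustration: the variable and function names are my own, and the values are rounded to one decimal place. Note that strict rounding gives 4.8 for 1/21 (and 85.7 for 18/21), whereas the table prints 4.7 and 85.6; the table appears to cut off the extra decimals rather than round them.

```python
# Frequencies taken from Table 2.3; "Didn't answer" is the only missing category.
valid_freq = {
    "No degree": 1,
    "Secondary/High School": 6,
    "Associate's": 3,
    "Bachelor's": 5,
    "Master's": 2,
    "PhD": 1,
}
missing_freq = {"Didn't answer": 3}

n_valid = sum(valid_freq.values())               # valid cases only
n_total = n_valid + sum(missing_freq.values())   # valid + missing cases


def pct(f, n):
    """Relative frequency f/n as a percentage, rounded to one decimal."""
    return round(100 * f / n, 1)


# Percent uses all cases as the base; valid percent uses valid cases only.
percent = {k: pct(f, n_total) for k, f in valid_freq.items()}
valid_percent = {k: pct(f, n_valid) for k, f in valid_freq.items()}

for k in valid_freq:
    print(f"{k:<22} {valid_freq[k]:>3} {percent[k]:>6} {valid_percent[k]:>6}")
```

The only difference between the two columns is the denominator: 21 for the percent column and 18 for the valid percent column.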

 

Although the table still shows the percentages that were calculated with the missing cases included, note that the valid percentages are the only percentages you should use in your analysis and report in your findings.

 

Alright, you might say now: we added percentages and valid percentages to the simple frequencies, so surely we have a complete frequency table by now.

 

Sorry, no, not yet. One thing remains.


  1. Depending on the particular data and particular situation, and assuming strong justification, researchers experienced in data analysis may have different options, such as estimation, imputation of means, etc. These, however, are beyond the scope of this text. The safest action for students/beginners to take remains dropping any missing cases from the analysis. See https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/ for a discussion.
  2. For example, if a respondent has indicated previously that they didn't smoke, a subsequent question about how often they smoked would make no sense; the respondent then would be "validly skipped" from answering this subsequent question.
  3. Currently, Statistics Canada uses 6, 96, 996, etc. for "Valid skip"; 7, 97, 997, etc. for "Don't know"; 8, 98, 998, etc. for "Refused"; and 9, 99, 999, etc. for "Not stated".

License

Simple Stats Tools Copyright © by Mariana Gatzeva. All Rights Reserved.
