Missing data: deletion, imputation and reporting
An advanced guide to missing data mechanisms, complete-case analysis, imputation, bias, sensitivity analysis and transparent reporting.
Structure
Problem, intuition, method, working, limitations and discussion.
Best for
Students preparing for coursework, analysis, interpretation or revision.
Use with
Learning Hub lessons, tutoring sessions or dissertation planning.
Resource guide
Problem
Missing data are common in real research datasets, but students often handle them casually. They may ignore missingness, delete incomplete rows automatically or replace missing values with simple averages without considering why the data are missing. These choices can change the sample, reduce power, introduce bias and affect conclusions. Missing data handling is therefore not just a technical cleaning step; it is part of the validity of the analysis.
- Missing values are sometimes hidden as codes such as 99, 999 or -1.
- Students often delete incomplete rows without reporting how many were removed.
- Mean imputation is used without understanding its consequences.
- Different analyses may use different sample sizes without explanation.
- Missingness may be related to the outcome or exposure.
- Complete-case analysis can reduce power and introduce bias into estimates.
- The missing data method is often absent from dissertation reports.
Intuition
The key question is not only how much data are missing, but why they are missing. If data are missing completely at random, complete-case analysis remains unbiased, although it loses precision. If missingness is related to observed variables, methods that use those variables, such as multiple imputation, are usually more appropriate. If missingness depends on unobserved values, no standard method fully solves the problem and sensitivity analysis becomes important.
- Missing completely at random (MCAR) means missingness is unrelated to both observed and unobserved data.
- Missing at random (MAR) means that, once observed variables are taken into account, missingness is unrelated to the unobserved values.
- Missing not at random (MNAR) means missingness still depends on unobserved values, including the missing value itself, even after accounting for observed variables.
- Small amounts of missing data can still matter if missingness is systematic.
- Large amounts of missing data can reduce precision and credibility.
- The missingness pattern should be described before choosing a method.
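The three mechanisms can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a recommended analysis: the variables `age` and `bmi`, the coefficients and the missingness probabilities are all invented for illustration. It compares the unadjusted complete-case mean of BMI with the mean in the full simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: age is fully observed, bmi is the variable with gaps.
age = rng.normal(50, 10, n)
bmi = 27 + 0.1 * (age - 50) + rng.normal(0, 4, n)


def logistic(z):
    return 1 / (1 + np.exp(-z))


# Probability that bmi is missing under each mechanism (roughly 30% overall).
p_mcar = np.full(n, 0.3)                     # unrelated to any data
p_mar = logistic(-1.0 + 0.1 * (age - 50))    # depends on observed age only
p_mnar = logistic(-1.0 + 0.3 * (bmi - 27))   # depends on bmi itself

results = {"full data": bmi.mean()}
for name, p in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]:
    missing = rng.random(n) < p
    results[name] = bmi[~missing].mean()     # unadjusted complete-case mean

for name, value in results.items():
    print(f"{name:>9}: mean BMI = {value:.2f}")
```

In this setup the MCAR mean should sit close to the full-data mean, while the MAR and MNAR means drift away from it. Under MAR the drift could in principle be corrected by using `age`; under MNAR the information needed for the correction is exactly what is missing.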
Method
A sensible missing data workflow begins by identifying missing values correctly. Then the analyst should summarise the amount and pattern of missingness, compare complete and incomplete cases where useful, decide on a handling method and report the decision clearly. The method should be justified based on the research question, amount of missingness, likely missingness mechanism and complexity of the analysis.
- Step 1: Identify all missing value codes in the raw dataset.
- Step 2: Convert hidden missing codes into proper missing values.
- Step 3: Count missing values for each variable.
- Step 4: Examine missingness patterns across key variables.
- Step 5: Check whether missingness is related to observed characteristics.
- Step 6: Decide whether complete-case analysis is defensible.
- Step 7: Consider multiple imputation when missingness is substantial and plausibly related to observed variables.
- Step 8: Avoid simple mean imputation for formal inference.
- Step 9: Compare results across methods where possible.
- Step 10: Report missing data handling transparently.
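Steps 1 to 4 can be sketched with pandas. The dataset, the column names and the assumption that 99, 999 and -1 are missing-value codes are hypothetical; in a real project the codes should come from the codebook, not from guesswork.

```python
import pandas as pd

# Hypothetical raw data: 99, 999 and -1 are assumed to be missing-value codes.
raw = pd.DataFrame({
    "age": [34, 51, 999, 47, 62, 29],
    "bmi": [22.5, -1, 31.0, 99, 27.4, 24.1],
    "sbp": [118, 135, 142, 999, 150, 121],
})

# Steps 1-2: declare the codes per variable, then convert them to NaN.
missing_codes = {"age": [999], "bmi": [-1, 99], "sbp": [999]}
clean = raw.astype(float)
for column, codes in missing_codes.items():
    clean[column] = clean[column].mask(clean[column].isin(codes))

# Step 3: count and percentage missing for each variable.
summary = pd.DataFrame({
    "n_missing": clean.isna().sum(),
    "pct_missing": (100 * clean.isna().mean()).round(1),
})

# Step 4: frequency of each missingness pattern (True = missing).
patterns = clean.isna().value_counts()

# Rows that would survive a complete-case analysis.
complete_cases = clean.dropna()

print(summary)
print(patterns)
print("Complete cases:", len(complete_cases), "of", len(clean))
```

The codes are declared per variable because a value such as 99 can be a missing code for one variable and a plausible measurement for another.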
Working
Suppose a health study examines whether BMI is associated with blood pressure, but BMI is missing for 18% of participants. If those with missing BMI are similar to those with observed BMI, complete-case analysis may be reasonable, although power is reduced. If BMI is more likely to be missing among older participants or those with severe disease, deletion may bias results. Multiple imputation may be considered if the missingness can be explained using observed variables.
- First, report how many participants have missing BMI.
- Second, compare key characteristics between complete and incomplete cases.
- Third, decide whether deletion changes the analysis sample substantially.
- Fourth, consider whether missingness is plausibly related to observed variables.
- Fifth, use complete-case analysis only if the assumptions are acceptable.
- Sixth, consider multiple imputation for a more defensible analysis when appropriate.
- Seventh, report whether conclusions changed under different approaches.
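The first few steps of this working can be sketched on simulated data. Everything here is an assumption made for illustration: the data are generated, not real, and BMI is deliberately made more likely to be missing among older participants so that the comparison has something to find.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000

# Simulated stand-in for the study; names and coefficients are hypothetical.
age = rng.normal(55, 12, n)
bmi = 27 + 0.05 * (age - 55) + rng.normal(0, 4, n)
sbp = 120 + 0.5 * (age - 55) + 0.8 * (bmi - 27) + rng.normal(0, 10, n)

# BMI is more likely to be missing for older participants (roughly 18% overall).
p_missing = 1 / (1 + np.exp(-(-1.6 + 0.05 * (age - 55))))
bmi_observed = np.where(rng.random(n) < p_missing, np.nan, bmi)

df = pd.DataFrame({"age": age, "bmi": bmi_observed, "sbp": sbp})
df["bmi_missing"] = df["bmi"].isna()

# First: how many participants have missing BMI?
n_missing = int(df["bmi_missing"].sum())
pct_missing = 100 * n_missing / len(df)

# Second: compare key characteristics between complete and incomplete cases.
comparison = df.groupby("bmi_missing")[["age", "sbp"]].agg(["mean", "std", "count"])


def standardized_difference(values, group):
    """Mean difference (group minus rest) in pooled standard deviation units."""
    a, b = values[group], values[~group]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd


smd_age = standardized_difference(df["age"], df["bmi_missing"])

# Third: how much does deletion change the analysis sample?
complete = df.dropna(subset=["bmi"])

print(f"Missing BMI: {n_missing} of {len(df)} ({pct_missing:.1f}%)")
print(comparison.round(1))
print(f"Standardized age difference (missing vs observed): {smd_age:.2f}")
print(f"Complete-case sample size: {len(complete)}")
```

A clearly positive standardized difference in age would suggest that participants with missing BMI are older, which is the situation in which deletion may bias results and an approach that uses age, such as multiple imputation, deserves consideration.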
Limitations
No missing data method is magic. Complete-case analysis can be biased when missingness is not completely at random. Single imputation, such as mean imputation, underestimates variability because the filled-in values are treated as if they had been observed. Multiple imputation relies on assumptions and must be implemented carefully. If data are missing not at random, standard methods may still be biased. The most important principle is transparency: readers should know what was missing, how it was handled and how it may affect conclusions.
- Complete-case analysis can waste data and introduce bias.
- Mean imputation distorts variability and relationships between variables.
- Multiple imputation requires careful model specification.
- Imputation cannot recover information that was never collected.
- Missing not at random mechanisms are difficult to handle.
- High missingness can weaken credibility even after imputation.
- Sensitivity analysis may be needed when assumptions are uncertain.
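The effect of mean imputation on variability and on relationships between variables can be checked directly. This is a minimal sketch on simulated data with invented parameters: `y` is made missing completely at random for about 30% of rows, so any distortion comes from the imputation itself and not from the missingness mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulated pair of variables with a true correlation of 0.6.
x = rng.normal(0, 1, n)
y = 0.6 * x + rng.normal(0, 0.8, n)

# Remove about 30% of y completely at random.
missing = rng.random(n) < 0.3
y_obs = y[~missing]

# Mean imputation: every missing y is replaced by the observed mean.
y_imputed = y.copy()
y_imputed[missing] = y_obs.mean()

sd_observed = y_obs.std(ddof=1)
sd_imputed = y_imputed.std(ddof=1)
corr_observed = np.corrcoef(x[~missing], y_obs)[0, 1]
corr_imputed = np.corrcoef(x, y_imputed)[0, 1]

# Naive standard error of the mean if imputed values are treated as real data.
se_observed = sd_observed / np.sqrt(len(y_obs))
se_imputed = sd_imputed / np.sqrt(n)

print(f"SD of y:      observed {sd_observed:.3f}, mean-imputed {sd_imputed:.3f}")
print(f"corr(x, y):   observed {corr_observed:.3f}, mean-imputed {corr_imputed:.3f}")
print(f"SE of mean y: observed {se_observed:.4f}, mean-imputed {se_imputed:.4f}")
```

The imputed column has a smaller standard deviation, a weaker correlation with `x` and a smaller naive standard error, even though no new information was added. That is why mean imputation is discouraged for formal inference.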
Discussion
A strong dissertation or research report should include a short missing data statement. This should describe the extent of missingness, the method used, the rationale for that method and the possible impact on interpretation. Missing data should not be hidden, because unexplained changes in sample size can undermine confidence in the analysis.
- Report the number and percentage missing for important variables.
- Explain whether the analysis used complete cases or imputation.
- Justify the chosen method briefly.
- Mention whether missingness may have biased the findings.
- Avoid pretending missing data do not matter.
- Use tables or flow diagrams where sample exclusions are important.
- Discuss missing data as a limitation when relevant.
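Where sample exclusions matter, the numbers for a table or flow diagram can be generated from the data and not counted by hand. The helper `exclusion_flow`, the example dataset and the step labels below are hypothetical; the sketch only shows one way to track how many rows remain after each exclusion.

```python
import numpy as np
import pandas as pd


def exclusion_flow(df, steps):
    """Return a table of rows excluded and remaining after each step.

    `steps` is a list of (label, columns) pairs; rows with a missing value
    in any of the listed columns are excluded at that step.
    """
    rows = [{"step": "Enrolled", "excluded": 0, "remaining": len(df)}]
    current = df
    for label, columns in steps:
        kept = current.dropna(subset=columns)
        rows.append({
            "step": label,
            "excluded": len(current) - len(kept),
            "remaining": len(kept),
        })
        current = kept
    return pd.DataFrame(rows)


# Hypothetical example data with missing exposure and outcome values.
study = pd.DataFrame({
    "bmi": [24.0, np.nan, 31.2, np.nan, 27.5],
    "sbp": [120, 135, np.nan, 128, 140],
    "age": [40, 52, 61, 47, 58],
})

flow = exclusion_flow(study, [
    ("Missing blood pressure", ["sbp"]),
    ("Missing BMI", ["bmi"]),
])
n_final = int(flow["remaining"].iloc[-1])

print(flow.to_string(index=False))
print(f"Final analysis sample: {n_final} of {len(study)} "
      f"({100 * n_final / len(study):.0f}%)")
```

Reporting the exclusions in order makes any change in sample size between analyses traceable to a specific variable.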
Practical checklist
Before you apply this topic
- Have you identified all missing value codes?
- Have you counted missing values for each key variable?
- Have you checked missingness patterns?
- Have you compared complete and incomplete cases where useful?
- Have you considered why data are missing?
- Have you avoided automatic deletion without explanation?
- Have you avoided simple mean imputation for formal inference?
- Have you considered whether multiple imputation is appropriate?
- Have you reported the final analysis sample size?
- Have you explained how missing data were handled?
- Have you discussed possible bias from missing data?
- Have you considered sensitivity analysis if needed?
Common mistakes
What to avoid
- Ignoring missing data completely.
- Leaving hidden missing codes as real numerical values.
- Deleting all incomplete rows without reporting it.
- Using mean imputation without understanding its effect.
- Reporting different sample sizes without explanation.
- Assuming missing data are harmless because the percentage is small.
- Imputing outcome values carelessly.
- Using imputation without including important related variables.
- Failing to compare complete and incomplete cases.
- Not discussing missing data as a limitation.
How this connects to learning
Use the guide as a bridge between theory and application.
A resource guide should not replace a full course or live teaching session. Instead, it helps you organise your thinking. Use it to identify what you understand, what feels unclear, and what questions you should ask before applying a method to real data.
Before a lesson
Read the intuition and problem sections to prepare.
During analysis
Use the method and checklist to guide decisions.
When writing
Use limitations and discussion to improve interpretation.