Accounting for missing data in statistical analyses: multiple imputation is not always the answer

Rachael A Hughes^{1,2}, Jon Heron^{1,2,3}, Jonathan A C Sterne^{1,3}, Kate Tilling^{1,2,3}

Affiliations

¹ Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK.
² MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK.
³ NIHR Bristol Biomedical Research Centre, University of Bristol, Bristol, UK.

PMID: 30879056
PMCID: PMC6693809
DOI: 10.1093/ije/dyz032

Rachael A Hughes et al. Int J Epidemiol. 2019 Aug 1;48(4):1294-1304. doi: 10.1093/ije/dyz032.


Abstract

Background: Missing data are unavoidable in epidemiological research, potentially leading to bias and loss of precision. Multiple imputation (MI) is widely advocated as an improvement over complete case analysis (CCA). However, contrary to widespread belief, CCA is preferable to MI in some situations.

Methods: We provide guidance on choice of analysis when data are incomplete. Using causal diagrams to depict missingness mechanisms, we describe when CCA will not be biased by missing data and compare MI and CCA, with respect to bias and efficiency, in a range of missing data situations. We illustrate selection of an appropriate method in practice.

Results: For most regression models, CCA gives unbiased results when the chance of being a complete case does not depend on the outcome after taking the covariates into consideration, which includes situations where data are missing not at random. Consequently, there are situations in which CCA analyses are unbiased while MI analyses, assuming missing at random (MAR), are biased. By contrast MI, unlike CCA, is valid for all MAR situations and has the potential to use information contained in the incomplete cases and auxiliary variables to reduce bias and/or improve precision. For this reason, MI was preferred over CCA in our real data example.

Conclusions: Choice of method for dealing with missing data is crucial for validity of conclusions, and should be based on careful consideration of the reasons for the missing data, missing data patterns and the availability of auxiliary information.

Keywords: Complete case analysis; inverse probability weighting; missing data; missing data mechanisms; missing data patterns; multiple imputation.
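
The condition stated in the Results (CCA is unbiased when the chance of being a complete case does not depend on the outcome given the covariates, even if data are missing not at random) can be illustrated with a small simulation. The sketch below is not taken from the paper: the data-generating model, the true slope of 2, the logistic missingness models and all variable names are assumptions chosen for illustration only.

```python
import numpy as np

# Illustrative simulation (assumed model, not the paper's data):
# outcome y depends linearly on exposure x with true slope 2.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)


def cca_slope(x_vals, y_vals, observed):
    """Fit y ~ x by least squares using only the complete cases."""
    slope, _intercept = np.polyfit(x_vals[observed], y_vals[observed], 1)
    return slope


# Scenario A: the exposure is more likely to be missing when its own value
# is low. This is missing not at random, but given x the chance of being a
# complete case does not depend on the outcome y.
observed_a = rng.random(n) < 1.0 / (1.0 + np.exp(-x))

# Scenario B: the chance of being a complete case depends on the outcome y.
observed_b = rng.random(n) < 1.0 / (1.0 + np.exp(-y))

slope_a = cca_slope(x, y, observed_a)
slope_b = cca_slope(x, y, observed_b)

print(f"Scenario A (missingness depends on exposure): slope = {slope_a:.3f}")
print(f"Scenario B (missingness depends on outcome):  slope = {slope_b:.3f}")
```

Under these assumptions the complete case estimate in scenario A should stay close to the true slope, whereas in scenario B it should be attenuated, because selecting on the outcome distorts the distribution of the outcome at each level of the exposure.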


Figures

**Figure 1.**
Diagrams showing causal relationships between the completely observed outcomes of the linear and logistic regression (depression symptom score and self-harm respectively), completely observed covariates maternal substance use and sex, incompletely observed exposure cannabis use, and MissCU, a binary variable that indicates whether cannabis use is observed or missing. Note, for clarity we have not included all arrows between the covariates.

**Figure 2.**
Diagram showing the causal relationship between the outcome [adult body mass index (BMI)], exposure (weight at age 5), confounders [birth weight, sex, gestational age, maternal weight, paternal weight and parental socioeconomic status (SES)], and complete case, a binary variable that indicates whether a participant is a complete case (observed values for the outcome, exposure and all confounders) or an incomplete case (missing values for at least one of these variables). Note, we have not included all arrows between the covariates.


