BMC Med Res Methodol. 2024 Aug 28;24(1):188.

doi: 10.1186/s12874-024-02310-6.

Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review

Marziyeh Afkanpour¹, Elham Hosseinzadeh¹, Hamed Tabesh²

Affiliations

¹ Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.
² Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran. tabesh79@gmail.com.

PMID: 39198744
PMCID: PMC11351057
DOI: 10.1186/s12874-024-02310-6

Marziyeh Afkanpour et al. BMC Med Res Methodol. 2024.

Abstract

Background and objectives: Understanding the research dataset is crucial for obtaining reliable and valid outcomes. Health analysts need a thorough understanding of the data they analyze so that they can suggest practical solutions for handling missing data in a clinical data source. Accurate handling of missing values is critical for producing precise estimates and making informed decisions, especially in clinical research. As data have grown more diverse and complex, researchers have developed a wide range of imputation techniques. To help analysts choose among them, we conducted a systematic review that organizes imputation techniques by the characteristics of tabular datasets, namely the mechanism, pattern, and ratio of missingness, in order to identify the most appropriate imputation methods in the healthcare field.
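The missingness mechanisms mentioned above are usually described with Rubin's three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The following is a minimal sketch of how each mechanism could be simulated on a complete variable. It is not taken from the review; the function names, the `age` and `sbp` variables, and the thresholds and rates are all hypothetical choices made for illustration.

```python
import random


def mask_mcar(values, rate, rng):
    """MCAR: every value is dropped with the same probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def mask_mar(values, covariate, threshold, rate, rng):
    """MAR: the drop probability depends only on an observed covariate."""
    return [
        None if c > threshold and rng.random() < rate else v
        for v, c in zip(values, covariate)
    ]


def mask_mnar(values, threshold, rate, rng):
    """MNAR: the drop probability depends on the (unobserved) value itself."""
    return [None if v > threshold and rng.random() < rate else v for v in values]


# Hypothetical synthetic data: age is always observed, systolic blood
# pressure (sbp) is the variable that receives missing values.
rng = random.Random(0)
age = [rng.uniform(30, 80) for _ in range(500)]
sbp = [100 + 0.6 * a + rng.gauss(0, 5) for a in age]

sbp_mcar = mask_mcar(sbp, rate=0.2, rng=rng)
sbp_mar = mask_mar(sbp, age, threshold=60, rate=0.5, rng=rng)
sbp_mnar = mask_mnar(sbp, threshold=140, rate=0.5, rng=rng)
```

In a simulation of this kind, the masked copies can be imputed and compared against the known complete values, which is one way to assess how an imputation method behaves under each mechanism.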

Materials and methods: We searched four databases (PubMed, Web of Science, Scopus, and IEEE Xplore) for articles published up to September 20, 2023, that discussed imputation methods for addressing missing values in structured clinical datasets. Our analysis of the selected articles focused on four key aspects: the mechanism, pattern, and ratio of missingness, and the imputation strategy used. By synthesizing insights from these perspectives, we constructed an evidence map to recommend suitable imputation methods for handling missing values in a tabular dataset.
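Two of the aspects named above, the ratio and the pattern of missingness, can be read directly from a table. The sketch below assumes records stored as lists with `None` marking a missing value, and uses the common textbook distinction between univariate, monotone, and arbitrary patterns. The function names and the example records are hypothetical, and the review itself may categorize patterns differently.

```python
def missing_ratio(rows, col):
    """Fraction of records whose value in column `col` is missing (None)."""
    return sum(1 for r in rows if r[col] is None) / len(rows)


def missingness_pattern(rows):
    """Classify the pattern as 'complete', 'univariate', 'monotone', or 'arbitrary'."""
    n_cols = len(rows[0])
    masks = [tuple(v is None for v in r) for r in rows]
    cols_with_missing = [c for c in range(n_cols) if any(m[c] for m in masks)]
    if not cols_with_missing:
        return "complete"
    if len(cols_with_missing) == 1:
        return "univariate"
    # A pattern is monotone when the columns can be ordered so that, within
    # each record, once a value is missing all later values are missing too.
    # Ordering columns by their number of missing values finds such an
    # ordering whenever one exists.
    order = sorted(range(n_cols), key=lambda c: sum(m[c] for m in masks))
    for m in masks:
        seen_missing = False
        for c in order:
            if m[c]:
                seen_missing = True
            elif seen_missing:
                return "arbitrary"
    return "monotone"


# Hypothetical records: (age, creatinine, systolic blood pressure).
records = [
    [63.0, 1.2, 140.0],
    [51.0, None, None],
    [70.0, 0.9, None],
    [45.0, 1.1, 135.0],
]

ratio_sbp = missing_ratio(records, 2)
pattern = missingness_pattern(records)
```

Here the blood pressure column is missing in half of the records, and every record that lacks creatinine also lacks blood pressure, so the pattern is classified as monotone.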

Results: Out of 2955 articles, 58 were included in the analysis. The evidence map, built from the structure of the missing values and the types of imputation methods extracted from these studies, showed that 45% of the studies employed conventional statistical methods, 31% used machine learning and deep learning methods, and 24% applied hybrid imputation techniques.
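To make the first two categories concrete, the sketch below contrasts one conventional statistical method (mean imputation) with one simple machine learning method (k-nearest-neighbour imputation). These are illustrative choices, not the specific methods of any study in the review. The function names and data are hypothetical, and the k-NN version assumes that all columns other than the imputed one are fully observed.

```python
import math
from statistics import mean


def mean_impute(rows, col):
    """Conventional approach: fill missing values with the observed column mean."""
    fill = mean(r[col] for r in rows if r[col] is not None)
    out = [list(r) for r in rows]  # copy so the input is left unchanged
    for r in out:
        if r[col] is None:
            r[col] = fill
    return out


def knn_impute(rows, col, k=2):
    """k-NN approach: fill with the mean over the k nearest complete records,
    using Euclidean distance on the other columns (assumed fully observed)."""
    others = [c for c in range(len(rows[0])) if c != col]
    donors = [r for r in rows if r[col] is not None]
    out = [list(r) for r in rows]
    for r in out:
        if r[col] is None:
            target = [r[c] for c in others]
            nearest = sorted(
                donors, key=lambda d: math.dist([d[c] for c in others], target)
            )[:k]
            r[col] = mean(d[col] for d in nearest)
    return out


# Hypothetical records: (age, systolic blood pressure).
records = [
    [50.0, 120.0],
    [52.0, 124.0],
    [70.0, 150.0],
    [72.0, 154.0],
    [71.0, None],
]

by_mean = mean_impute(records, 1)[4][1]      # 137.0
by_knn = knn_impute(records, 1, k=2)[4][1]   # 152.0
```

The mean fills the gap with the overall average, while k-NN borrows from the two records of similar age. A hybrid technique would combine approaches of this kind.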

Conclusion: Considering the structure and characteristics of missing values in a clinical dataset is essential for choosing the most appropriate imputation technique, especially among conventional statistical methods. Estimating missing values so that they reflect reality increases the likelihood of obtaining high-quality, reusable data and contributes to precise medical decision-making. This review provides a guideline for choosing the most appropriate imputation methods during data preprocessing, before analysis of structured clinical datasets.

Keywords: Clinical dataset; Imputation methods; Mechanism of missingness; Missing ratio; Missing values; Pattern of missingness; Simulation study.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram

**Fig. 2**
Trend of primary studies over the last 8 years

**Fig. 3**
Part (a) shows the percentage of primary studies by data source. Part (b) displays the distribution of clinical contexts, emphasizing the importance of managing missing data in medical decision-making processes. Part (c) illustrates the software used for implementing data imputation techniques. Part (d) presents the types of missing variables extracted from each study. Part (e) shows the distribution of missing value patterns. Part (f) depicts the frequency of missing value mechanisms mentioned in the primary studies.

**Fig. 4**
Main categorization of imputation methods

**Fig. 5**
Evidence map: the main imputation methods, classified into four groups, are mapped against seven cases for the structure of missing values: (mechanism, pattern, ratio of missing values), (mechanism, pattern), (mechanism, ratio of missing values), (pattern, ratio of missing values), (mechanism), (pattern), and (ratio of missing values).
