Patterns (N Y). 2024 Feb 29;5(4):100946. doi: 10.1016/j.patter.2024.100946. eCollection 2024 Apr 12.

An evaluation of synthetic data augmentation for mitigating covariate bias in health data


Lamin Juwara et al. Patterns (N Y).

Abstract

Data bias is a major concern in biomedical research, especially when evaluating large-scale observational datasets. It leads to imprecise predictions and inconsistent estimates in standard regression models. We compare the performance of commonly used bias-mitigating approaches (resampling, algorithmic, and post hoc approaches) against a synthetic data-augmentation method that utilizes sequential boosted decision trees to synthesize under-represented groups. The approach is called synthetic minority augmentation (SMA). Through simulations and analysis of real health datasets on a logistic regression workload, the approaches are evaluated across various bias scenarios (types and severity levels). Performance was assessed based on area under the curve, calibration (Brier score), precision of parameter estimates, confidence interval overlap, and fairness. Overall, SMA produces the closest results to the ground truth in low to medium bias (50% or less missing proportion). In high bias (80% or more missing proportion), the advantage of SMA is not obvious, with no specific method consistently outperforming others.
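The three fairness metrics used throughout the evaluation (SPD, EOD, and AOD) can be computed directly from binary predictions and a protected-group indicator. A minimal sketch follows, assuming a binary outcome and a binary privileged/unprivileged group label; the function name and interface are illustrative, not the paper's code:

```python
import numpy as np

def fairness_metrics(y_true, y_pred, group):
    """Compute SPD, EOD, and AOD for a binary protected attribute.

    group == 1 marks the privileged group, group == 0 the unprivileged.
    All inputs are binary arrays of equal length.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    priv, unpriv = group == 1, group == 0

    # Statistical parity difference: gap in positive prediction rates
    spd = y_pred[unpriv].mean() - y_pred[priv].mean()

    def tpr(mask):  # true positive rate within a group
        return y_pred[mask & (y_true == 1)].mean()

    def fpr(mask):  # false positive rate within a group
        return y_pred[mask & (y_true == 0)].mean()

    # Equal opportunity difference: gap in true positive rates
    eod = tpr(unpriv) - tpr(priv)
    # Average odds difference: mean of the FPR gap and the TPR gap
    aod = 0.5 * ((fpr(unpriv) - fpr(priv)) + (tpr(unpriv) - tpr(priv)))
    return spd, eod, aod
```

Values near zero indicate parity between groups; the dashed reference lines in the figures correspond to these metrics computed on the unbiased data.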

Keywords: classification; covariate imbalance; data bias; fairness; generative model; synthetic data generation.


Conflict of interest statement

This work was performed in collaboration with Replica Analytics Ltd. This company is a spin-off from the Children’s Hospital of Eastern Ontario Research Institute. K.E.E. is co-founder and has equity in this company.

Figures

Graphical abstract
Figure 1. Bias type: Simulation results across 500 iterations for the full model showing the impact of bias (increasing missing proportion) on (A) the odds ratio (OR) and the standard deviation of the biasing covariate across simulation runs, (B) the overall area under the curve (AUC), (C) minority-group AUC, and fairness metrics: (D) statistical parity difference (SPD), (E) equal opportunity difference (EOD), and (F) average odds difference (AOD). The dashed line represents the ground truth from the original data.
Figure 2. Bias mitigation approach: Simulation results across 500 iterations for the full model showing comparisons of the OR estimates and the standard deviation of the biasing covariate across simulation runs (A–C), overall model AUCs (D–F), and minority-group AUC of the biasing covariate (G–I) for synthetic minority augmentation (SMA), random oversampling (ROS, the best-performing alternative), and biased data. The dashed line represents the ground truth from unbiased data.
Figure 3. Bias mitigation approach: Simulation results across 500 iterations for the full model showing the fairness estimates based on (A–C) statistical parity difference (SPD), (D–F) equal opportunity difference (EOD), and (G–I) average odds difference (AOD). The estimates are reported for biased data, ROS (the best-performing alternative), and SMA. The dashed line represents the optimal fairness value under no data bias.
Figure 4. Bias type: Impact of bias (increasing missing proportion) on the CCHS data in evaluating (A) the odds ratio (OR) and standard error of the biasing covariate, (B) the overall AUC, (C) minority-group AUC, and three fairness metrics: (D) statistical parity difference (SPD), (E) equal opportunity difference (EOD), and (F) average odds difference (AOD). The bias is assessed under marginal bias (MB), conditional bias I (CBI), and conditional bias II (CBII). The dashed line represents the ground-truth results in the original dataset.
Figure 5. Bias mitigation approach: Comparisons on the CCHS dataset of the OR estimates and standard error of the biasing covariate (A–C), overall model AUCs (D–F), and minority-group AUC of the biasing covariate (G–I) for SMA, random oversampling (ROS, the best-performing alternative), and biased data. The dashed line represents the ground truth from unbiased data.
Figure 6. Bias mitigation approach: Fairness metrics on the CCHS dataset: (A–C) statistical parity difference (SPD), (D–F) equal opportunity difference (EOD), and (G–I) average odds difference (AOD) from the logistic regression model on the cardiovascular health data. The proportion of samples removed varied from 15% to 80%. Original cohort: SPD = 0.096, EOD = 0.092, and AOD = 0.098. The symbol ǂ indicates that the estimates are averaged from m = 100 synthetic copies. The dashed line represents the ground-truth fairness in the original dataset.
Figure 7. Summaries of the OR of the biasing covariate for the four real datasets. The relative performance of each bias-mitigating approach compared to the biased-data estimate is shown. The OR direction is considered improved if the difference between the model OR and the ground-truth estimate is less than the difference between the biased-data OR and the ground truth. Summaries are over all four datasets, and proportions range from 15% to 80%.
Figure 8. Summaries of the model AUC for the four real datasets. The relative performance of each bias-mitigating approach compared to the biased-data results is shown. The model AUC is considered improved if the difference between the model AUC and the ground-truth estimate is less than the difference between the biased-data AUC and the ground truth. Summaries are over all four datasets, and proportions range from 15% to 80%.
Figure 9. Summaries of the minority-group AUC over the four real datasets. The relative performance of each bias-mitigating approach compared to the biased-data results is shown. The minority AUC is considered improved if the difference between the model minority AUC and the ground-truth estimate is less than the difference between the biased-data minority AUC and the ground truth. Summaries are over all bias proportions: 15%, 30%, 50%, 80%, and 95%.
Figure 10. Summaries of the fairness metrics SPD, EOD, and AOD over the four real datasets. The relative performance of each bias-mitigating approach compared to the biased-data results is shown. Fairness is considered improved if the difference between the model fairness and the ground-truth estimate is less than the difference between the biased-data fairness and the ground truth. Summaries are over all four datasets, and proportions range from 15% to 80%.
Figure 11. A schematic for synthetic data augmentation from biased data. From (I) to (II), a synthetic version of the biased data is generated. In step (III), the synthetic minority group is sampled from (II) and augmented with the biased data in (I) to generate a rebalanced dataset.
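The three-step workflow in Figure 11 can be sketched as follows. Here `synthesize` is a hypothetical placeholder for any tabular generative model fitted to the biased data (the paper uses sequential boosted decision trees); the function and its interface are illustrative, not the authors' implementation:

```python
import pandas as pd

def augment_minority(biased_df, synthesize, group_col, minority_value, target_n):
    """Rebalance a biased dataset via synthetic minority augmentation (SMA).

    (I) -> (II): generate a synthetic version of the biased data.
    (III): sample synthetic minority-group rows and append them to the
    biased data until the minority group reaches target_n rows.
    """
    synthetic_df = synthesize(biased_df)                     # step (II)
    syn_minority = synthetic_df[synthetic_df[group_col] == minority_value]
    n_needed = target_n - (biased_df[group_col] == minority_value).sum()
    if n_needed <= 0:
        return biased_df.copy()                              # already balanced
    top_up = syn_minority.sample(n=n_needed, replace=True, random_state=0)
    return pd.concat([biased_df, top_up], ignore_index=True)  # step (III)
```

Only synthetic minority rows are added, so the observed majority-group records are left untouched in the rebalanced dataset.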
Figure 12. Illustration of the synthesis process for a four-variable dataset.
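The sequential synthesis illustrated in Figure 12 models each variable conditionally on the variables synthesized before it. A minimal sketch for categorical data follows, using a single scikit-learn decision tree per column as a stand-in for the boosted trees used in the paper; the function name and interface are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def sequential_synthesize(X, n_rows, rng=None):
    """Synthesize categorical data column by column (X: 2D integer array).

    Column 0 is drawn from its empirical marginal; each later column j is
    drawn from a tree model of X[:, j] given the already-synthesized
    columns 0..j-1.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    out = np.empty((n_rows, p), dtype=X.dtype)
    # First column: sample from the empirical marginal distribution
    out[:, 0] = rng.choice(X[:, 0], size=n_rows, replace=True)
    for j in range(1, p):
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, :j], X[:, j])
        proba = tree.predict_proba(out[:, :j])   # P(x_j | x_0 .. x_{j-1})
        classes = tree.classes_
        out[:, j] = [rng.choice(classes, p=row) for row in proba]
    return out
```

Sampling from the predicted conditional distributions, rather than taking the most likely class, preserves the variability of the original data in the synthetic copy.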
Figure 13. Schematic for training and evaluating bias-mitigating approaches.
