PLoS Comput Biol. 2025 May 19;21(5):e1013080.

doi: 10.1371/journal.pcbi.1013080. eCollection 2025 May.

Generative AI mitigates representation bias and improves model fairness through synthetic health data

Raffaele Marchesi¹,², Nicolo Micheletti¹,³, Nicholas I-Hsien Kuo⁴, Sebastiano Barbieri⁴,⁵, Giuseppe Jurman¹,⁶, Venet Osmani⁷

Affiliations

¹ Data Science for Health (DSH), Fondazione Bruno Kessler, Trento, Italy.
² Department of Mathematics, University of Pavia, Pavia, Italy.
³ Department of Computer Science, University of Manchester, Manchester, United Kingdom.
⁴ Centre for Big Data Research in Health, University of New South Wales, Sydney, New South Wales, Australia.
⁵ Queensland Digital Health Centre, University of Queensland, Brisbane, Queensland, Australia.
⁶ Department of Biomedical Sciences, Humanitas University, Milan, Italy.
⁷ Digital Environment Research Institute (DERI), Queen Mary University of London, London, United Kingdom.

PMID: 40388536
PMCID: PMC12112403
DOI: 10.1371/journal.pcbi.1013080

Raffaele Marchesi et al. PLoS Comput Biol. 2025.

Abstract

Representation bias in health data can lead to unfair decisions and compromise the generalisability of research findings. As a consequence, underrepresented subpopulations, such as those from specific ethnic backgrounds or genders, do not benefit equally from clinical discoveries. Several approaches have been developed to mitigate representation bias, ranging from simple resampling methods, such as SMOTE, to recent approaches based on generative adversarial networks (GAN). However, generating high-dimensional time-series synthetic health data remains a significant challenge. In response, we devised a novel architecture (CA-GAN) that synthesises authentic, high-dimensional time series data. CA-GAN outperforms state-of-the-art methods in a qualitative and a quantitative evaluation while avoiding mode collapse, a serious GAN failure. We perform evaluation using 7535 patients with hypotension and sepsis from two diverse, real-world clinical datasets. We show that synthetic data generated by our CA-GAN improves model fairness in Black patients as well as female patients when evaluated separately for each subpopulation. Furthermore, CA-GAN generates authentic data of the minority class while faithfully maintaining the original distribution of data, resulting in improved performance in a downstream predictive task.
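The abstract names SMOTE as an example of a simple resampling method. As a rough illustration of that baseline (not of CA-GAN), the sketch below creates synthetic minority rows by interpolating between a sample and one of its nearest minority-class neighbours. The function name `smote_like_oversample`, the parameter `k`, and the toy data are assumptions made for this example; the reference SMOTE algorithm and library implementations differ in detail.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; not the implementation used in the paper.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, : min(k, n - 1)]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)               # pick a random minority sample
        mate = rng.choice(neighbours[base])  # pick one of its nearest neighbours
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[mate] - minority[base])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_oversample(minority, n_new=10)
print(new_rows.shape)
```

Every synthetic row lies on a line segment between two real rows, so this kind of resampling cannot place points outside the region already spanned by the minority samples.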

Copyright: © 2025 Marchesi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Two-dimensional representations of the acute hypotension dataset for Black patients, including marginal distributions of the principal components.**
Top panels: PCA two-dimensional representation of real (red) and synthetic (blue) data, where CA-GAN provides the best overall coverage of the real data distribution, while SMOTE and WGAN-GP* show evidence of reduced coverage and mode collapse. Bottom panels: t-SNE two-dimensional representation of real data (red) and synthetic data (blue) for the three methods SMOTE, WGAN-GP, and CA-GAN. CA-GAN covers the real distribution more uniformly, while SMOTE misses a substantial part of it (top right of the panel) and the WGAN-GP* samples are almost completely separated from the real data.

**Fig 2. Two-dimensional representations of the sepsis dataset for Black patients, including marginal distributions of the principal components.**
Top panels: PCA two-dimensional representation of real (red) and synthetic (blue) data, where CA-GAN provides more coverage than SMOTE (especially in the top right and bottom left parts of the panel), while WGAN-GP* provides the lowest coverage. Bottom panels: t-SNE two-dimensional representation of real data (red) and synthetic data (blue) for the three methods SMOTE, WGAN-GP, and CA-GAN. SMOTE follows an interpolation pattern, while CA-GAN expands into the latent space, generating authentic data points that remain within the clusters identified by t-SNE. Data generated by WGAN-GP* fall outside the real data.

**Fig 3. Distribution plots of each variable, overlaying real and synthetic data for the acute hypotension dataset.**
The distributions of the blood pressure variables (MAP, diastolic and systolic) are captured well by our method in comparison to WGAN-GP* and SMOTE. CA-GAN also performs better for categorical variables, while all three methods struggle with variables that have long-tailed, non-normal distributions. Top panel: CA-GAN. Middle panel: WGAN-GP. Bottom panel: SMOTE.

**Fig 4. Kendall’s rank correlation coefficients for the real data and the data generated with CA-GAN, WGAN-GP*, and SMOTE.**
Top panel: acute hypotension data. Bottom panel: sepsis data.
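To make the Fig 4 comparison concrete, the sketch below computes a Kendall rank correlation matrix for a real and a synthetic table and summarises how far apart the two matrices are. It uses the simple tau-a variant (ties count as neither concordant nor discordant), and the summary score `correlation_gap` is a hypothetical choice for this example; neither is stated in the caption as the authors' procedure.

```python
import numpy as np

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sign of each pairwise difference; the product is +1 for a concordant
    # pair, -1 for a discordant pair, and 0 when either variable is tied.
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    upper = np.triu_indices(n, k=1)  # count each unordered pair once
    return float((sx * sy)[upper].sum() / (n * (n - 1) / 2))

def kendall_matrix(data):
    """Matrix of pairwise tau-a values between the columns of data."""
    data = np.asarray(data, dtype=float)
    d = data.shape[1]
    out = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            out[i, j] = out[j, i] = kendall_tau_a(data[:, i], data[:, j])
    return out

def correlation_gap(real, synthetic):
    """Mean absolute difference between the two correlation matrices."""
    return float(np.abs(kendall_matrix(real) - kendall_matrix(synthetic)).mean())

# Toy data: column 1 rises with column 0, column 2 falls with it.
real = np.array([[1, 2, 10], [2, 4, 8], [3, 6, 6], [4, 8, 4]])
print(kendall_matrix(real))
```

A gap of zero means the synthetic table reproduces the rank correlations of the real table exactly; larger values indicate that relationships between variables were distorted.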

**Fig 5. Proposed architecture of our CA-GAN. The Generator and the Discriminator are two deep networks with similar structure and number of parameters.**
Both employ three stacked Bidirectional LSTMs (BiLSTMs) to capture the temporal relationships of longitudinal data. They are trained together adversarially, in a minimax game. Conditioning is achieved with static labels, passed as input to both networks. The Generator also takes Gaussian noise as input and generates time-series data (synthetic patients). The Discriminator evaluates the plausibility of the output of the Generator, compared with the real data.
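The caption says the Generator receives Gaussian noise together with static conditioning labels, but it does not say how the two are combined. One common way to condition a recurrent network, assumed here purely for illustration, is to repeat the static labels at every time step and concatenate them with the noise along the feature axis. The function name `build_generator_input` and all dimensions below are hypothetical.

```python
import numpy as np

def build_generator_input(labels, seq_len, noise_dim, seed=0):
    """Assemble a (batch, seq_len, noise_dim + n_labels) input tensor.

    Assumption: static labels are tiled across time steps; the source
    does not confirm that CA-GAN combines its inputs this way.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)  # shape (batch, n_labels)
    batch = labels.shape[0]

    # Gaussian noise, one vector per patient and time step.
    noise = rng.standard_normal((batch, seq_len, noise_dim))

    # Repeat each patient's static labels at every time step.
    tiled = np.repeat(labels[:, None, :], seq_len, axis=1)

    return np.concatenate([noise, tiled], axis=2)

# Two synthetic patients with one-hot static labels over three classes.
labels = np.array([[1, 0, 0], [0, 0, 1]])
gen_input = build_generator_input(labels, seq_len=5, noise_dim=8)
print(gen_input.shape)
```

Under this assumption, a sequence model such as a stacked BiLSTM sees the same label values at each step, so the condition stays available throughout the generated time series.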
