A data-efficient strategy for building high-performing medical foundation models
- PMID: 40044818
- DOI: 10.1038/s41551-025-01365-0
Abstract
Foundation models are pretrained on massive datasets. However, collecting medical datasets is expensive and time-consuming, and raises privacy concerns. Here we show that synthetic data generated via conditioning with disease labels can be leveraged to build high-performing medical foundation models. We pretrained a retinal foundation model, first with approximately one million synthetic retinal images with physiological structures and feature distributions consistent with those of real images, and then with only 16.7% of the 904,170 real-world colour fundus photography images required by a recently reported retinal foundation model (RETFound). The data-efficient model performed as well as or better than RETFound across nine public datasets and four diagnostic tasks; for diabetic-retinopathy grading, it used only 40% of the expert-annotated training data used by RETFound. We also demonstrate the generalizability of the data-efficient strategy by building a classifier for the detection of tuberculosis on chest X-ray images. Text-conditioned generation of synthetic data may enhance the performance and generalization of medical foundation models.
© 2025. The Author(s), under exclusive licence to Springer Nature Limited.
Conflict of interest statement
Competing interests: The authors declare no competing interests.
Grants and funding
- U2001209/National Natural Science Foundation of China (National Science Foundation of China)
- 62472102/National Natural Science Foundation of China (National Science Foundation of China)
- 62372117/National Natural Science Foundation of China (National Science Foundation of China)
- 21ZR1406600/Natural Science Foundation of Shanghai (Natural Science Foundation of Shanghai Municipality)