A vision-language foundation model for the generation of realistic chest X-ray images

Christian Bluethgen et al.

Nat Biomed Eng. 2025 Apr;9(4):494-506. doi: 10.1038/s41551-024-01246-y. Epub 2024 Aug 26.

Abstract

The paucity of high-quality medical imaging datasets could be mitigated by machine learning models that generate compositionally diverse images that faithfully represent medical concepts and pathologies. However, large vision-language models are trained on natural images, and the distribution of the images they generate differs substantially from that of medical images. Moreover, medical language involves a specific and semantically rich vocabulary. Here we describe a domain-adaptation strategy for large vision-language models that overcomes these distributional shifts. Specifically, by leveraging publicly available datasets of chest X-ray images and the corresponding radiology reports, we adapted a latent diffusion model pre-trained on pairs of natural images and text descriptors to generate diverse and visually plausible synthetic chest X-ray images (as confirmed by board-certified radiologists) whose appearance can be controlled with free-form medical text prompts. This strategy for the text-conditioned synthesis of medical images can be used to augment training datasets and is a viable alternative to the sharing of real medical images for model training and fine-tuning.
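The domain-adaptation strategy described here amounts to continuing the standard latent-diffusion training objective on medical image-report pairs. The sketch below illustrates the idea using the Hugging Face diffusers and transformers APIs; the base checkpoint, data handling and most hyperparameters are illustrative assumptions rather than the authors' exact recipe (the learning rate matches the one mentioned in the Fig. 2 caption).

```python
# Minimal sketch: adapting a latent diffusion model pre-trained on
# natural image-text pairs to chest X-rays paired with report text.
# Assumptions: diffusers/transformers APIs; base checkpoint and data
# handling are illustrative, not the authors' exact settings.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

base = "CompVis/stable-diffusion-v1-4"  # pre-trained on natural images
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").eval()
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Fine-tune the denoising U-Net (the text encoder could be unfrozen too).
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-5)

def training_step(cxr_batch: torch.Tensor, reports: list[str]) -> torch.Tensor:
    """cxr_batch: (B, 3, 512, 512) images in [-1, 1]; reports: impression texts."""
    with torch.no_grad():
        # Map CXRs into the VAE latent space (0.18215 is the SD v1 scaling).
        latents = vae.encode(cxr_batch).latent_dist.sample() * 0.18215
    tokens = tokenizer(reports, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length,
                       return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    # Standard denoising objective: add noise at a random timestep and
    # train the U-Net to predict it, conditioned on the report embedding.
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```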


Conflict of interest statement

Competing interests: T.M.A. and S.P. are employees of Stability AI. C.P.L. reports activities not related to the present article: board of directors and shareholder, Bunkerhill Health (3/31/2019); option holder, Whiterabbit.ai (10/01/2017); advisor and option holder, GalileoCDS (05/01/2019); advisor and option holder, Sirona Medical (07/06/2020); advisor and option holder, Adra (09/17/2020); advisor and option holder, Kheiron (10/21/2021); paid consultant, Sixth Street (02/07/2022); paid consultant, Gilmartin Capital (07/18/2022). Recent grant and gift support paid to C.P.L.'s institution: BunkerHill Health, Carestream, CARPL, Clairity, GE Healthcare, Google Cloud, IBM, Kheiron, Lambda, Lunit, Microsoft, Nightingale Open Science, Philips, Siemens Healthineers, Stability.ai, Subtle Medical, VinBrain, Visiana, Whiterabbit.ai, Lowenstein Foundation, Gordon and Betty Moore Foundation. A.S.C. discloses consulting services to Patient Square Capital, Elucid Bioimaging, Skope MR, Culvert Engineering, Edge Analytics, Image Analysis Group and Chondrometrics GmbH; and is a shareholder in LVIS Corp., Subtle Medical and Brain Key. The other authors declare no competing interests.

Figures

Extended Data Fig. 1 | Intra-prompt diversity of synthetic CXRs.
Three generated samples per prompt are shown, using the prompts “Right-sided pleural effusion” (top row) and “Left-sided pleural effusion” (bottom row). Numbers in the three right-hand columns are pairwise MS-SSIM values calculated between each image and the first image.
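MS-SSIM here serves as a similarity score between generated samples, so lower values indicate greater intra-prompt diversity. A minimal sketch of the pairwise computation, assuming the torchmetrics implementation and placeholder tensors in place of generated images:

```python
# Sketch of the per-prompt diversity measurement: MS-SSIM between the
# first generated sample and each subsequent sample for the same prompt.
# Assumes torchmetrics; the random tensors stand in for generated CXRs.
import torch
from torchmetrics.image import MultiScaleStructuralSimilarityIndexMeasure

ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)

def intra_prompt_ms_ssim(samples: torch.Tensor) -> list[float]:
    """samples: (N, 1, H, W) generated CXRs for one prompt, values in [0, 1]."""
    reference = samples[:1]  # the first image acts as the fixed reference
    return [ms_ssim(samples[i:i + 1], reference).item()
            for i in range(1, samples.shape[0])]

# e.g. four 512x512 samples generated from "Right-sided pleural effusion"
scores = intra_prompt_ms_ssim(torch.rand(4, 1, 512, 512))
```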
Extended Data Fig. 2 | Multi-label classification performance of DenseNet-121 trained on increasing amounts of real, synthetic or a blend of real and synthetic CXRs.
AUROC is macro-averaged across the labels available for all test datasets (Atelectasis, Cardiomegaly, Consolidation, Pleural Effusion, Pneumothorax, Pneumonia, No Finding). Light areas indicate 95% confidence intervals.
Extended Data Fig. 3 | Effect of additional synthetic training data on multi-label classification performance.
DenseNet-121 was trained on real and varying amounts of synthetic CXRs, which were obtained by sampling RoentGen multiple (2–4) times per prompt, using 1%, 10% and 100% of D_Dev,PA as the underlying dataset and text-prompt source. The dashed horizontal lines indicate the classification performance of a model trained solely on real CXRs. AUROC is macro-averaged across the labels available for all test datasets (Atelectasis, Cardiomegaly, Consolidation, Pleural Effusion, Pneumothorax, Pneumonia, No Finding). Light areas indicate 95% confidence intervals.
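For reference, the macro-averaged AUROC plotted in Extended Data Figs. 2 and 3 is the unweighted mean of per-label AUROCs over the seven shared labels. A minimal sketch using scikit-learn, with placeholder arrays standing in for classifier outputs:

```python
# Macro-averaged AUROC over the seven labels shared by all test sets,
# as plotted in Extended Data Figs. 2 and 3. scikit-learn API; the
# inputs are placeholders for DenseNet-121 sigmoid outputs.
import numpy as np
from sklearn.metrics import roc_auc_score

LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation",
          "Pleural Effusion", "Pneumothorax", "Pneumonia", "No Finding"]

def macro_auroc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true: (N, 7) binary ground truth; y_score: (N, 7) predicted probabilities."""
    per_label = [roc_auc_score(y_true[:, i], y_score[:, i])
                 for i in range(len(LABELS))]
    return float(np.mean(per_label))
```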
Extended Data Fig. 4 | Educational use of RoentGen to illustrate signs and pathologies.
Left column, Synthetic CXRs created with the prompts “middle lobe pneumonia” and “right lower lobe pneumonia”, showcasing the “loss of silhouette sign” that can help to distinguish middle lobe pneumonia (silhouette absent) from lower lobe pneumonia (silhouette present) on CXR. Center column, Synthetic CXR showing an opacity at the right hilus, and corresponding diffusion attentive attribution map (DAAM) for the expression “hilar mass”. Right column, Synthetic CXR showing another example of discrete middle lobe pneumonia with an air bronchogram (that is, visible endobronchial air against a background of increased lung opacity, top right image), with the main area of the DAAM highlighting the corresponding opacities in the middle lobe (bottom right image).
Fig. 1 | Text-to-image synthesis of CXR images using RoentGen, a medical domain-adapted latent diffusion model (LDM) based on the Stable Diffusion (SD) pipeline.
A conditional denoising U-Net iteratively denoises a latent vector sampled from a Gaussian distribution over t timesteps. The process is conditioned via cross-attention (QKV for query, key, value of the attention process) through embeddings created from short medical free-text inputs processed by a text encoder E_T. The decoder D of the VAE maps the denoised latent vector to pixel space, resulting in a high-fidelity CXR image showing imaging features corresponding to the initial text prompt.
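The pipeline in Fig. 1 corresponds to the standard latent-diffusion sampling loop. A rough sketch using the diffusers API follows; the checkpoint path is a placeholder for the fine-tuned weights, and classifier-free guidance is omitted for brevity.

```python
# Sketch of the Fig. 1 sampling loop: embed the prompt with the text
# encoder E_T, iteratively denoise a Gaussian latent with the
# conditional U-Net, then map to pixel space with the VAE decoder D.
# Assumptions: diffusers API; "path/to/roentgen" is a placeholder, and
# classifier-free guidance is omitted for brevity.
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

path = "path/to/roentgen"  # placeholder for the fine-tuned checkpoint
vae = AutoencoderKL.from_pretrained(path, subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained(path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(path, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(path, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(path, subfolder="scheduler")

@torch.no_grad()
def generate(prompt: str, steps: int = 50) -> torch.Tensor:
    # E_T: embed the medical free-text prompt.
    tok = tokenizer(prompt, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt")
    cond = text_encoder(tok.input_ids)[0]
    # Sample the initial latent from a Gaussian and denoise over t steps.
    latents = torch.randn(1, unet.config.in_channels, 64, 64)
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    # D: decode the denoised latent to pixel space (SD v1 latent scaling).
    return vae.decode(latents / 0.18215).sample

image = generate("big right-sided pleural effusion")
```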
Fig. 2 | Text-conditional synthesis of CXRs.
Samples were created by prompting a fine-tuned model (60k training steps; learning rate 5 × 10⁻⁵; PA view) for typical CXR abnormalities. The generated CXRs feature high levels of detail: when prompted for ‘edema’ (top right), perihilar haziness (white arrowheads) and peribronchial cuffing (black arrowhead), both features seen in pulmonary oedema, can be observed. For ‘pneumothorax’ (bottom row, right image), a fine line representing the visceral pleural lining of the partially collapsed lung can be delineated (dashed line). The dotted regions were added for visualization.
Fig. 3 | Text-conditional appearance of radiological findings.
Here, the presence or absence of a finding (pleural effusions; dotted regions of interest added for visualization) and dimensions such as size and laterality were controlled via prompting. Note that the model correctly incorporated the radiological convention of displaying the right side of the patient on the left side of the image, and vice versa. Each image was selected from four CXRs generated per prompt. Colored text indicates modifiers for size (red), affected side (blue) and negation of a finding (green).
Fig. 4 | Combination of multiple abnormalities in synthetic CXRs.
The left column contains images created by using the respective caption as prompt, while the middle and right columns highlight findings through an overlay of diffusion attentive attribution maps, which are created by upscaling and aggregating cross-attention word–pixel scores.
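Attribution maps of this kind can be produced with the open-source daam package, which hooks into the cross-attention layers of a diffusers pipeline. A hedged sketch under that assumption, with the checkpoint path and queried word as illustrative placeholders:

```python
# Sketch of producing a diffusion attentive attribution map (DAAM) by
# aggregating upscaled cross-attention word-pixel scores, as in Fig. 4
# and Extended Data Fig. 4. Assumes the open-source `daam` package and
# a diffusers pipeline; the checkpoint path is a placeholder.
import torch
from daam import trace
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("path/to/roentgen")  # placeholder

with torch.no_grad(), trace(pipe) as tc:
    image = pipe("hilar mass").images[0]
    # Aggregate cross-attention scores over layers and timesteps,
    # upscaled to image resolution, then query the map for one word.
    heat_map = tc.compute_global_heat_map()
    word_map = heat_map.compute_word_heat_map("mass")
    word_map.plot_overlay(image)  # overlay as in the figure's right column
```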
