J Imaging. 2025 Jul 26;11(8):252.
doi: 10.3390/jimaging11080252.

Synthetic Scientific Image Generation with VAE, GAN, and Diffusion Model Architectures



Zineb Sordo et al. J Imaging. 2025.

Abstract

Generative AI (genAI) has emerged as a powerful tool for synthesizing diverse and complex image data, offering new possibilities for scientific imaging applications. This review presents a comprehensive comparative analysis of leading generative architectures, from Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) to Diffusion Models, in the context of scientific image synthesis. We examine each model's foundational principles, recent architectural advancements, and practical trade-offs. Our evaluation, conducted on domain-specific datasets including microCT scans of rocks and composite fibers, as well as high-resolution images of plant roots, integrates quantitative metrics (SSIM, LPIPS, FID, CLIPScore) with expert-driven qualitative assessments. Results show that GANs, particularly StyleGAN, produced images with high perceptual quality and structural coherence. Diffusion-based models for inpainting and image variation, such as DALL-E 2, delivered high realism and semantic alignment but generally struggled to balance visual fidelity with scientific accuracy. Importantly, our findings reveal the limitations of standard quantitative metrics in capturing scientific relevance, underscoring the need for domain-expert validation. We conclude by discussing key challenges such as model interpretability, computational cost, and verification protocols, and by outlining future directions in which generative AI can drive innovation in data augmentation, simulation, and hypothesis generation in scientific research.
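Among the quantitative metrics above, FID fits a Gaussian to deep features of the real and generated image sets and measures the Fréchet distance between the two fits. As a rough illustration only (toy code of ours, not from the paper), the one-dimensional case of that distance has a simple closed form:

```python
import math
import statistics

def fid_1d(feats_real, feats_fake):
    """Frechet distance between 1-D Gaussian fits of two feature sets.

    Full FID uses multivariate Gaussians over deep network features:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2));
    in one dimension this reduces to the expression below.
    """
    mu1, mu2 = statistics.fmean(feats_real), statistics.fmean(feats_fake)
    v1, v2 = statistics.pvariance(feats_real), statistics.pvariance(feats_fake)
    return (mu1 - mu2) ** 2 + v1 + v2 - 2 * math.sqrt(v1 * v2)

# Identical feature sets score ~0; shifting the mean raises the distance.
print(fid_1d([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))  # ~0.0
print(fid_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0]))  # ~9.0
```

In practice the features come from a pretrained Inception network and the covariance term requires a matrix square root; the scalar version only conveys the shape of the metric.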

Keywords: Generative Adversarial Networks; diffusion; generative AI; image generation; synthetic data.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 8
(a) CycleGAN architecture containing two mapping functions F and G and two associated adversarial discriminators DY and DX. (b) Forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x. (c) Backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y, where blue dots refer to outputs of domain X and red dots to outputs of domain Y. Source: [11].
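The two cycle-consistency terms can be made concrete in a few lines. This is a toy sketch of ours, with scalar lists standing in for images and hand-written lambdas standing in for the learned networks G and F; a real implementation would use network outputs and an L1 norm over pixels:

```python
def l1(a, b):
    # Mean absolute error between two equally sized vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy stand-ins for the learned mappings G: X -> Y and F: Y -> X.
G = lambda x: [2 * v for v in x]   # "generator" X -> Y
F = lambda y: [v / 2 for v in y]   # its (here exact) inverse Y -> X

def cycle_loss(x, y):
    # Forward cycle:  x -> G(x) -> F(G(x)) ~ x
    fwd = l1(F(G(x)), x)
    # Backward cycle: y -> F(y) -> G(F(y)) ~ y
    bwd = l1(G(F(y)), y)
    return fwd + bwd

print(cycle_loss([1.0, 2.0], [4.0, 6.0]))  # 0.0: F inverts G exactly
```

During training this loss is added to the two adversarial losses, pushing F and G toward being approximate inverses of each other.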
Figure 12
Score-based generative modeling through SDEs: input data are transformed into a noise distribution by a continuous-time SDE, and the process is reversed using the score function of the distribution at each intermediate time step. Source: [47].
Figure 13
Cross-attention mechanisms. (Top) Visual and textual embeddings are combined through cross-attention layers that generate spatial attention maps for each text token. (Bottom) The spatial arrangement and geometry of the generated image are guided by the attention maps from a source image. This approach allows various editing tasks to be performed solely by modifying the textual prompt. When replacing a word in the prompt, we insert the source image’s attention maps Mt, replacing the target image maps Mt*, to maintain the original spatial layout. Conversely, when adding a new phrase, we only incorporate the attention maps related to the unchanged part of the prompt. Additionally, the semantic influence of a word can be enhanced or reduced by re-weighting its corresponding attention map. Source: [57].
Figure 14
Diagram of the Latent Diffusion Model (LDM) architecture, where the input image is encoded into a latent vector z through an encoder E, which becomes the input to the forward diffusion process. The denoising U-Net ϵθ uses cross-attention layers to process query, key and value triples (Q, K, V). This setup incorporates conditioning information, such as semantic maps, text and images, to guide the transformation back to pixel space through the decoder block D. Source: [45].
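The cross-attention step in the caption can be sketched in plain Python. This is a minimal, unbatched illustration of ours (no learned projections, no multi-head splitting): queries come from the image latent, while keys and values come from the conditioning embedding:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query position forms
    an attention distribution over the conditioning tokens and returns
    the corresponding weighted sum of their values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # one attention-map row per query
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Two latent positions attending over three conditioning tokens.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0], [2.0], [3.0]]
print(cross_attention(Q, K, V))
```

In the LDM U-Net this block appears at several resolutions, and the `weights` rows are exactly the spatial attention maps visualized in prompt-to-prompt style editing.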
Figure 16
CLIP architecture: CLIP model simultaneously trains an image encoder and a text encoder to correctly match pairs of (image, text) examples within a batch during training. During testing, the trained text encoder produces a zero-shot linear classifier by embedding the names or descriptions of the classes in the target dataset. Source: [62].
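The zero-shot matching step described in the caption reduces to a cosine-similarity argmax once the encoders have produced embeddings. A minimal sketch of ours, with hand-made toy embeddings standing in for encoder outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_pairs(image_embs, text_embs):
    """For each image embedding, pick the text embedding with the highest
    cosine similarity: the zero-shot classification step from the CLIP
    caption, with the encoders abstracted away."""
    matches = []
    for img in image_embs:
        sims = [cosine(img, txt) for txt in text_embs]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches

# Toy embeddings: image i should match text i on the diagonal.
images = [[0.9, 0.1], [0.2, 0.8]]
texts  = [[1.0, 0.0], [0.0, 1.0]]
print(match_pairs(images, texts))  # [0, 1]
```

Training pushes matched (image, text) pairs toward the diagonal of this similarity matrix via a contrastive loss; the sketch only shows the inference side.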
Figure 17
Overview of the DALL-E 2 (or unCLIP) architecture: Above the dotted line is illustrated the CLIP training process, which develops a joint representation space for both text and images. Below the dotted line is the text-to-image-generation pipeline: a CLIP text embedding is first given as input to an autoregressive or diffusion prior to generate an image embedding, which is then used to condition a diffusion decoder that creates the final image. The CLIP model remains frozen during the training of the prior and the decoder. Source: [61].
Figure 18
DiffEdit model diagram: the first step adds noise to the input image and then denoises it twice, once conditioned on the query text and once conditioned on a reference text (or unconditionally). The differences between the denoising results are used to generate a mask. In the second step, the input image is encoded using DDIM to estimate its latent representation. Finally, in the third step, DDIM decoding is performed conditioned on the text query, with the inferred mask guiding the replacement of the background pixels with values obtained from the encoding process at the corresponding timestep. Source: [66].
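The first step, inferring a mask from the disagreement between two denoising results, can be sketched schematically. This toy code of ours uses 1-D lists as stand-ins for images and a fixed threshold; the actual DiffEdit mask is computed from noise estimates averaged over several noise draws:

```python
def estimate_edit_mask(denoised_query, denoised_ref, threshold=0.5):
    """Pixels where the two denoising results disagree most are the
    ones the text edit should touch; everything else stays background."""
    diffs = [abs(a - b) for a, b in zip(denoised_query, denoised_ref)]
    peak = max(diffs) or 1.0  # avoid division by zero on identical inputs
    # Normalize the difference map and binarize it into an edit mask.
    return [1 if d / peak >= threshold else 0 for d in diffs]

# Toy 1-D "images": the two conditionings disagree only in the middle.
query = [0.1, 0.1, 0.9, 0.8, 0.1]
ref   = [0.1, 0.1, 0.2, 0.2, 0.1]
print(estimate_edit_mask(query, ref))  # [0, 0, 1, 1, 0]
```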
Figure 19
Diffusion Transformer (DiT) architecture: on the left, conditional latent DiT models are trained, where the input latent is divided into patches and processed through multiple DiT blocks. On the right, the DiT blocks include various configurations of standard transformer components that integrate conditioning through methods such as adaptive layer normalization, cross-attention, and additional input tokens. Among these, adaptive layer normalization proves to be the most effective. Source: [68].
Figure 20
Comparison of image-generation models for the fiber dataset. DCGAN was trained on fiber images resized to (64,64). DALL-E 2 and DALL-E 3 perform zero-shot image generation from text prompts such as “x-ray image of a composite material with deformed circles as cross-sections”.
Figure 21
Comparison of image-generation models for the root dataset. DCGAN was trained on root images resized to (64,64). DALL-E 2 and DALL-E 3 perform zero-shot image generation from text prompts such as “microscopy image of entangled plant root in hydroponic system”.
Figure 22
Comparison of image-generation models for the rock dataset. DCGAN was trained on rock images resized to (64,64). DALL-E 2 and DALL-E 3 perform zero-shot image generation from text prompts such as “microCT scan of rock sample containing large grains”.
Figure 1
Publications of image-generation papers over the last 15 years. (a) Publication trends collected from all public sources. (b) Publication trends excluding those from arXiv.
Figure 2
Image-generation pipeline: The Input stage processes a combination of text/prompt and scientific images. Next, a single Architecture (VAE, GAN, or Diffusion) is employed based on this input. Finally, Output assessment can be performed either qualitatively, by visualizing the generated image, or quantitatively, using metrics such as SSIM, LPIPS, FID, and CLIPScore.
Figure 3
Variational inference and generative process in the VAE.
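Two ingredients of the variational inference in the figure are compact enough to write down directly: the reparameterization trick, which keeps sampling differentiable, and the closed-form KL term that regularizes the posterior toward N(0, 1). A per-latent-dimension sketch (our own toy code, not from the paper):

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    # z = mu + sigma * eps, eps ~ N(0, 1): moves the randomness outside
    # the computation graph so gradients flow through (mu, log_var).
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * rng.gauss(0.0, 1.0)

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL(q(z|x) || N(0, 1)) for a diagonal Gaussian posterior,
    # the regularization term of the VAE objective, per latent dimension.
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

print(kl_to_standard_normal(0.0, 0.0))  # 0.0: posterior already N(0, 1)
print(kl_to_standard_normal(1.0, 0.0))  # 0.5
```

The full VAE loss sums this KL term over latent dimensions and adds a reconstruction term from the decoder.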
Figure 4
VAE encoder–decoder architecture.
Figure 5
Vanilla GAN architecture, illustrating the generator (taking a noise vector as input) and discriminator (evaluating real and generated images individually).
Figure 6
Architecture of the Conditional GAN, where the vector in green is the conditional or label vector. Source: [36].
Figure 7
Architecture of the generator block of the DCGAN model, composed of convolutional blocks; it takes a latent vector as input and outputs a synthetic image. Source: [37].
Figure 9
(Left) The traditional generator architecture takes a noise vector z as input. (Right) The style-based generator adds a mapping network f and an intermediate latent space W that controls the generator through AdaIN at each convolution layer. w ∈ W is injected through a learned affine transform “A”. Gaussian noise is added after each convolution, before evaluating the nonlinearity, through “B”, which applies learned per-channel scaling factors to the noise input. Source: [12].
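The AdaIN operation mentioned in the caption normalizes each feature channel and then rescales it with style statistics derived from w. A single-channel sketch of ours, with the style mean and standard deviation passed in directly rather than produced by the learned affine transform:

```python
import math

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization for one feature channel:
    standardize the channel's activations, then rescale and shift them
    with style statistics (in StyleGAN these come from "A" applied to w)."""
    n = len(content)
    mu = sum(content) / n
    var = sum((c - mu) ** 2 for c in content) / n
    return [style_std * (c - mu) / math.sqrt(var + eps) + style_mean
            for c in content]

out = adain([1.0, 2.0, 3.0], style_mean=5.0, style_std=2.0)
print([round(v, 3) for v in out])  # channel now has mean ~5, std ~2
```

Because the content statistics are divided out before the style statistics are applied, the style vector fully controls the channel's first and second moments at each layer.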
Figure 10
Diffusion model based on DDPMs: (Top) forward process; (Bottom) reverse process. Source: [45].
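The forward process in the caption has a convenient closed form: any timestep t can be sampled in one jump from x_0 given the cumulative noise schedule ᾱ_t. A toy sketch of ours over scalar lists (alpha_bar_t is the only schedule input):

```python
import math
import random

def forward_diffuse(x0, alpha_bar_t, rng=random):
    """Closed form of the DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    with eps ~ N(0, I), so any timestep is reachable in one step."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    xt = [a * x + b * e for x, e in zip(x0, eps)]
    return xt, eps  # the network is trained to predict eps from (xt, t)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
xt, eps = forward_diffuse(x0, alpha_bar_t=0.9, rng=rng)
print(xt)
```

At alpha_bar_t = 1 the sample is the clean image; as alpha_bar_t goes to 0 it becomes pure Gaussian noise, which is what the reverse process starts from.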
Figure 11
Score-based generative modeling with score matching and Langevin dynamics. Source: [47].
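The Langevin dynamics half of the caption can be demonstrated on a 1-D Gaussian, where the score function is known in closed form. A toy sketch of ours; in score-based models the hand-written score below is replaced by a learned network:

```python
import math
import random

def langevin_sample(score, x0, step=0.01, n_steps=1000, rng=random):
    """Unadjusted Langevin dynamics:
    x <- x + (step / 2) * score(x) + sqrt(step) * z,  z ~ N(0, 1).
    The drift follows the gradient of log-density toward high-probability
    regions while the injected noise keeps the chain exploring."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

# Target N(3, 1): score(x) = d/dx log p(x) = -(x - 3).
rng = random.Random(42)
samples = [langevin_sample(lambda x: -(x - 3.0), 0.0, rng=rng)
           for _ in range(200)]
print(sum(samples) / len(samples))  # close to 3
```

Annealed versions of this sampler, run across a sequence of noise levels, are the bridge from score matching to the SDE formulation in the previous figure.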
Figure 15
InstructPix2Pix method based on training data generation and Diffusion Model training. (a) Fine-tuning GPT-3 to produce editing instructions alongside modified captions. (b) These caption pairs are fed into Stable Diffusion with Prompt-to-Prompt guidance to generate corresponding image pairs. (c) This process results in a dataset with over 450,000 training samples. (d) The authors train the InstructPix2Pix Diffusion Model on this dataset to perform image edits based on textual instructions. During inference, the model can generalize to real-world images and follow human-written editing commands. Source: [58].

References

    1. Sordo Z., Chagnon E., Ushizima D. A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images. arXiv. 2025. arXiv:2502.21151.
    2. Foster D. Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play. 2nd ed. O’Reilly Media; Sebastopol, CA, USA: 2023.
    3. Zhou L., Schellaert W., Martínez-Plumed F., Moros-Daval Y., Ferri C., Hernández-Orallo J. Larger and more instructable language models become less reliable. Nature. 2024;634:61–68. doi: 10.1038/s41586-024-07930-y.
    4. Sun Y., Sheng D., Zhou Z., Wu Y. AI hallucination: Towards a comprehensive classification of distorted information in artificial intelligence-generated content. Humanit. Soc. Sci. Commun. 2024;11:1278. doi: 10.1057/s41599-024-03811-x.
    5. Lucas J.S., Maung B.M., Tabar M., McBride K., Lee D. The Longtail Impact of Generative AI on Disinformation: Harmonizing Dichotomous Perspectives. IEEE Intell. Syst. 2024;39:12–19. doi: 10.1109/MIS.2024.3439109.