J Imaging. 2025 Jul 26;11(8):252.
doi: 10.3390/jimaging11080252.

Synthetic Scientific Image Generation with VAE, GAN, and Diffusion Model Architectures



Zineb Sordo et al. J Imaging. 2025.

Abstract

Generative AI (genAI) has emerged as a powerful tool for synthesizing diverse and complex image data, offering new possibilities for scientific imaging applications. This review presents a comprehensive comparative analysis of leading generative architectures, from Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) to Diffusion Models, in the context of scientific image synthesis. We examine each model's foundational principles, recent architectural advancements, and practical trade-offs. Our evaluation, conducted on domain-specific datasets including microCT scans of rocks and composite fibers, as well as high-resolution images of plant roots, integrates quantitative metrics (SSIM, LPIPS, FID, CLIPScore) with expert-driven qualitative assessments. Results show that GANs, particularly StyleGAN, produced images with high perceptual quality and structural coherence. Diffusion-based models for inpainting and image variation, such as DALL-E 2, delivered high realism and semantic alignment but generally struggled to balance visual fidelity with scientific accuracy. Importantly, our findings reveal the limitations of standard quantitative metrics in capturing scientific relevance, underscoring the need for domain-expert validation. We conclude by discussing key challenges such as model interpretability, computational cost, and verification protocols, and by outlining future directions in which generative AI can drive innovation in data augmentation, simulation, and hypothesis generation in scientific research.
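Among the quantitative metrics above, FID fits a Gaussian to deep features of the real and generated image sets and measures the Fréchet distance between the two fits. As a rough illustration only (toy code of ours, not from the paper), the one-dimensional case of that distance has a simple closed form:

```python
import math
import statistics

def fid_1d(feats_real, feats_fake):
    """Frechet distance between 1-D Gaussian fits of two feature sets.

    Full FID uses multivariate Gaussians over deep network features:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2));
    in one dimension this reduces to the expression below.
    """
    mu1, mu2 = statistics.fmean(feats_real), statistics.fmean(feats_fake)
    v1, v2 = statistics.pvariance(feats_real), statistics.pvariance(feats_fake)
    return (mu1 - mu2) ** 2 + v1 + v2 - 2 * math.sqrt(v1 * v2)

# Identical feature sets score ~0; shifting the mean raises the distance.
print(fid_1d([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))  # ~0.0
print(fid_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0]))  # ~9.0
```

In practice the features come from a pretrained Inception network and the covariance term requires a matrix square root; the scalar version only conveys the shape of the metric.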

Keywords: Generative Adversarial Networks; diffusion; generative AI; image generation; synthetic data.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 8
(a) CycleGAN architecture containing two mapping functions F and G and two associated adversarial discriminators DY and DX. (b) Forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x. (c) Backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y, where blue dots refer to outputs of domain X and red dots to outputs of domain Y. Source: [11].
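The two cycle-consistency terms can be made concrete in a few lines. This is a toy sketch of ours, with scalar lists standing in for images and hand-written lambdas standing in for the learned networks G and F; a real implementation would use network outputs and an L1 norm over pixels:

```python
def l1(a, b):
    # Mean absolute error between two equally sized vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy stand-ins for the learned mappings G: X -> Y and F: Y -> X.
G = lambda x: [2 * v for v in x]   # "generator" X -> Y
F = lambda y: [v / 2 for v in y]   # its (here exact) inverse Y -> X

def cycle_loss(x, y):
    # Forward cycle:  x -> G(x) -> F(G(x)) ~ x
    fwd = l1(F(G(x)), x)
    # Backward cycle: y -> F(y) -> G(F(y)) ~ y
    bwd = l1(G(F(y)), y)
    return fwd + bwd

print(cycle_loss([1.0, 2.0], [4.0, 6.0]))  # 0.0: F inverts G exactly
```

During training this loss is added to the two adversarial losses, pushing F and G toward being approximate inverses of each other.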
Figure 12
Score-based generative modeling through SDEs: input data are transformed into a noise distribution by a continuous-time SDE, and the process is reversed using the score function of the distribution at each intermediate time step. Source: [47].
Figure 13
Cross-attention mechanisms. (Top) Visual and textual embeddings are combined through cross-attention layers that generate spatial attention maps for each text token. (Bottom) The spatial arrangement and geometry of the generated image are guided by the attention maps from a source image. This approach allows various editing tasks to be performed solely by modifying the textual prompt. When replacing a word in the prompt, we insert the source image’s attention maps Mt, replacing the target image maps Mt*, to maintain the original spatial layout. Conversely, when adding a new phrase, we only incorporate the attention maps related to the unchanged part of the prompt. Additionally, the semantic influence of a word can be enhanced or reduced by re-weighting its corresponding attention map. Source: [57].
Figure 14
Diagram of the Latent Diffusion Model (LDM) architecture, where the input image is encoded into a latent vector z through an encoder E, which becomes the input to the forward diffusion process. The denoising U-Net ϵθ uses cross-attention layers to process query, key and value triples (Q, K, V). This setup incorporates conditioning information, such as semantic maps, text and images, to guide the transformation back to pixel space through the decoder block D. Source: [45].
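The cross-attention step in the caption can be sketched in plain Python. This is a minimal, unbatched illustration of ours (no learned projections, no multi-head splitting): queries come from the image latent, while keys and values come from the conditioning embedding:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query position forms
    an attention distribution over the conditioning tokens and returns
    the corresponding weighted sum of their values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # one attention-map row per query
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Two latent positions attending over three conditioning tokens.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0], [2.0], [3.0]]
print(cross_attention(Q, K, V))
```

In the LDM U-Net this block appears at several resolutions, and the `weights` rows are exactly the spatial attention maps visualized in prompt-to-prompt style editing.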
Figure 16
CLIP architecture: CLIP model simultaneously trains an image encoder and a text encoder to correctly match pairs of (image, text) examples within a batch during training. During testing, the trained text encoder produces a zero-shot linear classifier by embedding the names or descriptions of the classes in the target dataset. Source: [62].
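The zero-shot matching step described in the caption reduces to a cosine-similarity argmax once the encoders have produced embeddings. A minimal sketch of ours, with hand-made toy embeddings standing in for encoder outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_pairs(image_embs, text_embs):
    """For each image embedding, pick the text embedding with the highest
    cosine similarity: the zero-shot classification step from the CLIP
    caption, with the encoders abstracted away."""
    matches = []
    for img in image_embs:
        sims = [cosine(img, txt) for txt in text_embs]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches

# Toy embeddings: image i should match text i on the diagonal.
images = [[0.9, 0.1], [0.2, 0.8]]
texts  = [[1.0, 0.0], [0.0, 1.0]]
print(match_pairs(images, texts))  # [0, 1]
```

Training pushes matched (image, text) pairs toward the diagonal of this similarity matrix via a contrastive loss; the sketch only shows the inference side.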
Figure 17
Overview of the DALL-E 2 (or unCLIP) architecture: Above the dotted line is illustrated the CLIP training process, which develops a joint representation space for both text and images. Below the dotted line is the text-to-image-generation pipeline: a CLIP text embedding is first given as input to an autoregressive or diffusion prior to generate an image embedding, which is then used to condition a diffusion decoder that creates the final image. The CLIP model remains frozen during the training of the prior and the decoder. Source: [61].
Figure 18
DiffEdit model diagram: the first step adds noise to the input image and then denoises it twice, once conditioned on the query text and once conditioned on a reference text (or unconditionally). The differences between the denoising results are used to generate a mask. In the second step, the input image is encoded using DDIM to estimate its latent representation. Finally, in the third step, DDIM decoding is performed conditioned on the text query, with the inferred mask guiding the replacement of the background pixels with values obtained from the encoding process at the corresponding timestep. Source: [66].
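The first step, inferring a mask from the disagreement between two denoising results, can be sketched schematically. This toy code of ours uses 1-D lists as stand-ins for images and a fixed threshold; the actual DiffEdit mask is computed from noise estimates averaged over several noise draws:

```python
def estimate_edit_mask(denoised_query, denoised_ref, threshold=0.5):
    """Pixels where the two denoising results disagree most are the
    ones the text edit should touch; everything else stays background."""
    diffs = [abs(a - b) for a, b in zip(denoised_query, denoised_ref)]
    peak = max(diffs) or 1.0  # avoid division by zero on identical inputs
    # Normalize the difference map and binarize it into an edit mask.
    return [1 if d / peak >= threshold else 0 for d in diffs]

# Toy 1-D "images": the two conditionings disagree only in the middle.
query = [0.1, 0.1, 0.9, 0.8, 0.1]
ref   = [0.1, 0.1, 0.2, 0.2, 0.1]
print(estimate_edit_mask(query, ref))  # [0, 0, 1, 1, 0]
```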
Figure 19
Diffusion Transformer (DiT) architecture: on the left, conditional latent DiT models are trained, where the input latent is divided into patches and processed through multiple DiT blocks. On the right, the DiT blocks include various configurations of standard transformer components that integrate conditioning through methods such as adaptive layer normalization, cross-attention, and additional input tokens. Among these, adaptive layer normalization proves to be the most effective. Source: [68].
Figure 20
Comparison of image-generation models for the fiber dataset. DCGAN was trained on fiber images resized to (64,64). DALL-E 2 and DALL-E 3 perform zero-shot image generation from text prompts such as “x-ray image of a composite material with deformed circles as cross-sections”.
Figure 21
Comparison of image-generation models for the root dataset. DCGAN was trained on root images resized to (64,64). DALL-E 2 and DALL-E 3 perform zero-shot image generation from text prompts such as “microscopy image of entangled plant root in hydroponic system”.
Figure 22
Comparison of image-generation models for the rock dataset. DCGAN was trained on rock images resized to (64,64). DALL-E 2 and DALL-E 3 perform zero-shot image generation from text prompts such as “microCT scan of rock sample containing large grains”.
Figure 1
Publications of image-generation papers over the last 15 years. (a) Publication trends collected from all public sources. (b) Publication trends excluding those from arXiv.
Figure 2
Image-generation pipeline: The Input stage processes a combination of text/prompt and scientific images. Next, a single Architecture (VAE, GAN, or Diffusion) is employed based on this input. Finally, Output assessment can be performed either qualitatively, by visualizing the generated image, or quantitatively, using metrics such as SSIM, LPIPS, FID, and CLIPScore.
Figure 3
Variational inference and generative process in the VAE.
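Two ingredients of the variational inference in the figure are compact enough to write down directly: the reparameterization trick, which keeps sampling differentiable, and the closed-form KL term that regularizes the posterior toward N(0, 1). A per-latent-dimension sketch (our own toy code, not from the paper):

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    # z = mu + sigma * eps, eps ~ N(0, 1): moves the randomness outside
    # the computation graph so gradients flow through (mu, log_var).
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * rng.gauss(0.0, 1.0)

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL(q(z|x) || N(0, 1)) for a diagonal Gaussian posterior,
    # the regularization term of the VAE objective, per latent dimension.
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

print(kl_to_standard_normal(0.0, 0.0))  # 0.0: posterior already N(0, 1)
print(kl_to_standard_normal(1.0, 0.0))  # 0.5
```

The full VAE loss sums this KL term over latent dimensions and adds a reconstruction term from the decoder.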
Figure 4
VAE encoder–decoder architecture.
Figure 5
Vanilla GAN architecture, illustrating the generator (taking a noise vector as input) and discriminator (evaluating real and generated images individually).
Figure 6
Architecture of the Conditional GAN, where the vector in green is the conditional or label vector. Source: [36].
Figure 7
Architecture of the generator block of the DCGAN model, composed of convolutional blocks; it takes a latent vector as input and outputs a synthetic image. Source: [37].
Figure 9
(Left) The traditional generator architecture takes a noise vector z as input. (Right) The style-based generator adds a mapping network f and an intermediate latent space W that controls the generator through AdaIN at each convolution layer. w ∈ W is injected through a learned affine transform “A”. Gaussian noise is added after each convolution, before evaluating the nonlinearity, through “B”, which applies learned per-channel scaling factors to the noise input. Source: [12].
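The AdaIN operation mentioned in the caption normalizes each feature channel and then rescales it with style statistics derived from w. A single-channel sketch of ours, with the style mean and standard deviation passed in directly rather than produced by the learned affine transform:

```python
import math

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization for one feature channel:
    standardize the channel's activations, then rescale and shift them
    with style statistics (in StyleGAN these come from "A" applied to w)."""
    n = len(content)
    mu = sum(content) / n
    var = sum((c - mu) ** 2 for c in content) / n
    return [style_std * (c - mu) / math.sqrt(var + eps) + style_mean
            for c in content]

out = adain([1.0, 2.0, 3.0], style_mean=5.0, style_std=2.0)
print([round(v, 3) for v in out])  # channel now has mean ~5, std ~2
```

Because the content statistics are divided out before the style statistics are applied, the style vector fully controls the channel's first and second moments at each layer.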
Figure 10
Diffusion model based on DDPMs: (Top) forward process; (Bottom) reverse process. Source: [45].
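The forward process in the caption has a convenient closed form: any timestep t can be sampled in one jump from x_0 given the cumulative noise schedule ᾱ_t. A toy sketch of ours over scalar lists (alpha_bar_t is the only schedule input):

```python
import math
import random

def forward_diffuse(x0, alpha_bar_t, rng=random):
    """Closed form of the DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    with eps ~ N(0, I), so any timestep is reachable in one step."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    xt = [a * x + b * e for x, e in zip(x0, eps)]
    return xt, eps  # the network is trained to predict eps from (xt, t)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
xt, eps = forward_diffuse(x0, alpha_bar_t=0.9, rng=rng)
print(xt)
```

At alpha_bar_t = 1 the sample is the clean image; as alpha_bar_t goes to 0 it becomes pure Gaussian noise, which is what the reverse process starts from.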
Figure 11
Score-based generative modeling with score matching and Langevin dynamics. Source: [47].
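The Langevin dynamics half of the caption can be demonstrated on a 1-D Gaussian, where the score function is known in closed form. A toy sketch of ours; in score-based models the hand-written score below is replaced by a learned network:

```python
import math
import random

def langevin_sample(score, x0, step=0.01, n_steps=1000, rng=random):
    """Unadjusted Langevin dynamics:
    x <- x + (step / 2) * score(x) + sqrt(step) * z,  z ~ N(0, 1).
    The drift follows the gradient of log-density toward high-probability
    regions while the injected noise keeps the chain exploring."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

# Target N(3, 1): score(x) = d/dx log p(x) = -(x - 3).
rng = random.Random(42)
samples = [langevin_sample(lambda x: -(x - 3.0), 0.0, rng=rng)
           for _ in range(200)]
print(sum(samples) / len(samples))  # close to 3
```

Annealed versions of this sampler, run across a sequence of noise levels, are the bridge from score matching to the SDE formulation in the previous figure.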
Figure 15
InstructPix2Pix method based on training data generation and Diffusion Model training. (a) Fine-tuning GPT-3 to produce editing instructions alongside modified captions. (b) These caption pairs are fed into Stable Diffusion with Prompt-to-Prompt guidance to generate corresponding image pairs. (c) This process results in a dataset with over 450,000 training samples. (d) The authors train the InstructPix2Pix Diffusion Model on this dataset to perform image edits based on textual instructions. During inference, the model can generalize to real-world images and follow human-written editing commands. Source: [58].

References

    1. Sordo Z., Chagnon E., Ushizima D. A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images. arXiv. 2025. arXiv:2502.21151.
    2. Foster D. Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play. 2nd ed. O’Reilly Media; Sebastopol, CA, USA: 2023.
    3. Zhou L., Schellaert W., Martínez-Plumed F., Moros-Daval Y., Ferri C., Hernández-Orallo J. Larger and more instructable language models become less reliable. Nature. 2024;634:61–68. doi: 10.1038/s41586-024-07930-y.
    4. Sun Y., Sheng D., Zhou Z., Wu Y. AI hallucination: Towards a comprehensive classification of distorted information in artificial intelligence-generated content. Humanit. Soc. Sci. Commun. 2024;11:1278. doi: 10.1057/s41599-024-03811-x.
    5. Lucas J.S., Maung B.M., Tabar M., McBride K., Lee D. The Longtail Impact of Generative AI on Disinformation: Harmonizing Dichotomous Perspectives. IEEE Intell. Syst. 2024;39:12–19. doi: 10.1109/MIS.2024.3439109.