Bioinformatics. 2025 Jul 1;41(Supplement_1):i484-i492.
doi: 10.1093/bioinformatics/btaf209.

Generating synthetic genotypes using diffusion models

Philip Kenneweg et al. Bioinformatics.

Abstract

Summary: We introduce the first diffusion model designed to generate complete synthetic human genotypes, which can be expanded into full-length, DNA-level genomes using standard protocols. As measured by established metrics, the synthetic genotypes mimic real human genotypes without merely reproducing known ones. When biomedically relevant classifiers are trained on synthetic genotypes, their accuracy is near-identical to that of classifiers trained on real data. We further demonstrate that augmenting small amounts of real data with synthetically generated genotypes drastically improves performance. This addresses a significant challenge in translational human genetics: real human genotypes, although emerging in large volumes from genome-wide association studies, are sensitive private data, which limits their public availability. Integrating additional, non-sensitive data therefore appears imperative for the rapid sharing of biomedical knowledge of public interest.

Availability and implementation: All non-proprietary data and the code to replicate the experiments are available on GitHub.

Figures

Figure 1.
Overview of the pre-processing pipeline. Each gene, which comprises between 5 and 100 SNPs, is processed independently by a custom PCA.
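
As a rough illustration of the per-gene processing described in the Figure 1 caption, the sketch below fits an ordinary PCA to each gene's block of SNP columns independently. The caption only says the PCA is "custom", so plain SVD-based PCA is a stand-in here; the function name `pca_per_gene`, the `gene_slices` layout, the number of components, and the random toy genotypes are all assumptions made for this example.

```python
import numpy as np

def pca_per_gene(genotypes, gene_slices, n_components=2):
    """Compress each gene's SNP block on its own with plain PCA (via SVD).

    genotypes   : (n_samples, n_snps) array of allele counts (0, 1, 2)
    gene_slices : list of slices, one per gene (hypothetical layout)
    """
    compressed, models = [], []
    for sl in gene_slices:
        block = genotypes[:, sl].astype(float)
        mean = block.mean(axis=0)
        centered = block - mean
        # Rows of vt are the principal directions of this gene's block.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        k = min(n_components, vt.shape[0])
        components = vt[:k]
        compressed.append(centered @ components.T)
        models.append((mean, components))  # kept so the projection can be inverted
    return np.concatenate(compressed, axis=1), models

# Toy data: 50 samples, two "genes" with 5 and 7 SNPs (illustrative only).
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 12))
gene_slices = [slice(0, 5), slice(5, 12)]
embedding, models = pca_per_gene(genotypes, gene_slices, n_components=2)
```

Because every gene has its own mean and component matrix, a generated embedding can be mapped back to SNP space gene by gene with `z @ components + mean`.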
Figure 2.
A structural overview of the architecture of the MLP diffusion model. The input is passed through several down-sampling and up-sampling blocks. Each down-sampling block has a skip connection to its corresponding up-sampling block. The conditioning inputs y and t are injected in the up-projection blocks.
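
The structure described in the Figure 2 caption can be sketched as a forward pass in NumPy. This is not the authors' implementation: the layer widths, the additive skip connections, the concatenation used to inject the conditioning, and the helper `make_condition` (one-hot label plus sinusoidal timestep features) are assumptions chosen to keep the example small, and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Randomly initialised weight matrix and zero bias."""
    return rng.normal(0, 1 / np.sqrt(in_dim), size=(in_dim, out_dim)), np.zeros(out_dim)

def make_condition(y, t, n_classes=2, t_dim=6):
    """Hypothetical conditioning vector: one-hot label y plus sin/cos features of t."""
    one_hot = np.eye(n_classes)[y]
    freqs = 1.0 / (10.0 ** np.arange(t_dim // 2))
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([one_hot, np.sin(angles), np.cos(angles)], axis=1)

class MLPUNet:
    def __init__(self, dim=64, widths=(32, 16), cond_dim=8):
        dims = (dim, *widths)
        # Down blocks shrink the width step by step.
        self.down = [linear(a, b) for a, b in zip(dims[:-1], dims[1:])]
        # Up blocks mirror them and take the conditioning as extra input.
        self.up = [linear(b + cond_dim, a) for a, b in zip(dims[:-1], dims[1:])]

    def forward(self, x, cond):
        skips, h = [], x
        for w, b in self.down:
            skips.append(h)                # saved for the matching up block
            h = np.maximum(h @ w + b, 0)   # linear layer followed by ReLU
        for (w, b), skip in zip(reversed(self.up), reversed(skips)):
            h = np.concatenate([h, cond], axis=1) @ w + b  # inject y, t
            h = h + skip                   # skip connection from the down path
        return h

x = rng.normal(size=(4, 64))
cond = make_condition(np.array([0, 1, 1, 0]), np.array([1.0, 10.0, 100.0, 500.0]))
out = MLPUNet().forward(x, cond)
```

The output has the same shape as the input, which is what a denoising network needs in order to predict a noise vector or a clean sample of matching dimension.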
Figure 3.
Validation loss (a) and reconstruction error (b), i.e. $\lVert x_p - x \rVert$, during training with single-shot denoising for varying UNet architectures within the diffusion model. In (a), the validation loss decreases during training for all models, but by different magnitudes; for the MLP model the decrease is negligible. In (b), the validation reconstruction error decreases evenly for all models.
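
One plausible reading of the reconstruction error in the Figure 3 caption is sketched below, assuming a standard DDPM-style noise-prediction setup, which the caption does not confirm: a clean sample x is noised to step t, the model predicts the noise once, and the single-shot estimate x_p is compared with x. The linear beta schedule, the function name, and the two stand-in predictors are assumptions for illustration.

```python
import numpy as np

def single_shot_reconstruction_error(x, t, alpha_bar, predict_noise, rng):
    """Mean norm of (x_p - x) after one denoising step from timestep t."""
    eps = rng.normal(size=x.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x + np.sqrt(1 - a) * eps          # forward noising
    eps_hat = predict_noise(x_t, t)                      # model's noise estimate
    x_p = (x_t - np.sqrt(1 - a) * eps_hat) / np.sqrt(a)  # single-shot estimate of x
    return np.linalg.norm(x_p - x, axis=1).mean()

# Assumed linear beta schedule with 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
t = 500

# Stand-in predictors: an oracle that recovers the true noise, and one that predicts zero.
def oracle(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * x) / np.sqrt(1 - alpha_bar[t])

def zero(x_t, t):
    return np.zeros_like(x_t)

err_oracle = single_shot_reconstruction_error(x, t, alpha_bar, oracle, rng)
err_zero = single_shot_reconstruction_error(x, t, alpha_bar, zero, rng)
```

A trained network would fall between these two extremes, and tracking this quantity on validation data gives a curve of the kind described for panel (b).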
