[Preprint]. 2025 Jun 6:2025.06.05.657102.
doi: 10.1101/2025.06.05.657102.

Pangenome-aware DeepVariant

Mobin Asri et al. bioRxiv.

Abstract

Population-scale genomic information provides valuable prior knowledge for many genomic analyses, especially variant calling. A notable example is the human pangenome reference released by the Human Pangenome Reference Consortium (HPRC), which has been shown to improve read mapping and structural variant genotyping. In this work, we introduce pangenome-aware DeepVariant, a variant caller that uses a pangenome reference alongside sample-specific read alignments. It generates pileup images of both reads and pangenome haplotypes near potential variants and uses a convolutional neural network to infer genotypes. This approach allows the pangenome to be used directly to distinguish true variant signals from sequencing or alignment noise. We assessed its performance across several short-read sequencing platforms and read mappers. In all settings, pangenome-aware DeepVariant outperformed the linear-reference-based DeepVariant, reducing errors by up to 25.5%. We also show that Element reads with pangenome-aware DeepVariant can achieve 23.6% more accurate variant calling than existing methods.
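As a reading aid, a figure such as "reducing errors by up to 25.5%" is normally a relative reduction in error count between two callers on the same truth set. The sketch below shows that arithmetic. It assumes errors are counted as false positives plus false negatives, which the abstract does not state, and the counts in the usage lines are hypothetical placeholders rather than results from the paper.

```python
def total_errors(false_positives: int, false_negatives: int) -> int:
    """Total error count, assuming errors = FP + FN (an assumption, not from the abstract)."""
    return false_positives + false_negatives


def relative_error_reduction(baseline_errors: int, new_errors: int) -> float:
    """Percentage of baseline errors removed by the new caller."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 100.0 * (baseline_errors - new_errors) / baseline_errors


# Hypothetical counts chosen only to illustrate the formula.
baseline = total_errors(false_positives=400, false_negatives=600)  # 1000 errors
improved = total_errors(false_positives=300, false_negatives=445)  # 745 errors
reduction = relative_error_reduction(baseline, improved)
print(f"relative error reduction: {reduction:.1f}%")
```

A negative return value would mean the new caller makes more errors than the baseline.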

Conflict of interest statement

Competing interest statement: A.C., D.E.C., P.C., L.D., D.R.W., A.K., J.C.M., L.B., and K.S. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. M.A. was an intern at Google LLC during the study. P.E. was an intern at Roche during the study.

Figures

Figure 1: Overview of pangenome-aware DeepVariant.
The HPRC created a pangenome graph from a genetically diverse set of assembled haplotypes. This graph, or a personalized version of it, is fed to pangenome-aware DeepVariant to augment its read pileup images with the related pangenome haplotypes. The caller then uses the pangenome-augmented pileup images to infer sample-specific variant calls. In the example shown in this figure, the true variant is heterozygous (T/G) with a weak signal in the short reads, which leads to a low-confidence call when no pangenome is used. However, because the alternative allele “G” predominates in the human population, pangenome-aware DeepVariant can call this variant with higher confidence.
Figure 2: Benchmarking Illumina and Element calls against the T2T-Q100 and Platinum truth sets.
a) HG002 Illumina and Element calls are benchmarked against the T2T-Q100 truth set across the T2T high-confidence regions. Different modes of DeepVariant (DV) were tested with vg giraffe and BWA-MEM. The x-axis labels containing “Pangenome” refer to pangenome-aware DeepVariant; for the rest, the linear-reference-based DeepVariant was used. The x-axis labels containing “(Hap=full)” refer to using all 88 haplotypes in the HPRC-v1.1 pangenome, and those containing “(Hap=32)” refer to using a personalized pangenome with 32 haplotypes. b) Performance of variant calling across seven GIAB stratifications. The x-axis labels are “WG” (whole genome), “SD” (segmental duplications), “SD (>10kb)” (segmental duplications longer than 10 kb), “All TR” (all tandem repeats), “Low Map” (regions with low mappability), “CDS” (protein-coding DNA sequence), and “All difficult” (all regions difficult for variant calling). The total number of SNP and indel truth variants in each stratification is shown at the bottom of the panel. c) Similar to (b) but for Element calls. d, e, and f) Similar to (a), (b), and (c), respectively, but for the HG001 calls benchmarked against the Platinum truth set (the whole-genome high-confidence BED file is also changed to the one specific to Platinum).
Figure 3: Investigating SNP errors fixed or induced by pangenome-aware DV.
All panels in this figure are based on variant calling with HG002 Element reads mapped with vg giraffe; the truth set for benchmarking was T2T-Q100-v1.1. a) The counts of fixed SNP errors in 100 kb windows are plotted across the GRCh38 chromosomes. The y-axis for each chromosome is scaled separately. The removed FP and rescued FN calls are shown in red and blue, respectively. The bottom track for each chromosome shows 5 different annotations: centromere (black), SDs with identity greater than 99% (orange), SDs with identity between 99% and 98% (yellow), SDs with identity lower than 95% (light gray), and the rest of the genome (white). Below the SD/centromere annotation, the teal tracks show the high-confidence regions for the T2T-Q100 truth set. The horizontal bar plots on the left show the total counts of fixed variants per chromosome. b) The distribution of fixed and induced errors in the p-arm of chr16. Variants are counted in adjacent non-overlapping windows of length 100 kb. The bottom annotation and high-confidence tracks are colored as in panel (a). c) Similar to panel (b) but for the q-arm of chr7. d) The total number of variants fixed by pangenome-aware DV across different chromosomes, stratified by SD identity and centromeric regions. Regions where an SD overlapped a centromere were labeled as centromere. The bars are colored as in panel (a). The bars for removed FP (solid) and rescued FN (hashed) are plotted separately. e) The total size of segmental duplications across different chromosomes, colored by identity as in panel (a). f) HPRC allele frequencies are binned with a bin width of 0.2 on the x-axis. Each fixed variant is associated with an HPRC allele frequency extracted from the raw HPRC VCF file, and the number of fixed variants is counted in each frequency bin. g) Similar to panel (f) but for errors induced by pangenome-aware DV.
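The two counting schemes described in the Figure 3 caption (adjacent non-overlapping 100 kb windows, and allele-frequency bins of width 0.2) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, positions are assumed to be 0-based coordinates on a single chromosome, and a frequency of exactly 1.0 is assumed to fall in the last bin.

```python
from collections import Counter

WINDOW_SIZE = 100_000  # 100 kb, as in the caption


def count_per_window(positions, window_size=WINDOW_SIZE):
    """Count variants in adjacent non-overlapping windows.

    Returns {window_index: count}; window i covers
    [i * window_size, (i + 1) * window_size) in 0-based coordinates.
    """
    counts = Counter(pos // window_size for pos in positions)
    return dict(sorted(counts.items()))


def bin_allele_frequencies(frequencies, n_bins=5):
    """Count variants per allele-frequency bin; n_bins=5 gives bins of width 0.2."""
    counts = [0] * n_bins
    for freq in frequencies:
        if not 0.0 <= freq <= 1.0:
            raise ValueError(f"allele frequency out of range: {freq}")
        # Clamp so that a frequency of exactly 1.0 lands in the last bin.
        index = min(int(freq * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Toy inputs for illustration only; these are not data from the paper.
window_counts = count_per_window([50, 99_999, 100_000, 250_000])
frequency_counts = bin_allele_frequencies([0.05, 0.15, 0.5, 0.95, 1.0])
print(window_counts)
print(frequency_counts)
```

Using integer division for the window index keeps the windows adjacent and non-overlapping by construction, with no explicit list of window boundaries.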
