Nat Methods. 2024 Nov;21(11):2017-2023.
doi: 10.1038/s41592-024-02407-2. Epub 2024 Sep 11.

Personalized pangenome references

Jouni Sirén et al. Nat Methods. 2024 Nov.

Abstract

Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit ( https://github.com/vgteam/vg ) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. This reduces small variant genotyping errors by four times relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
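
The sampling idea can be illustrated with a toy sketch for a single block of the pangenome. This is not the vg implementation: it is a simplified stand-in that assumes haplotypes are plain strings, treats any k-mer that is present in some but not all haplotypes as informative (a proxy for the graph-unique k-mers described in Fig. 1), and uses a hypothetical threshold of 0.75 times the coverage to separate likely heterozygous from likely homozygous k-mers. The function names, the scoring rule and the threshold are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return every k-mer of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def classify(count, coverage):
    """Label a k-mer from its read count (0.75 is a hypothetical cutoff)."""
    if count == 0:
        return "absent"
    return "het" if count < 0.75 * coverage else "hom"


def sample_haplotypes(haplotypes, reads, k, coverage, n):
    """Pick the n haplotypes whose k-mer content best matches the reads.

    haplotypes: dict mapping a haplotype name to its sequence in one block.
    reads: list of read sequences.
    """
    hap_kmers = {name: set(kmers(seq, k)) for name, seq in haplotypes.items()}
    # Assumption: k-mers that vary between haplotypes stand in for
    # graph-unique k-mers; k-mers shared by all haplotypes carry no signal.
    informative = (set().union(*hap_kmers.values())
                   - set.intersection(*hap_kmers.values()))

    counts = Counter(km for read in reads for km in kmers(read, k)
                     if km in informative)

    scores = {}
    for name, present in hap_kmers.items():
        score = 0
        for km in informative:
            label = classify(counts[km], coverage)
            if label == "het":
                continue  # heterozygous k-mers neither reward nor penalize
            # Reward agreement: the haplotype has a "hom" k-mer,
            # or lacks an "absent" one.
            agrees = (km in present) == (label == "hom")
            score += 1 if agrees else -1
        scores[name] = score

    # Highest score first; ties are broken by name for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))[:n]
```

With three toy haplotypes and reads drawn only from one of them, the sketch selects that haplotype. A real implementation would work over many blocks of a graph and join the selected haplotypes across block boundaries, which this example does not attempt.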

Conflict of interest statement

Competing interests

P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. The other authors declare no competing interests.

Figures

Fig. 1 | Illustrating haplotype sampling at adjacent blocks in the pangenome.
a, A variation graph representing adjacent locations in the pangenome, composed of a bidirected sequence graph (top) and a set of embedded reference haplotypes (below); vertical alignment and base labels are used to indicate the correspondence between each haplotype and its path within the sequence graph; the dotted lines represent the boundary between the two blocks; for clarity, non-varying bases (those present in all haplotypes) are omitted. b, k-mers that occur once within the graph, termed graph-unique k-mers, are identified in the haplotypes; here k = 5 and graph-unique k-mers are colored red. The presence and absence of these graph-unique k-mers identifies each haplotype. c, The graph-unique k-mers are counted in the reads (here each read is a rectangle with only reads containing an informative k-mer shown), and based on counts classified as present, likely heterozygous (shown in orange), present, likely homozygous (shown in blue) or absent (all red k-mers in b not identified in the reads). d, Using the identified graph-unique k-mer classifications, a subset of reference haplotypes is selected at each location, defining a personalized pangenome reference subgraph of the larger graph (grayed nodes are not part of the subgraph, and only the shown embedded haplotypes are included). Where needed, recombinations are introduced (lightning bolt) to create contiguous haplotypes.
Fig. 2 | Mapping 30× NovaSeq reads for HG002 to GRCh38 (with BWA-MEM) and to HPRC graphs (with Giraffe).
The graphs (y axis) are Minigraph–Cactus graphs built using GRCh38 as the reference. For the sampled graphs, we tested sampling 4, 8, 16 and 32 haplotypes. For the v.1.1 diploid graph, 32 candidate haplotypes were used for diploid sampling. We show the overall running time and the time spent for mapping only (left), and the fraction of reads with an exact, gapless, properly paired and Mapq 60 alignment (right).
Fig. 3 | Small variants evaluation across samples HG001 to HG005.
a, The number of false positive (FPs) and false negative (FNs) indels and single-nucleotide polymorphisms (SNPs) across four different graphs, each using GRCh38 as the reference: v.1.1 filtered, v.1.1 sampled with four and eight haplotypes and v.1.1 diploid, using the Giraffe–DeepVariant pipeline. b, Comparing the Giraffe–DeepVariant using the v.1.1 diploid graph to BWA-MEM–DeepVariant and GATK best-practice pipelines, both using the GRCh38 reference. c, The performance of the Giraffe–DeepVariant pipeline using the v.1.1 diploid graph with different coverage levels of NovaSeq reads (20×, 30× and 40×). d, Comparing the number of errors using either NovaSeq 40× data or Element 36× 1,000 bp insert data; in both cases, using the Giraffe–DeepVariant pipeline with the v.1.1 diploid graph. HG005 Element sequencing data were not available for comparison.
Fig. 4 | SVs benchmark evaluation.
a, Precision, recall and F1 scores of both vg call and PanGenie for different pangenome reference graphs on the GIAB v.0.6 Tier1 call set. Graphs were built using GRCh38 as the reference. b, As with a but using a benchmark set of SVs created with DipCall from the T2T v.0.9 HG002 genome assembly, comparing genome wide but excluding centromeres. c, Comparing the performance of PanGenie and vg call using the v.1.1 diploid graph to other genotyping methods. Illumina short reads were used with Delly, SvABA, Scalpel, Manta and MetaSV as well as with vg call and PanGenie. Also shown are long-read methods (CuteSV, Sniffles2 (ref. 35), Hapdup and HPRC de novo assemblies).
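
For readers less familiar with the metrics named in Figs. 3 and 4, the following minimal sketch shows how precision, recall and F1 follow from true positive, false positive and false negative counts. The function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from benchmark counts.

    tp: calls that match the truth set (true positives)
    fp: calls that are not in the truth set (false positives)
    fn: truth-set variants that were not called (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it falls sharply when either precision or recall is low, which is why it is often reported alongside both rather than in place of them.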

References

    1. Eizenga JM et al. Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020).
    2. Garrison E et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
    3. Rautiainen M & Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
    4. Sirén J et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
    5. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).