Personalized pangenome references

Jouni Sirén¹, Parsa Eskandar², Matteo Tommaso Ungaro²,³, Glenn Hickey², Jordan M Eizenga², Adam M Novak², Xian Chang², Pi-Chuan Chang⁴, Mikhail Kolmogorov⁵, Andrew Carroll⁴, Jean Monlong²,⁶, Benedict Paten⁷

Affiliations

¹ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA. jlsiren@ucsc.edu.
² UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
³ University of Ferrara, Ferrara, Italy.
⁴ Google LLC, Mountain View, CA, USA.
⁵ Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
⁶ Institut de Recherche en Santé Digestive, Université de Toulouse, INSERM, INRA, ENVT, UPS, Toulouse, France.
⁷ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA. bpaten@ucsc.edu.

PMID: 39261641
PMCID: PMC12643174
DOI: 10.1038/s41592-024-02407-2

Jouni Sirén et al. Nat Methods. 2024 Nov;21(11):2017-2023. doi: 10.1038/s41592-024-02407-2. Epub 2024 Sep 11.

Abstract

Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, by causing false read mappings. These irrelevant variants generally have lower allele frequencies, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. The approach reduces small variant genotyping errors fourfold relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.

Conflict of interest statement

Competing interests

P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. The other authors declare no competing interests.

Figures

**Fig. 1 |. Illustrating haplotype sampling at adjacent blocks in the pangenome.**
a, A variation graph representing adjacent locations in the pangenome, composed of a bidirected sequence graph (top) and a set of embedded reference haplotypes (below); vertical alignment and base labels are used to indicate the correspondence between each haplotype and its path within the sequence graph; the dotted lines represent the boundary between the two blocks; for clarity, non-varying bases (those present in all haplotypes) are omitted. b, k-mers that occur once within the graph, termed graph-unique k-mers, are identified in the haplotypes; here k = 5 and graph-unique k-mers are colored red. The presence and absence of these graph-unique k-mers identifies each haplotype. c, The graph-unique k-mers are counted in the reads (here each read is a rectangle with only reads containing an informative k-mer shown), and based on counts classified as present, likely heterozygous (shown in orange), present, likely homozygous (shown in blue) or absent (all red k-mers in b not identified in the reads). d, Using the identified graph-unique k-mer classifications, a subset of reference haplotypes is selected at each location, defining a personalized pangenome reference subgraph of the larger graph (grayed nodes are not part of the subgraph, and only the shown embedded haplotypes are included). Where needed, recombinations are introduced (lightning bolt) to create contiguous haplotypes.
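
The steps in panels b to d can be illustrated with a small, self-contained Python sketch. It is not the vg implementation: the haplotypes are plain equal-length strings rather than paths in a sequence graph, "graph-unique" is approximated as "starts at exactly one alignment column", and the function names, the 0.75 × coverage threshold and the scoring rule are all hypothetical choices made only for illustration.

```python
from collections import Counter, defaultdict


def kmer_positions(seq, k):
    """Return (start, k-mer) pairs for every k-mer in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def graph_unique_kmers(haplotypes, k):
    """Map each haplotype name to the unique k-mers it contains.

    Assumption: haplotypes are aligned, equal-length strings, so a k-mer
    that starts at exactly one column stands in for a k-mer that occurs
    once in the graph.
    """
    columns = defaultdict(set)
    for seq in haplotypes.values():
        for pos, kmer in kmer_positions(seq, k):
            columns[kmer].add(pos)
    unique = {kmer for kmer, cols in columns.items() if len(cols) == 1}
    return {
        name: {kmer for _, kmer in kmer_positions(seq, k) if kmer in unique}
        for name, seq in haplotypes.items()
    }


def classify(count, coverage):
    """Classify a k-mer by its read count (threshold is an assumption)."""
    if count == 0:
        return "absent"
    return "heterozygous" if count < 0.75 * coverage else "homozygous"


def kmer_score(has_kmer, call):
    """Hypothetical scoring rule: reward agreement with the reads."""
    if call == "absent":
        return -1 if has_kmer else 1
    if call == "homozygous":
        return 1 if has_kmer else -1
    return 1 if has_kmer else 0  # heterozygous: lacking it is not penalized


def sample_haplotypes(haplotypes, reads, k, coverage, n):
    """Pick the n haplotypes whose k-mers best match the read counts."""
    per_hap = graph_unique_kmers(haplotypes, k)
    informative = set().union(*per_hap.values())
    counts = Counter(kmer for read in reads for _, kmer in kmer_positions(read, k))
    calls = {kmer: classify(counts[kmer], coverage) for kmer in informative}
    scores = {
        name: sum(kmer_score(kmer in present, call) for kmer, call in calls.items())
        for name, present in per_hap.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n], calls


# Toy block with three reference haplotypes; the sample carries hapA and hapB.
haplotypes = {"hapA": "AACCGGTT", "hapB": "AACTGGTT", "hapC": "AAGCGGTT"}
reads = ["AACCGGTT"] * 2 + ["AACTGGTT"] * 2
selected, calls = sample_haplotypes(haplotypes, reads, k=4, coverage=4, n=2)
print(selected)
```

In this toy block the k-mers private to hapC are never seen in the reads, so hapC scores lowest and is left out of the sampled subset. Stitching the per-block selections into contiguous haplotypes (the recombinations in panel d) is not modeled here.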

**Fig. 2 |. Mapping 30× NovaSeq reads for HG002 to GRCh38 (with BWA-MEM) and to HPRC graphs (with Giraffe).**
The graphs (y axis) are Minigraph–Cactus graphs built using GRCh38 as the reference. For the sampled graphs, we tested sampling 4, 8, 16 and 32 haplotypes. For the v.1.1 diploid graph, 32 candidate haplotypes were used for diploid sampling. We show the overall running time and the time spent for mapping only (left), and the fraction of reads with an exact, gapless, properly paired and MAPQ 60 alignment (right).

**Fig. 3 |. Small variants evaluation across samples HG001 to HG005.**
a, The number of false positive (FPs) and false negative (FNs) indels and single-nucleotide polymorphisms (SNPs) across four different graphs, each using GRCh38 as the reference: v.1.1 filtered, v.1.1 sampled with four and eight haplotypes and v.1.1 diploid, using the Giraffe–DeepVariant pipeline. b, Comparing the Giraffe–DeepVariant using the v.1.1 diploid graph to BWA-MEM–DeepVariant and GATK best-practice pipelines, both using the GRCh38 reference. c, The performance of the Giraffe–DeepVariant pipeline using the v.1.1 diploid graph with different coverage levels of NovaSeq reads (20×, 30× and 40×). d, Comparing the number of errors using either NovaSeq 40× data or Element 36× 1,000 bp insert data; in both cases, using the Giraffe–DeepVariant pipeline with the v.1.1 diploid graph. HG005 Element sequencing data were not available for comparison.

**Fig. 4 |. SVs benchmark evaluation.**
a, Precision, recall and F1 scores of both vg call and PanGenie for different pangenome reference graphs on the GIAB v.0.6 Tier1 call set. Graphs were built using GRCh38 as the reference. b, As in a, but using a benchmark set of SVs created with DipCall from the T2T v.0.9 HG002 genome assembly, comparing genome wide but excluding centromeres. c, Comparing the performance of PanGenie and vg call using the v.1.1 diploid graph to other genotyping methods. Illumina short reads were used with Delly, SVaBA, Scalpel, Manta and MetaSV as well as with vg call and PanGenie. Also shown are long-read methods (CuteSV, Sniffles2 (ref. 35), Hapdup and HPRC de novo assemblies).
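
For readers unfamiliar with the metrics in Fig. 4, the following minimal Python sketch shows how precision, recall and F1 follow from true positive, false positive and false negative counts. The function name and the counts in the usage example are made up for illustration and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from call counts.

    tp: calls that match the benchmark set
    fp: calls with no match in the benchmark set
    fn: benchmark variants that were not called
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative counts only (not taken from the benchmark).
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so a method cannot score well by maximizing only one of the two.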

Update of

Sirén J, Eskandar P, Ungaro MT, Hickey G, Eizenga JM, Novak AM, Chang X, Chang PC, Kolmogorov M, Carroll A, Monlong J, Paten B. Personalized Pangenome References. bioRxiv [Preprint]. 2023 Dec 15:2023.12.13.571553. doi: 10.1101/2023.12.13.571553. Update in: Nat Methods. 2024 Nov;21(11):2017-2023. doi: 10.1038/s41592-024-02407-2. PMID: 38168361.

References

1. Eizenga JM et al. Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020).
2. Garrison E et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
3. Rautiainen M & Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
4. Sirén J et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
5. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
