A strategy for building and using a human reference pangenome

Bastien Llamas^#¹, Giuseppe Narzisi^#², Valerie Schneider^#³, Peter A Audano⁴, Evan Biederstedt⁵,⁶, Lon Blauvelt⁷, Peter Bradbury⁸, Xian Chang⁷, Chen-Shan Chin⁹, Arkarachai Fungtammasan⁹, Wayne E Clarke², Alan Cleary¹⁰, Jana Ebler¹¹, Jordan Eizenga⁷, Jonas A Sibbesen⁷, Charles J Markello⁷, Erik Garrison⁷, Shilpa Garg¹², Glenn Hickey⁷, Gerard R Lazo¹³, Michael F Lin¹⁴, Medhat Mahmoud¹⁵, Tobias Marschall¹¹, Ilia Minkin¹⁶, Jean Monlong⁷, Rajeeva L Musunuri², Sagayamary Sagayaradj¹⁷,¹⁸, Adam M Novak⁷, Mikko Rautiainen¹¹, Allison Regier¹⁹, Fritz J Sedlazeck¹⁵, Jouni Siren⁷, Yassine Souilmi¹, Justin Wagner²⁰, Travis Wrightsman²¹, Toshiyuki T Yokoyama²², Qiandong Zeng²³, Justin M Zook²⁰, Benedict Paten⁷, Ben Busby³

Affiliations

¹ Australian Centre for Ancient DNA, School of Biological Sciences, Environment Institute, The University of Adelaide, Adelaide, South Australia, 5005, Australia.
² New York Genome Center, New York, NY, 10013, USA.
³ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
⁴ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, 98195, USA.
⁵ Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, 10065, USA.
⁶ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02215, USA.
⁷ Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, 95064, USA.
⁸ Robert W. Holley Center, USDA-ARS, Ithaca, NY, 14853, USA.
⁹ DNAnexus, Mountain View, CA, 94040, USA.
¹⁰ National Center for Genome Resources, Santa Fe, NM, 87505, USA.
¹¹ Max Planck Institute for Informatics, Saarbrücken, Germany.
¹² Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA.
¹³ Western Regional Research Center, USDA-ARS, Albany, CA, 94710-1105, USA.
¹⁴ mlin.net LLC, San Jose, CA, 95113, USA.
¹⁵ Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
¹⁶ Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA.
¹⁷ Genome Center, University of California, Davis, Davis, CA, USA.
¹⁸ BASF, West Sacramento, CA, USA.
¹⁹ McDonnell Genome Institute, Washington University in St Louis, St Louis, MO, 63108, USA.
²⁰ Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA.
²¹ Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, 14853, USA.
²² Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan.
²³ Laboratory Corporation of America Holdings, Westborough, MA, 01581, USA.

^# Contributed equally.

PMID: 34386196
PMCID: PMC8350888
DOI: 10.12688/f1000research.19630.2

Bastien Llamas et al. F1000Res. 2019 Oct 14:8:1751. doi: 10.12688/f1000research.19630.2. eCollection 2019.

Abstract

In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome, and to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including the advantages of graph genomes over a linear reference representation, the design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. This manuscript summarizes the questions raised and the tools developed.

Keywords: Graph Genome; Hackathon; Pangenome; RNAseq; Structural Variant.

Conflict of interest statement

No competing interests were disclosed.

Figures

**Figure 1. Proposed graph coordinate system to represent multiple haplotypes.**
A) Example of a GFA file (https://github.com/GFA-spec/GFA-spec) that represents a reference genome and one alternate haplotype. The first line, beginning with ‘H’, is the header, with an optional ‘VN’ SAM-style tag giving the version number. Nodes, represented by lines starting with ‘S’, have a name in the second column and a nucleotide sequence in the third column. Edges, represented by lines starting with ‘L’, connect nodes whose sequences appear adjacent to each other in one of the haplotypes. The node names appear in the second and fourth columns, and the orientations appear in the third and fifth columns. The line beginning with ‘P’ is from GFA version 1, and encodes subgraphs and paths. B) A path file accompanying the GFA file includes paths for the reference genome and haplotype 1. The haplotype name is in column 2, and the sequence of nodes and their orientations is in column 3. The nucleotide sequence for any haplotype can be resolved by reading out the sequence of each node along the path. C) Visualization of A using path labels from B. The red path represents ref1, while the blue path represents haplotype ref1@h1.
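
The way a haplotype sequence is read out of such a file can be sketched in a few lines of Python. This is a minimal illustration, not the codeathon's tooling: the toy GFA text, the node names, and the helper names `parse_gfa` and `path_sequence` are all hypothetical, and the sketch assumes GFA version 1 records with zero-length overlaps between adjacent nodes.

```python
from typing import Dict, List, Tuple

# Hypothetical toy GFA text (NOT the file shown in Figure 1): a reference
# path and one alternate haplotype that differ at a single base.
TOY_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tA",
    "S\t4\tGGC",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tref1\t1+,2+,4+\t*",
    "P\tref1@h1\t1+,3+,4+\t*",
])

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_gfa(text: str):
    """Collect S (node), L (edge) and P (path) records from GFA1 text."""
    nodes: Dict[str, str] = {}
    edges: List[Tuple[str, str, str, str]] = []
    paths: Dict[str, List[Tuple[str, str]]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        kind = fields[0]
        if kind == "S":    # S <name> <sequence>
            nodes[fields[1]] = fields[2]
        elif kind == "L":  # L <from> <orient> <to> <orient> <overlap>
            edges.append((fields[1], fields[2], fields[3], fields[4]))
        elif kind == "P":  # P <path name> <node+/-,...> <overlaps>
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return nodes, edges, paths


def path_sequence(nodes: Dict[str, str], steps: List[Tuple[str, str]]) -> str:
    """Spell a path by concatenating node sequences in path order.

    Assumes adjacent nodes do not overlap (0M), as in the toy input.
    """
    parts = []
    for name, orient in steps:
        seq = nodes[name]
        parts.append(seq if orient == "+" else reverse_complement(seq))
    return "".join(parts)


nodes, edges, paths = parse_gfa(TOY_GFA)
ref_seq = path_sequence(nodes, paths["ref1"])     # "ACGTGGC"
h1_seq = path_sequence(nodes, paths["ref1@h1"])   # "ACGAGGC"
```

In this toy graph the two paths share nodes 1 and 4 and diverge only through node 2 or node 3, so the shared sequence is stored once.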

**Figure 2. Pipeline diagram of the mapper.**
Input reads are scanned for minimizers, which are searched against a precomputed minimizer index of the graph reference. Minimizer hits for sufficiently rare minimizers are located in graph space, and the hits for all minimizers are clustered. The clusters are extended gaplessly, with a tolerance for mismatches. If a cluster produces a single full-length gapless extension, it is output as the alignment. Otherwise, partial gapless extensions are chained together by performing alignments of the intervening sequences and graph paths that connect them.
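
The seeding stage of this pipeline can be sketched as follows. This is a simplified illustration under stated assumptions, not the mapper's implementation: it indexes a linear string as a stand-in for the graph reference, orders k-mers lexicographically rather than by hash, and uses hypothetical names (`minimizers`, `build_index`, `seed_hits`, `max_occurrences`). Clustering, gapless extension, and chaining are not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) for the smallest k-mer in every window
    of w consecutive k-mers; ties go to the leftmost k-mer."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def build_index(ref: str, k: int, w: int) -> Dict[str, List[int]]:
    """Precompute a minimizer index: k-mer -> positions in the reference."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos, kmer in minimizers(ref, k, w):
        index[kmer].append(pos)
    return index


def seed_hits(read: str, index: Dict[str, List[int]], k: int, w: int,
              max_occurrences: int) -> List[Tuple[int, int]]:
    """Look up read minimizers, keeping only sufficiently rare ones.

    Returns (read_position, reference_position) pairs.
    """
    hits = []
    for read_pos, kmer in minimizers(read, k, w):
        positions = index.get(kmer, [])
        if 0 < len(positions) <= max_occurrences:
            hits.extend((read_pos, ref_pos) for ref_pos in positions)
    return hits


index = build_index("ACGTACGT", k=3, w=2)
rare_only = seed_hits("GTACG", index, k=3, w=2, max_occurrences=1)
all_hits = seed_hits("GTACG", index, k=3, w=2, max_occurrences=2)
```

With `max_occurrences=1` the repeated k-mer `ACG` is ignored and only the unique seed remains. Raising the threshold admits both of its occurrences, and a later clustering step would have to choose between them.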

**Figure 3. Pipeline diagram for mapper evaluation on *Zea mays* graphs.**
After constructing graphs with vg construct and with minimap2 and seqwish (Graph method 1), we sought to simulate reads from the vg construct graph, align them to the minimap2/seqwish graph with our faster, better short-read mapper with hit chaining, and then evaluate the mapper’s accuracy based on the simulated reads’ original and realigned positions along corresponding positional paths in the two graphs.

**Figure 4.**
Adding an additional haplotype, going from A to B. The existing sequence and its coordinates remain the same even though the nodes and edges change.
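
The stability property described in this caption can be illustrated with a small sketch. It assumes a deliberately simplified graph model (forward-only paths stored as lists of node names) and hypothetical helpers `spell` and `split_node`; the node names and sequences are invented for the example and are not taken from the figure.

```python
from typing import Dict, List, Tuple


def spell(nodes: Dict[str, str], path: List[str]) -> str:
    """Concatenate node sequences along a forward-only path."""
    return "".join(nodes[name] for name in path)


def split_node(nodes: Dict[str, str], paths: Dict[str, List[str]],
               name: str, offset: int) -> Tuple[str, str]:
    """Split one node at `offset` and rewrite every path that visits it."""
    seq = nodes.pop(name)
    left, right = name + "a", name + "b"
    nodes[left], nodes[right] = seq[:offset], seq[offset:]
    for path_name, steps in paths.items():
        new_steps: List[str] = []
        for step in steps:
            new_steps.extend([left, right] if step == name else [step])
        paths[path_name] = new_steps
    return left, right


# State A: the reference is a single node.
nodes = {"1": "ACGTGGC"}
paths = {"ref1": ["1"]}
before = spell(nodes, paths["ref1"])

# State B: add a haplotype carrying a T>A substitution at ref1 offset 3.
prefix, rest = split_node(nodes, paths, "1", 3)       # "ACG" | "TGGC"
ref_allele, suffix = split_node(nodes, paths, rest, 1)  # "T" | "GGC"
nodes["alt"] = "A"
paths["ref1@h1"] = [prefix, "alt", suffix]

after = spell(nodes, paths["ref1"])
```

The reference path now visits three nodes instead of one, yet it spells the same string, so a coordinate expressed as (path name, offset) still points at the same base.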
