Nature. 2023 May;617(7960):312-324.

doi: 10.1038/s41586-023-05896-x. Epub 2023 May 10.

A draft human pangenome reference

Wen-Wei Liao^#^{1,2,3}, Mobin Asri^#⁴, Jana Ebler^#^{5,6}, Daniel Doerr^{5,6}, Marina Haukness⁴, Glenn Hickey⁴, Shuangjia Lu^{1,2}, Julian K Lucas⁴, Jean Monlong⁴, Haley J Abel⁷, Silvia Buonaiuto⁸, Xian H Chang⁴, Haoyu Cheng^{9,10}, Justin Chu⁹, Vincenza Colonna^{8,11}, Jordan M Eizenga⁴, Xiaowen Feng^{9,10}, Christian Fischer¹¹, Robert S Fulton^{12,13}, Shilpa Garg¹⁴, Cristian Groza¹⁵, Andrea Guarracino^{11,16}, William T Harvey¹⁷, Simon Heumos^{18,19}, Kerstin Howe²⁰, Miten Jain²¹, Tsung-Yu Lu²², Charles Markello⁴, Fergal J Martin²³, Matthew W Mitchell²⁴, Katherine M Munson¹⁷, Moses Njagi Mwaniki²⁵, Adam M Novak⁴, Hugh E Olsen⁴, Trevor Pesout⁴, David Porubsky¹⁷, Pjotr Prins¹¹, Jonas A Sibbesen²⁶, Jouni Sirén⁴, Chad Tomlinson¹², Flavia Villani¹¹, Mitchell R Vollger^{17,27}, Lucinda L Antonacci-Fulton¹², Gunjan Baid²⁸, Carl A Baker¹⁷, Anastasiya Belyaeva²⁸, Konstantinos Billis²³, Andrew Carroll²⁸, Pi-Chuan Chang²⁸, Sarah Cody¹², Daniel E Cook²⁸, Robert M Cook-Deegan²⁹, Omar E Cornejo³⁰, Mark Diekhans⁴, Peter Ebert^{5,6,31}, Susan Fairley²³, Olivier Fedrigo³², Adam L Felsenfeld³³, Giulio Formenti³², Adam Frankish²³, Yan Gao³⁴, Nanibaa' A Garrison^{35,36,37}, Carlos Garcia Giron²³, Richard E Green^{38,39}, Leanne Haggerty²³, Kendra Hoekzema¹⁷, Thibaut Hourlier²³, Hanlee P Ji⁴⁰, Eimear E Kenny⁴¹, Barbara A Koenig⁴², Alexey Kolesnikov²⁸, Jan O Korbel^{23,43}, Jennifer Kordosky¹⁷, Sergey Koren⁴⁴, HoJoon Lee⁴⁰, Alexandra P Lewis¹⁷, Hugo Magalhães^{5,6}, Santiago Marco-Sola^{45,46}, Pierre Marijon^{5,6}, Ann McCartney⁴⁴, Jennifer McDaniel⁴⁷, Jacquelyn Mountcastle³², Maria Nattestad²⁸, Sergey Nurk⁴⁴, Nathan D Olson⁴⁷, Alice B Popejoy⁴⁸, Daniela Puiu⁴⁹, Mikko Rautiainen⁴⁴, Allison A Regier¹², Arang Rhie⁴⁴, Samuel Sacco³⁰, Ashley D Sanders⁵⁰, Valerie A Schneider⁵¹, Baergen I Schultz³³, Kishwar Shafin²⁸, Michael W Smith³³, Heidi J Sofia³³, Ahmad N Abou Tayoun^{52,53}, Françoise Thibaud-Nissen⁵¹, Francesca Floriana Tricomi²³, Justin Wagner⁴⁷, Brian Walenz⁴⁴, Jonathan M D Wood²⁰, Aleksey V Zimin^{49,54}, Guillaume Bourque^{55,56,57}, Mark J P Chaisson²², Paul Flicek²³, Adam M Phillippy⁴⁴, Justin M Zook⁴⁷, Evan E Eichler^{17,58}, David Haussler^{4,58}, Ting Wang^{12,13}, Erich D Jarvis^{32,58,59}, Karen H Miga⁴, Erik Garrison⁶⁰, Tobias Marschall^{61,62}, Ira M Hall^{63,64}, Heng Li^{65,66}, Benedict Paten⁶⁷

Affiliations

¹ Department of Genetics, Yale University School of Medicine, New Haven, CT, USA.
² Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA.
³ Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA.
⁴ Genomics Institute, University of California, Santa Cruz, CA, USA.
⁵ Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
⁶ Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany.
⁷ Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA.
⁸ Institute of Genetics and Biophysics, National Research Council, Naples, Italy.
⁹ Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
¹⁰ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
¹¹ Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
¹² McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA.
¹³ Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA.
¹⁴ Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark.
¹⁵ Quantitative Life Sciences, McGill University, Montréal, Québec, Canada.
¹⁶ Genomics Research Centre, Human Technopole, Milan, Italy.
¹⁷ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
¹⁸ Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany.
¹⁹ Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany.
²⁰ Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK.
²¹ Northeastern University, Boston, MA, USA.
²² Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
²³ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.
²⁴ Coriell Institute for Medical Research, Camden, NJ, USA.
²⁵ Department of Computer Science, University of Pisa, Pisa, Italy.
²⁶ Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark.
²⁷ Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA.
²⁸ Google, Mountain View, CA, USA.
²⁹ Barrett and O'Connor Washington Center, Arizona State University, Washington, DC, USA.
³⁰ Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA.
³¹ Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
³² Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA.
³³ National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA.
³⁴ Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, USA.
³⁵ Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, CA, USA.
³⁶ Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, CA, USA.
³⁷ Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, CA, USA.
³⁸ Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA.
³⁹ Dovetail Genomics, Scotts Valley, CA, USA.
⁴⁰ Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA.
⁴¹ Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
⁴² Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, CA, USA.
⁴³ Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.
⁴⁴ Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
⁴⁵ Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain.
⁴⁶ Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain.
⁴⁷ Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
⁴⁸ Department of Public Health Sciences, University of California, Davis, CA, USA.
⁴⁹ Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
⁵⁰ Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany.
⁵¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
⁵² Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE.
⁵³ Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE.
⁵⁴ Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
⁵⁵ Department of Human Genetics, McGill University, Montréal, Québec, Canada.
⁵⁶ Canadian Center for Computational Genomics, McGill University, Montréal, Québec, Canada.
⁵⁷ Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan.
⁵⁸ Howard Hughes Medical Institute, Chevy Chase, MD, USA.
⁵⁹ Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA.
⁶⁰ Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA. egarris5@uthsc.edu.
⁶¹ Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany. tobias.marschall@hhu.de.
⁶² Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany. tobias.marschall@hhu.de.
⁶³ Department of Genetics, Yale University School of Medicine, New Haven, CT, USA. ira.hall@yale.edu.
⁶⁴ Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA. ira.hall@yale.edu.
⁶⁵ Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA. hli@jimmy.harvard.edu.
⁶⁶ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. hli@jimmy.harvard.edu.
⁶⁷ Genomics Institute, University of California, Santa Cruz, CA, USA. bpaten@ucsc.edu.

^# Contributed equally.

PMID: 37165242
PMCID: PMC10172123
DOI: 10.1038/s41586-023-05896-x

Wen-Wei Liao et al. Nature. 2023 May.

Abstract

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals¹. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.

Conflict of interest statement

E.E.E. is a scientific advisory board (SAB) member of Variant Bio. P.F. is a member of the SABs of Fabric Genomics and Eagle Genomics. E.E.K. is a member of the SAB of Encompass Biosciences, Foresite Labs and Galateo Bio and has received personal fees from Regeneron Pharmaceuticals, 23&Me and Illumina. A.B., A.C., P.-C.C., D.E.C., G.Baid, A.K., M.N. and K.S. are employees of Google and own Alphabet stock as part of the standard compensation package.

Figures

**Fig. 1. Presenting 47 accurate and near-complete diverse diploid human genome assemblies.**
a, Selecting the HPRC samples. Left, the first two principal components of 1KG samples showing HPRC (triangles) samples, excluding HG002, HG005 and NA21309. Right, summary of the HPRC sample subpopulations (three letter abbreviations) on a map of Earth as defined by the 1KG. ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest US; CHS, Han Chinese South; CLM, Colombian in Medellin, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division; KHV, Kinh in Ho Chi Minh City, Vietnam; MKK, Maasai in Kinyawa, Kenya; MSL, Mende in Sierra Leone; PEL, Peruvian in Lima, Peru; PJL, Punjabi in Lahore, Pakistan; PUR, Puerto Rican in Puerto Rico; YRI, Yoruba in Ibadan, Nigeria. b, Interchromosomal joins between acrocentric chromosome short arms. Red, the join is on the same strand; blue, otherwise. c, Total assembled sequence per haploid phased assembly. d, Assembly contiguity shown as a NGx plot. T2T-CHM13 and GRCh38 contigs are included for comparison. e, Assembly QVs showing the base-level accuracy of the maternal and paternal assembly for each sample. f, Yak-reported phasing accuracy showing the switch error percentage versus Hamming error percentage. g, Flagger read-based assembly evaluation pipeline. Coverage is calculated across the genome and a mixture model is fit to account for reliably assembled haploid sequence and various classes of unreliably assembled sequence. For each coverage block, a label is assigned according to the most probable mixture component to which it belongs: erroneous, falsely duplicated, (reliable) haploid, collapsed, and unknown. h, Reliability of the 47 HPRC assemblies using read mapping. For each sample, the left bar is the paternal and the right bar is the maternal haplotype. Regions flagged as haploid are reliable (green), constituting more than 99% on average of each assembly. The y axis is broken to show the dominance of the reliable haploid component and the stratification of the unreliable blocks. i, Assembly reliability of six types of repeats. AlphaSat, alpha satellites; HSat2/3, human satellites 2 and 3. j, Completeness of the HPRC assemblies relative to T2T-CHM13. The number of reference bases covered by none, by one, by two or by more than two alignments are included.
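
As an illustration of the NGx contiguity measure in panel d, the following is a minimal sketch rather than the consortium's code. It assumes the usual definition of NGx: the length of the contig at which the cumulative length of contigs, sorted longest first, reaches x% of an assumed genome size. The function name `ngx` and the toy contig lengths are hypothetical.

```python
def ngx(contig_lengths, genome_size, x):
    """Return NGx for one assembly.

    NGx is taken here as the length of the contig at which the running
    total of contig lengths (longest first) first reaches x percent of
    genome_size. Returns 0 if the assembly never reaches that total.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    target = genome_size * x / 100.0
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


# Toy assembly (hypothetical lengths) measured against a genome size of 200.
contigs = [50, 40, 30, 20, 10]
curve = [(x, ngx(contigs, 200, x)) for x in (10, 25, 50, 75)]
```

Plotting `curve` for every x from 1 to 100 gives an NGx plot; NG50 is the single point at x = 50.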

**Fig. 2. Transcriptome annotation of the assemblies.**
a, Ensembl mapping pipeline results. Percentages of protein-coding and noncoding genes and transcripts annotated from the reference set in each of the HPRC assemblies. Orange points represent T2T-CHM13 for comparison. b, Frequency of gene copy number. Individual genes may have separate copy number states among genomes, and the frequency reflects 3,210 observed copy number changes among the HPRC genomes. c, Number of distinct duplicated genes or gene families per phased assembly relative to the number of duplicated genes annotated in GRCh38 (n = 152). The GRCh38 gene duplications reflect families of duplicated genes, whereas the counts in other genomes reflect gene duplication polymorphisms. The assemblies are colour coded according to their population of origin. d, The top 25 most commonly CNV genes or gene families in the HPRC assemblies out of all 1,115 duplicated genes, ordered by the number of samples with additional copies relative to GRCh38. Grey bars, the number of samples with additional copies. Blue circles, the number of additional copies per sample, with the size of the circle proportional to the number of samples. e, The top 30 most individually copied CNV genes or gene families in the HPRC assemblies, ordered by total number of additional copies observed. Blue circles, the number of additional copies per sample. Grey bars, the total number of additional copies summed over the samples. f, Dotplot illustrating haplotype-resolved *GPRIN2* gains in the HG01361 assembly relative to GRCh38. g, Dotplot illustrating *SPDYE2*–*SPDYE2B* haplotype resolved gains within a tandem duplication cluster of the HG00621 assembly relative to GRCh38.

**Fig. 3. Pangenome graphs represent diverse variation.**
a, A pangenome variation graph comprising two elements: a sequence graph, the nodes of which represent oriented DNA strings and bidirected edges represent the connectivity relationships; and embedded haplotype paths (coloured lines) that represent the individual assemblies. b, Small variant sites in pangenome graphs stratified by the variant type and by the number of alleles at each site. MNP, multinucleotide polymorphism. c, SV sites in the pangenome graphs stratified by repeat class and by the number of alleles at each site. Other TE, a site involving mixed classes of transposable elements (TEs). VNTR, variable-number tandem repeat, a tandem repeat with the unit motif length ≥7 bp. STR, short tandem repeat, a tandem repeat with the unit motif length ≤6 bp. Other LCR, low-complexity regions with mixed VNTR and STR and low-complexity regions without a clear VNTR or STR pattern. Other repeat, a site involving mixed classes of repeats. SegDup, segmental duplication. Low repeat, a small fraction of the longest allele in a site involving repeats. d, Pangenome minor AF (MAF) spectrum for biallelic SNP, VNTR, L1 and Alu variants in the MC and PGGB graphs. **e,f**, Number of autosomal small variants per sample (e) and SVs per haplotype (f) in the pangenome. Variants were restricted to the Dipcall-confident regions. Samples are organized by 1KG populations. g, Pangenome growth curves for MC (left) and PGGB (right). Depth measures how often a segment is contained in any haplotype sequence, whereby core is present in ≥95% of haplotypes, common is ≥5%. h, Small variants in the GIAB (v.3.0) ‘easy’ regions annotated with AFs from gnomAD (v.3.1.2).
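
The depth classes in panel g can be sketched in a few lines. This is a hedged illustration, not the method used to draw the growth curves: it assumes each haplotype is given as a list of segment identifiers, counts a segment once per haplotype, and applies the ≥95% (core) and ≥5% (common) thresholds from the caption. The label `rare` for everything below 5%, the function name and the toy paths are assumptions introduced here.

```python
def classify_segments(haplotype_paths, core_min=0.95, common_min=0.05):
    """Label each graph segment by the fraction of haplotypes containing it.

    haplotype_paths maps a haplotype name to the list of segment IDs on its
    path. Thresholds follow the caption: core >= 95%, common >= 5%.
    The 'rare' label for the remainder is an assumption of this sketch.
    """
    n_haplotypes = len(haplotype_paths)
    counts = {}
    for path in haplotype_paths.values():
        # A haplotype that loops through a segment still counts once.
        for segment in set(path):
            counts[segment] = counts.get(segment, 0) + 1

    labels = {}
    for segment, count in counts.items():
        fraction = count / n_haplotypes
        if fraction >= core_min:
            labels[segment] = "core"
        elif fraction >= common_min:
            labels[segment] = "common"
        else:
            labels[segment] = "rare"
    return labels


# Toy panel of 40 haplotypes: s1 in all, s2 in 10, s3 in only one.
paths = {
    f"h{i}": ["s1"] + (["s2"] if i < 10 else []) + (["s3"] if i == 0 else [])
    for i in range(40)
}
labels = classify_segments(paths)
```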

**Fig. 4. Pangenome graph evaluation.**
**a,b**, Precision and recall of autosomal small variants (a) and SVs (b) in the pangenomes relative to consensus variant sets. Small variants are compared to HiFi–DeepVariant calls. SVs are compared to the consensus of six reference-based SV callers (Methods). Comparisons are restricted to the Dipcall-confident regions and then stratified by the GIAB (v.3.0) genomic context. c, Average SV precision, recall and frequency in the Dipcall-confident regions stratified by length in the MC (top) and PGGB (bottom) graphs relative to consensus SV sets. The histogram bin size is 50 bp for SVs <1 kb and 500 bp for SVs ≥1 kb.
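
For readers unfamiliar with the two metrics, a minimal sketch of precision and recall against a truth set follows. It assumes variants are compared by exact match on a (chromosome, position, ref, alt) key, which is a simplification: SV comparison in practice typically allows approximate matches. The function name and the toy variants are hypothetical.

```python
def precision_recall(called, truth):
    """Compute precision and recall for two sets of variant keys.

    Each variant is a hashable key such as (chrom, pos, ref, alt).
    Exact-match comparison is an assumption of this sketch.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, recall


# Toy call set and truth set sharing two of their variants.
called = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "GA")]
truth = [("chr1", 200, "C", "T"), ("chr1", 300, "G", "GA"),
         ("chr1", 400, "T", "C"), ("chr1", 500, "AT", "A")]
precision, recall = precision_recall(called, truth)
```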

**Fig. 5. Visualizing complex pangenome loci.**
a–c, Structural haplotypes of *RHD* and *RHCE* from the MC graph. Locations of *RHD* and *RHCE* within the graph (a). The colour gradient is based on the precise relative position of each gene; green, head of a gene; blue, end of a gene. The lines alongside the graph are based on the approximate position of gene bodies, including exons and transcription start sites. Different structural haplotypes take different paths through the graph (b). The colour gradient and lines show the path of each allele; red, start of a path; blue, end of a path. Frequency and linear structural visualization of all structural haplotypes called by the graph among 90 haploid assemblies (c). Asterisks indicate newly discovered haplotypes. d–f, Structural haplotypes of *HLA-A* from the PGGB graph, visualized using the same conventions as a–c. del, deletion; ins, insertion; inv, inversion.

**Fig. 6. Performance gains for pangenome-aided analysis of short-read WGS data.**
a,b, Precision–recall curves showing the performance of different combinations of linear reference and various mappers and variant callers evaluated against the GIAB (v.4.2.1) HG005 benchmark (a) and the challenging medically relevant genes (CMRG; v.1.0) benchmark (b). Giraffe uses the MC pangenome graph, BWA-MEM uses GRCh38 and Dragen Graph uses GRCh38 with additional alternative haplotype sequences. c, Comparison of AFs observed from the PanGenie genotypes for all 2,504 unrelated 1KG samples and the AFs observed across 44 of the HPRC assembly samples in the MC graph. The PanGenie genotypes include all variants contained in the filtered set (28,433 deletions, 84,755 insertions, 32,431 other alleles). d, Number of SVs present (genotype 0/1 or 1/1) in each of the 3,202 1KG samples in the filtered HPRC genotypes (PanGenie) after merging similar alleles (n = 100,442 SVs), the HGSVC lenient set (n = 52,659 SVs) and the 1KG Illumina calls (n = 172,968 SVs) in GIAB regions. In the box plots, lower and upper limits represent the first and third quartiles of the data, the white dots represent the median and the black lines mark minima and maxima of the data points. e, Length distribution of SV insertions and SV deletions contained in the filtered HPRC genotypes (PanGenie), the HGSVC lenient set and the 1KG Illumina calls. Only variants with a common AF > 5% across the 3,202 samples were considered.

**Extended Data Fig. 1. Characterizing uncovered reference bases using peri/centromeric annotation and evaluating the completeness of different satellite families.**
We characterized the regions not covered by the assembly alignments to the T2T-CHM13 (v.2.0) reference and also investigated the completeness of the peri/centromeric satellites across all HPRC assemblies. We characterized these regions using the peri/centromeric annotation available for the T2T-CHM13 (v.2.0) reference. We made separate bar plots for male and female samples to exclude chromosome X for the paternal assemblies of male samples and exclude chromosome Y for all other assemblies. Panels a and b indicate that on average ~90% of the uncovered bases are located in peri/centromeric regions with the active/inactive alpha satellites and human satellite 3 comprising ~50% of these bases, mainly due to their highly repetitive composition and also higher frequency compared to other satellites. Other centromeric satellites, centromeric transition regions, and rDNA arrays accounted for another ~40% of the uncovered bases on average. Panels c and d display the average lengths of uncovered regions located within each satellite family. Panels e and f show what percentage of each satellite family was covered by at least one assembly alignment. The most complete centromeric regions (~90% coverage) are divergent/monomeric alpha satellites, gamma satellites and centromeric transition regions. The rDNA arrays have been covered by ~8% on average, which made them the least completely assembled repeat arrays.

**Extended Data Fig. 2. Segmental duplication reliability.**
a, Average number of Mbp per haplotype of correctly or incorrectly assembled SDs lifted from T2T-CHM13 (v.2.0). b, The features of the most identical and longest overlapping SDs for each type of assembly error calculated in 5 kbp windows.

**Extended Data Fig. 3. The differences in pangenome graph construction methods for Minigraph, MC, and PGGB.**
a, Two haplotypes (H₁ and H₂) vary in copy number of a chromosomal segment S. The S₁, S₂, and S₃ segments are highly similar with only a SNP or a small indel. b, Pangenome graph structures for Minigraph, MC, and PGGB. Minigraph used H₁ as an initial backbone and then augmented with SVs (≥50 bp) from H₂, such that the SNP in S₂ is not represented in the pangenome graph. MC added small variants (<50 bp) to the pangenome graph constructed by Minigraph. PGGB used a symmetric, all-by-all alignment of haplotypes to build a pangenome graph whose structure is not affected by the order of inputs (unlike Minigraph and MC). The critical difference in graph construction is that, due to ambiguous pairwise relationships of paralogs, PGGB tends to collapse copy-number polymorphic loci like segmental duplications and VNTRs into a single copy through which haplotypes loop, while Minigraph and MC do not.

**Extended Data Fig. 4. HiFi read depth of on- and off-target edges in the MC graph.**
Left: fraction of reads aligned to the pangenome graph after filtering low-quality alignments. Middle: read depth distribution of on-target edges. Right: read depth distribution of off-target edges. Samples are sorted by sequencing coverage (Supplementary Table 1).

**Extended Data Fig. 5. Gene mapping in the pangenome graphs.**
The first three show the percentage of protein-coding genes from GENCODE (v.38) able to be mapped in the gene annotation sets from Ensembl, CAT run on the MC graph based on GRCh38, and CAT run on the PGGB graph. The second three show the percentage of noncoding genes from GENCODE (v.38) able to be mapped on the same annotation sets.

**Extended Data Fig. 6. Structural haplotypes of *CYP2D6* and *CYP2D7* from the MC graph.**
a, Locations of *CYP2D6* and *CYP2D7* within the graph. The colour gradient is based on the precise relative position of each gene; green, head of a gene; blue, end of a gene. b, Different structural haplotypes take different paths through the graph. The colour gradient and lines show the path of each allele; red, start of a path; blue, end of a path. c, Frequency and linear structural visualization of all structural haplotypes called by the graph among 90 haploid assemblies.

**Extended Data Fig. 7. Performance comparison of pangenome-based variant calling and read mapping across populations.**
a, Number of variants with at least one alternate allele (i.e. excluding homozygous for the reference allele) for each of the 1KG samples. The number of variants in the 1KG callset (x-axis) is compared to the variants found when aligning reads to the HPRC pangenome and calling variants with DeepVariant (y-axis). Points (samples) are coloured by their super-population label from the 1KG. b, The proportion of mapped reads that align perfectly (y-axis) is shown for a subset of samples from the 1KG, ordered by the number of variants called (x-axis). Two mapping approaches are compared: mapping short reads to GRCh38 with BWA (green); mapping to the HPRC pangenome with Giraffe (orange). The samples were selected to span the x-axis.

**Extended Data Fig. 8. Improved genotyping in the challenging medically-relevant gene *RHCE*.**
a, Gene annotation of part of the *RHCE* gene. b, Genotyping performance in this region for three approaches (horizontal panels). The top panel, using the HPRC pangenome, shows the best performance with most variants being true positives (TP, blue points) based on the CMRG (v.1.0) truth set, while the other methods have a higher number of false negatives (FN, red points). c, Allele frequency across 2,504 unrelated individuals of the 1KG. The HPRC-Giraffe-DeepVariant calls show higher frequencies. In particular, the gene-converted alleles, at about 25.406-25.410 Mbp, are observed at ~25% frequency, similar to estimates from the HPRC haplotypes (Fig. 5a–c). **d,e**, A pangenomic view of the gene-converted region showing 1 of 4 haplotypes in the HPRC pangenome supporting the non-reference alleles. The inclusion of this haplotype in the HPRC pangenome enables short sequencing reads, here from HG002, to map along this gene-converted haplotype.

**Extended Data Fig. 9. Leave-one-out experiment.**
A leave-one-out experiment was conducted by repeatedly removing one of the assembly samples from the panel VCF and genotyping it based on the remaining samples. Plots show the resulting weighted genotype concordances for different variant allele classes. a, Weighted genotype concordances are stratified by graph complexity: biallelic regions of the MC graph include only bubbles with two branches, and multiallelic regions include all bubbles with more than two different alternative paths defined by the 88 haplotypes. b, Results of the same experiment stratified by different genomic regions defined by the GIAB.

**Extended Data Fig. 10. Additional applications supported by the pangenome reference.**
a, Performance of read alignment in VNTR regions using the MC graph versus GRCh38. All statistics are expressed relative to the total number of reads simulated from each genome. b, Performance of RNA-seq read alignment. Mapping rate and false discovery rate are stratified by mapping quality, producing the curves shown. The MC graph is compared to a graph derived from the 1KG variant calls and to GRCh38. Each reference is augmented with splice junctions. vg mpmap was used to map to the graphs, and STAR was used to map to the linear reference. c, Proportion of all ChIP-seq peaks that are called only in the MC graph. Each data point represents samples that were assayed for H3K4me1 or H3K27ac histone marks, or for chromatin accessibility using ATAC-seq. d, H3K4me1 peaks that overlap an SV for which the sample is heterozygous. The reads within each peak are partitioned between the SV allele and the reference allele. The red boundary represents regions where a binomial test assigns a peak to the SV allele, both alleles, or the reference allele.
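The allele assignment described in panel d can be illustrated with a minimal sketch. The caption states only that a binomial test is used; the null proportion of 0.5 (equal read support for both alleles at a heterozygous site), the 0.05 significance threshold, and the function names `binomial_two_sided_p` and `assign_peak` are assumptions made here for illustration and are not taken from the paper.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under Binomial(n, p).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so that ties are not lost to rounding.
    threshold = pmf[k] * (1 + 1e-7)
    return min(1.0, sum(q for q in pmf if q <= threshold))


def assign_peak(sv_reads: int, ref_reads: int, alpha: float = 0.05) -> str:
    """Assign a peak to an allele from its partitioned read counts.

    alpha is an assumed threshold, not a value from the paper.
    """
    n = sv_reads + ref_reads
    if n == 0:
        return "undetermined"
    # Null hypothesis (assumed): reads are split 50/50 between the alleles.
    if binomial_two_sided_p(sv_reads, n) >= alpha:
        return "both alleles"
    return "SV allele" if sv_reads > ref_reads else "reference allele"
```

Under these assumptions, a peak with 18 reads on the SV allele and 2 on the reference allele would be assigned to the SV allele, whereas an even 10/10 split would be assigned to both alleles.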

**Extended Data Fig. 11. Number of SVs per sample in the HPRC PanGenie filtered set as well as the 1KG Illumina calls for all 3,202 1KG samples.**
Samples are coloured by superpopulation. The left plot excludes the African superpopulation, whereas the right plot shows the same results including the African samples and the assembly samples present in the graph (marked by a black circle).


Comment in

Human pangenome supports analysis of complex genomic regions.
Massarat A, Gymrek M, McStay B, Jónsson H. Nature. 2023 May;617(7960):256-258. doi: 10.1038/d41586-023-01490-3. PMID: 37165235. No abstract available.
New Genomic Sequencing Resource Could Improve Care.
[No authors listed]. Cancer Discov. 2023 Jul 7;13(7):1506-1507. doi: 10.1158/2159-8290.CD-NB2023-0042. PMID: 37249320.
