Review
Annu Rev Genomics Hum Genet. 2024 Aug;25(1):77-104.
doi: 10.1146/annurev-genom-021623-081639. Epub 2024 Aug 6.

Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References

Dylan J Taylor et al. Annu Rev Genomics Hum Genet. 2024 Aug.

Abstract

The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.

Keywords: genetic diversity; pangenome; precision medicine; reference genome sequence; telomere-to-telomere.

Figures

Figure 1
Overview of the process of human genome sequence assembly. (a) Sequence length improvements to the human genome reference sequence over time. (b) Overview of the genome sequence assembly process. First, individual sequencing reads are generated from a sample. Then, the sequencing reads are compared with each other to identify overlaps. Overlapping reads are then merged to generate a genome sequence. (c) Overview of pangenome reference assembly and analysis. First, the pangenome is assembled from multiple individual genome sequences, revealing commonalities and differences among them. Later, sequencing reads generated from other samples can be mapped (or aligned) to the pangenome reference to detect variants and establish genotypes. (d) Applications for human genome analysis. Abbreviations: SNP, single-nucleotide polymorphism; T2T, telomere-to-telomere. Panel a adapted from Reference ; panels c and d adapted from Reference .
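The overlap-and-merge steps described for panel b can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions, not the method used for any assembly discussed here: reads are error-free, all come from the same strand, and only exact suffix-to-prefix overlaps are considered. The function names `overlap_length` and `greedy_assemble` and the `min_overlap` parameter are hypothetical.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest exact overlap.

    Toy illustration only: no sequencing errors, no reverse complements,
    and no handling of repeats longer than a read.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        # Compare every ordered pair of sequences to find the best overlap.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        # Merge the best pair, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```

The all-pairs comparison is quadratic in the number of reads, so this only conveys the idea; it is not a practical approach for real sequencing data.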
Figure 2
Overview of the first complete human genome assembly. (a) Ideogram of the T2T-CHM13v2.0 genome assembly. Regions of the assembly that are nonsyntenic with GRCh38 based on a whole-genome alignment between the two assemblies are shown in blue. (b) Breakdown of the sequence classes present in the regions of T2T-CHM13 that are nonsyntenic with GRCh38 (Y chromosome not included). (c) Mappability of the T2T-CHM13v2.0 genome based on minimum unique k-mer size, broken down by synteny with GRCh38. At each position in the genome, the minimum unique k-mer size is defined as the minimum number of bases (to the right) necessary to yield a unique sequence that does not appear elsewhere in the genome. Larger sizes imply poor mappability with short sequencing reads. (d) Performance of long- and short-read-based variant identification for a set of challenging medically relevant genes using T2T-CHM13 versus GRCh38. (e) Example of a medically relevant gene exhibiting improved mapping and variant identification using T2T-CHM13. KCNJ18 falls within a collapsed duplicated region in GRCh38, which results in excessive read depth and spurious variants being identified; this is corrected using T2T-CHM13. Abbreviations: CenSat, centromeric satellite; indel, insertion or deletion; ONT, Oxford Nanopore Technologies; RepMask, RepeatMasker; SD, segmental duplication; SNP, single-nucleotide polymorphism; T2T, telomere-to-telomere. Panel b adapted from Reference ; panels d and e adapted from Reference .
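The minimum unique k-mer size defined for panel c can be computed directly on a small string. The following is a minimal sketch of that definition, assuming a single forward-strand sequence (reverse complements are ignored) and a brute-force count of all k-mers for each k. The function name `min_unique_kmer_sizes` is hypothetical, and positions too close to the end of the sequence to ever become unique are reported as `None`.

```python
from collections import Counter
from typing import List, Optional


def min_unique_kmer_sizes(genome: str) -> List[Optional[int]]:
    """For each position, the smallest k such that the k-mer starting there
    (extending to the right) occurs exactly once in `genome`.

    Brute force, for illustration on short strings only.
    """
    n = len(genome)
    sizes: List[Optional[int]] = [None] * n
    unresolved = list(range(n))
    k = 1
    while unresolved and k <= n:
        # Count every k-mer of the current length across the sequence.
        counts = Counter(genome[p:p + k] for p in range(n - k + 1))
        still_unresolved = []
        for p in unresolved:
            if p + k > n:
                continue  # k-mer would run past the end; position stays None
            if counts[genome[p:p + k]] == 1:
                sizes[p] = k
            else:
                still_unresolved.append(p)
        unresolved = still_unresolved
        k += 1
    return sizes


if __name__ == "__main__":
    print(min_unique_kmer_sizes("ACGTACGA"))
```

In the toy sequence, positions inside the repeated `ACG` need longer k-mers than the unique `T` in the middle, which mirrors the caption's point that larger sizes imply poorer mappability with short reads.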
Figure 3
Illustrating the HPRC pangenome with an example. (a) The structural haplotypes of the CYP2D6 and CYP2D7 genes called from the Minigraph-Cactus HPRC pangenome graph. The color gradients are based on the relative positions of the genes: Green represents the head of a gene, and blue represents the end of a gene. (b) Different paths taken by different structural haplotypes in the graph. The color gradient is based on path position: Red represents the head of a path, and blue represents the end of a path. (c) Frequency and linear structural visualization of all structural haplotypes called by the Minigraph-Cactus graph. Abbreviation: HPRC, Human Pangenome Reference Consortium. Figure adapted from Reference (CC BY 4.0) with assistance from Shuangjia Lu.
Figure 4
Broader applications of pangenomes. (a) Mapping simulated RNA-sequencing reads to a spliced reference (dashed lines) or spliced pangenome (solid lines). STAR takes only known splicing information into account, while HISAT2 and the vg toolkit also further integrate genetic variants, which results in substantially fewer incorrectly mapped sequencing reads. (b) Genotyping SVs from the HGSVC catalog using different pangenome-based approaches. This panel shows wGC values in nonrepetitive regions, at different coverages (point size), for sample NA12878, which was removed from the catalog for a leave-one-out evaluation. Complex SVs are all variant sites that are not biallelic deletions or insertions. PanGenie is able to genotype the vast majority of SVs accurately. Abbreviations: HGSVC, Human Genome Structural Variation Consortium; HISAT2, Hierarchical Indexing for Spliced Alignment of Transcripts 2; STAR, Spliced Transcripts Alignment to a Reference; SV, structural variant; vg, variation graphs; wGC, weighted genotype concordance. Panel a adapted from Reference ; panel b adapted from Reference (CC BY 4.0).
Figure 5
Opportunities and needs for pangenome research. (a) Density of SVs (≥50 bp) across T2T-CHM13. The variants are sourced from the HPRC-MC VCF file (93). Colored bands indicate genomic annotations for T2T-CHM13. (b) Growth of common and individual-specific sequences within the 910 individuals of the African pangenome population (130), where common sequences are defined as sequences present in at least two samples included in the pangenome. The orange line represents the average sizes of the common sequences in a certain number of individuals after randomly sampling 1,000 times. The blue line represents the average sizes of individual-specific sequences from the same samples. (c) Overview of pangenome efforts. Abbreviations: HPRC, Human Pangenome Reference Consortium; MC, Minigraph-Cactus; SV, structural variant; T2T, telomere-to-telomere; VCF, Variant Call Format. Pangenome construction illustration adapted from Reference (CC BY 4.0); right-hand outreach and education illustration provided by Darryl Leja/National Human Genome Research Institute (public domain).
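The resampling behind the growth curves in panel b can be sketched as follows. This is an illustration under stated assumptions, not the analysis from the cited study: each individual is modeled as a set of sequence identifiers, `seq_len` maps each identifier to its length in base pairs, and the function name `growth_curve` and its parameters are hypothetical. Following the caption, a sequence counts as common if it is present in at least two sampled individuals and as individual-specific if it is present in exactly one.

```python
import random
from collections import Counter
from typing import Dict, List, Set, Tuple


def growth_curve(
    samples: Dict[str, Set[str]],
    seq_len: Dict[str, int],
    n_draws: int = 1000,
    seed: int = 0,
) -> List[Tuple[int, float, float]]:
    """Average common and individual-specific sequence size per subset size.

    Returns one tuple (n_individuals, mean_common_bp, mean_specific_bp)
    for each subset size from 1 to the number of individuals.
    """
    rng = random.Random(seed)
    names = sorted(samples)
    curve = []
    for n in range(1, len(names) + 1):
        common_total = 0
        specific_total = 0
        for _ in range(n_draws):
            subset = rng.sample(names, n)  # random subset without replacement
            # Count in how many sampled individuals each sequence occurs.
            counts = Counter(s for name in subset for s in samples[name])
            common_total += sum(seq_len[s] for s, c in counts.items() if c >= 2)
            specific_total += sum(seq_len[s] for s, c in counts.items() if c == 1)
        curve.append((n, common_total / n_draws, specific_total / n_draws))
    return curve


if __name__ == "__main__":
    toy_samples = {"A": {"s1", "s2"}, "B": {"s1", "s3"}, "C": {"s1", "s2", "s4"}}
    toy_lengths = {"s1": 100, "s2": 50, "s3": 30, "s4": 20}
    for row in growth_curve(toy_samples, toy_lengths, n_draws=100):
        print(row)
```

With a single individual nothing can be common, so the common curve starts at zero and grows as more individuals are added.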

References

    1. 1000 Genomes Proj. Consort. 2015. A global reference for human genetic variation. Nature 526(7571):68–74
    2. Abel HJ, Larson DE, Regier AA, Chiang C, Das I, et al. 2020. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583(7814):83–89
    3. Abondio P, Cilli E, Luiselli D. 2023. Human pangenomics: promises and challenges of a distributed genomic reference. Life 13(6):1360
    4. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. 2000. The genome sequence of Drosophila melanogaster. Science 287(5461):2185–95
    5. Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, et al. 2022. A complete reference genome improves analysis of human genetic variation. Science 376(6588):eabl3533