Review
Annu Rev Genomics Hum Genet. 2021 Aug 31:22:81-102.
doi: 10.1146/annurev-genom-120120-081921. Epub 2021 Apr 30.

The Need for a Human Pangenome Reference Sequence

Karen H Miga et al. Annu Rev Genomics Hum Genet. 2021.

Abstract

The reference human genome sequence is inarguably the most important and widely used resource in the fields of human genetics and genomics. It has transformed the conduct of biomedical sciences and brought invaluable benefits to the understanding and improvement of human health. However, the commonly used reference sequence has profound limitations, because across much of its span, it represents the sequence of just one human haplotype. This single, monoploid reference structure presents a critical barrier to representing the broad genomic diversity in the human population. In this review, we discuss the modernization of the reference human genome sequence to a more complete reference of human genomic diversity, known as a human pangenome.

Keywords: Human Genome Project; clinical genomics; diversity; pangenome.

Figures

Figure 1
Historic progress over the last 20 years has enabled the launch of the human pangenome reference initiative. (a) Genomic representation of the current reference human genome sequence, as determined by Green et al. (61), demonstrated that the majority of the reference human genome sequence (65%) is derived from a single bacterial artificial chromosome library (RPCI-11). Further evaluation of the reference sequence (GRCh37) revealed that it reflected DNA from one male donor who was 37% African and 57% European. (b) This underrepresentation of diversity is also reported in genome-wide association databases, where the vast majority of data are from people of European ancestry (78%). (c) Following the first release of the reference human genome sequence, there have been several large initiatives to prioritize the inclusion of more diverse participants in human genomic research. In the past 20 years, several efforts have been launched to expand both domestic and international surveys of genomic diversity as well as to develop data infrastructure and governance, with the goal of improving the implementation of actionable findings. These big-science investments and initiatives have enabled the launch of the Human Pangenome Reference Consortium, a group responsible for leading a five-year initiative that aims to enhance the reference human genome sequence in a fashion that better represents common haplotypes in the human population. Panel a adapted from Reference (127); panel b adapted from Reference (100).
Figure 2
Genome graphs are useful in representing differences in genomic structure. Genomic data from a population can be organized into an edge-based sequence variation graph. (a) The result of sampling from a population of diverse individuals and whole-genome sequencing is a database of haplotype-phased assemblies (ideally complete, error-free, chromosome-scale sequence assemblies) for each individual. (b) These references can be collectively studied as a graph, where nodes represent sequence information and edges describe the ordering of these sequences in each assembled haplotype. Studying these data reveals sites of copy number variants (CNVs) (regions of insertions or deletions) and sites of single-nucleotide polymorphisms (SNPs), and a data structure that records 5′ and 3′ directions makes it possible to track inversions. In addition to reporting and representing these events, there is an opportunity to determine allelic frequency and population-based association of variants.
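The graph described in this caption can be illustrated with a small sketch. It is a toy model under stated assumptions, not the consortium's data structure or any tool's format: the node sequences, the haplotype names (`hapA` to `hapD`), and the helper functions are all hypothetical. Nodes hold sequence, each haplotype is a walk over oriented nodes, and edges are derived from consecutive steps in those walks. A SNP appears as two alternative nodes, a deletion as a skipped node, and an inversion as a node traversed in the reverse (3′ to 5′) orientation.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# Hypothetical toy data: node id -> sequence carried by that node.
nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA", 5: "TTC", 6: "AA"}

# Hypothetical haplotype walks: lists of (node id, orientation).
paths = {
    "hapA": [(1, "+"), (2, "+"), (4, "+"), (5, "+"), (6, "+")],
    "hapB": [(1, "+"), (3, "+"), (4, "+"), (5, "+"), (6, "+")],  # SNP: node 3 replaces node 2
    "hapC": [(1, "+"), (2, "+"), (4, "+"), (6, "+")],            # deletion: node 5 skipped
    "hapD": [(1, "+"), (2, "+"), (4, "+"), (5, "-"), (6, "+")],  # inversion: node 5 reversed
}

def spell(path):
    """Reconstruct the haplotype sequence by walking its oriented nodes."""
    return "".join(
        nodes[n] if strand == "+" else reverse_complement(nodes[n])
        for n, strand in path
    )

def edge_counts(all_paths):
    """Count how many haplotypes use each oriented node-to-node adjacency.

    Simplification: adjacencies are stored as directed pairs, so a walk and
    its reverse-strand equivalent are not merged into one bidirected edge.
    """
    counts = Counter()
    for path in all_paths.values():
        for step, next_step in zip(path, path[1:]):
            counts[(step, next_step)] += 1
    return counts

def node_frequency(node_id, all_paths):
    """Fraction of haplotypes whose walk visits the node in either orientation."""
    carriers = sum(any(n == node_id for n, _ in p) for p in all_paths.values())
    return carriers / len(all_paths)

if __name__ == "__main__":
    for name, path in paths.items():
        print(name, spell(path))
    print("frequency of node 3 (alternate SNP allele):", node_frequency(3, paths))
```

In this toy set, the frequency of a node across haplotype walks stands in for the allelic frequency mentioned in the caption; a real pangenome would compute it over many phased assemblies.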
Figure 3
The Human Pangenome Reference Consortium is conducting a big-science initiative that relies on the collaborative organization of a large, multidisciplinary team of geneticists, computational biologists, policy experts, and ethicists. The production effort can be subdivided into six focused areas. (a) First is population representation and sampling, where guidance on participant inclusion is provided by population geneticists, ELSI oversight, and a consent model that is transparent and respectful to communities. Participants in this initiative will provide blood, which will be used to establish cell lines. (b) Cell lines will be used for data production across multiple centers and a broad range of DNA sequencing technologies. The blue circles indicate three production centers involved in the release of long-read HiFi data (Pacific Biosciences), the green circle indicates the nanopore production center for ultra-long data (Oxford Nanopore), the purple circle indicates the production center for Hi-C data (Omni-C, Dovetail Genomics, which has a company partnership with Illumina), and the yellow circle indicates the location of our cell line biorepository (the Coriell Institute for Medical Research). (c) The resulting data management will consist of open, reproducible workflows, with the goal of identifying the best combination of genomic data and computational tools to reach finished, telomere-to-telomere genome sequence assemblies (https://dockstore.org/organizations/HumanPangenome). (d) Data will be made available as soon as they are determined to be of sufficient quality and will be hosted on AnVIL (Terra workspace: https://app.terra.bio/#workspaces/anvil-datastorage/AnVILHPRC) to offer a federated data ecosystem to collaborate with other genomic resources through the adoption of FAIR principles. (e) These assemblies are used for the development of the human pangenome reference data structure (illustrated here as a graph), with tooling, benchmarks, and workflows (illustrating read alignments to graph) to ensure the support of standard analyses in human genetic and genomic research. We thank Jordan Eizenga, from the Genomics Institute at the University of California, Santa Cruz, for sharing the multipath graph illustration. (f) Finally, the consortium will need to form global partnerships, engage in outreach, and provide education to ensure that this resource directly benefits participant communities. Abbreviations: AnVIL, Genomic Data Science Analysis, Visualization, and Informatics Lab-Space; ELSI, ethical, legal, and social implications; FAIR, findable, accessible, interoperable, reusable; HiFi, high fidelity.

References

    1. 1000 Genomes Proj. Consort. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–73
    2. 1000 Genomes Proj. Consort. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65
    3. 1000 Genomes Proj. Consort. 2015. A global reference for human genetic variation. Nature 526:68–74
    4. Abe M, Ishikawa O, Miyachi Y. 1998. Lupoid sycosis successfully treated with minocycline. Br. J. Dermatol. 138:199–200
    5. Abel HJ, Larson DE, Regier AA, Chiang C, Das I, et al. 2020. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583:83–89
