Nat Genet. 2019 Jan;51(1):30-35.
doi: 10.1038/s41588-018-0273-y. Epub 2018 Nov 19.

Assembly of a pan-genome from deep sequencing of 910 humans of African descent

Rachel M Sherman et al. Nat Genet. 2019 Jan.

Erratum in

  • Author Correction: Assembly of a pan-genome from deep sequencing of 910 humans of African descent.
    Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, Boorgula MP, Chavan S, Vergara C, Ortega VE, Levin AM, Eng C, Yazdanbakhsh M, Wilson JG, Marrugo J, Lange LA, Williams LK, Watson H, Ware LB, Olopade CO, Olopade O, Oliveira RR, Ober C, Nicolae DL, Meyers DA, Mayorga A, Knight-Madden J, Hartert T, Hansel NN, Foreman MG, Ford JG, Faruque MU, Dunston GM, Caraballo L, Burchard EG, Bleecker ER, Araujo MI, Herrera-Paz EF, Campbell M, Foster C, Taub MA, Beaty TH, Ruczinski I, Mathias RA, Barnes KC, Salzberg SL. Nat Genet. 2019 Feb;51(2):364. doi: 10.1038/s41588-018-0335-1. PMID: 30647471

Abstract

We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
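
The "~10% more DNA" figure can be sanity-checked with simple arithmetic. The sketch below uses the two totals quoted in the abstract; the reference genome length is an assumption (roughly 3.1 Gb for GRCh38) that the abstract does not state, and the function names are illustrative only.

```python
# Totals quoted in the abstract.
NOVEL_BP = 296_485_284      # novel sequence, in base pairs
NUM_CONTIGS = 125_715       # distinct contigs

# Assumption, not stated in the abstract: GRCh38 spans roughly 3.1 Gb.
ASSUMED_REFERENCE_BP = 3_100_000_000


def percent_extra(novel_bp, reference_bp):
    """Novel sequence as a percentage of the reference length."""
    return 100.0 * novel_bp / reference_bp


def mean_contig_length(total_bp, n_contigs):
    """Average contig length in base pairs."""
    return total_bp / n_contigs


extra = percent_extra(NOVEL_BP, ASSUMED_REFERENCE_BP)
mean_len = mean_contig_length(NOVEL_BP, NUM_CONTIGS)
print(f"{extra:.1f}% additional sequence relative to the assumed reference")
print(f"{mean_len:.0f} bp mean contig length")
```

Under the assumed reference length the ratio comes out a little below 10%, consistent with the approximate figure in the abstract.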

Conflict of interest statement

Competing Interests. The authors declare no competing financial interests.

Figures

Figure 1. Overview of methods.
Raw reads are aligned to GRCh38 and unaligned reads assembled with MaSuRCA. Assembled contigs are then filtered for contaminants with Centrifuge and contigs shorter than 1 kb are removed (blue box). Assembled contigs are placed based on their mate’s alignment locations when possible, by checking if over 95% of mates align to the same location. If such a placement is found, the exact breakpoint is determined via a nucmer alignment to the region for each end of the contig (yellow box). Contig placement locations are then compared between all individuals, nearby placements are clustered, and a representative is chosen. All contigs are then aligned to the representatives to determine which samples contain a given placed insertion. Contigs in or aligning to placed clusters are removed from the unplaced set, and the remaining unplaced contigs are aligned to one another with nucmer to remove redundancy and result in a final nonredundant unplaced set of contigs (purple box).
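
The length filter and the mate-based placement check described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the 1 kb and 95% values come from the caption, while the window size used to decide what counts as "the same location", the input format, and all function names are assumptions.

```python
from collections import Counter

MIN_CONTIG_LEN = 1000    # from the caption: contigs shorter than 1 kb are removed
MATE_AGREEMENT = 0.95    # from the caption: over 95% of mates must agree
WINDOW_BP = 10_000       # hypothetical: bin width for "the same location"


def keep_contig(sequence):
    """Length filter: keep contigs of at least 1 kb."""
    return len(sequence) >= MIN_CONTIG_LEN


def place_by_mates(mate_hits, window_bp=WINDOW_BP, threshold=MATE_AGREEMENT):
    """Return (chrom, window_start) if more than `threshold` of the mates
    fall in one fixed-width window, otherwise None.

    mate_hits is a list of (chrom, pos) tuples, one per aligned mate.
    Fixed bins are a simplification: mates that straddle a bin boundary
    are split between two windows and may fail the vote.
    """
    if not mate_hits:
        return None
    bins = Counter((chrom, pos // window_bp) for chrom, pos in mate_hits)
    (chrom, index), count = bins.most_common(1)[0]
    if count / len(mate_hits) > threshold:
        return chrom, index * window_bp
    return None


# Hypothetical input: 97 mates near chr1:50,000 and 3 stray mates on chr7.
hits = [("chr1", 50_000 + i) for i in range(97)] + [("chr7", 123_456)] * 3
placement = place_by_mates(hits)
```

A placement found this way only narrows the contig to a region; per the caption, the exact breakpoint is then determined by a separate nucmer alignment of each contig end, which this sketch does not attempt.
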
Figure 2. African pan-genome contig locations.
Map of the human genome showing the locations of all African pan-genome contigs, for those that could be placed accurately along one of the chromosomes. Yellow lines represent an intergenic location; blue lines represent insertion points with RNA but not exonic annotations, and red lines indicate intersections within exons. All exon-intersecting insertions are labeled with the gene name. mRNA and lncRNA gene names are reported in Supplementary Table 4. In some cases insertions are too close together for lines to be resolved; when this occurs within exons, gene names are listed in order by chromosome position. Line width is not to scale.
Figure 3.
An example of an alignment which does not meet the 50% coverage, 80% identity threshold for a “reasonably good” alignment to GRCh38. The APG contig is shown at the top, with the best consistent alignments to GRCh38 in the middle. The three constituent alignments (blue, red, and yellow segments) cover 801 bases, just under 25% of the contig, with a cumulative weighted identity of 87.9%. CAAPA_113686 has a single near perfect alignment to a Chinese HX1 contig (delineated by dotted lines) covering over 80% of CAAPA_113686 at over 90% identity. The APG contig also aligns very well to the Korean assembly (not shown).
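
The threshold test in this caption can be sketched as a coverage and length-weighted identity check. The 50% and 80% cutoffs come from the caption; requiring both at once, the interval representation, the function names, and the example coordinates are assumptions. The example values are chosen only to mimic the caption's totals (801 covered bases, under 25% coverage) and are not the real alignments for CAAPA_113686.

```python
def covered_bases(intervals):
    """Count contig bases covered by at least one alignment.
    Intervals are half-open (start, end) pairs on the contig."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)   # merge overlapping alignments
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def weighted_identity(alignments):
    """Percent identity averaged over alignments, weighted by aligned length."""
    total_len = sum(end - start for start, end, _ in alignments)
    return sum((end - start) * ident for start, end, ident in alignments) / total_len


def is_reasonably_good(alignments, contig_len, min_cov=0.50, min_ident=80.0):
    """Assumed rule: both the coverage and the identity cutoff must be met."""
    if not alignments:
        return False
    coverage = covered_bases([(s, e) for s, e, _ in alignments]) / contig_len
    return coverage >= min_cov and weighted_identity(alignments) >= min_ident


# Hypothetical alignments: (contig_start, contig_end, percent_identity).
example = [(0, 300, 90.0), (1000, 1301, 85.0), (2500, 2700, 89.0)]
HYPOTHETICAL_CONTIG_LEN = 3300
```

With these illustrative inputs the identity clears 80% but the coverage stays under 25%, so the contig fails the test on coverage alone, which is the situation the figure depicts.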

Comment in

  • Pan-African genome.
    Rusk N. Nat Methods. 2019 Feb;16(2):143. doi: 10.1038/s41592-019-0317-y. PMID: 30700891. No abstract available.
