Genome Res. 2017 May;27(5):849-864.
doi: 10.1101/gr.213611.116. Epub 2017 Apr 10.

Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

Valerie A Schneider et al. Genome Res. 2017 May.

Abstract

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

Figures

Figure 1.
Summary of GRCh38 updates. (A) Chart showing issues resolved for GRCh38 on each chromosome by issue type. Each issue represents a unique assembly evaluation and corresponding curation decision. (B) Changes in placed scaffold N50 length from GRCh37 to GRCh38. Changes on Chromosomes 5, 13, 19, and Y are <55 kbp each. (C) Addition of whole-genome sequencing components (orange bars) resolves a GRCh37 gap, consolidating the split annotation of INPP5D and restoring a missing exon (asterisk) in GRCh38. The default 50-kbp gap in GRCh37 greatly overestimates the actual amount of missing sequence (∼6 kbp). (D) Schematic of a curated collapse in GRCh38 Chr 10. Clones from two incompatible haplotypes (pink and light blue) were mixed in the GRCh37 tiling path, creating a false gap and segmental duplication involving the single copy genes TMEM236 and MRC1 (top). In GRCh38 (bottom), clones from the blue haplotype have been eliminated (∼200 kbp), closing the gap and providing the correct gene content.
Figure 2.
Evaluation of assembly updates. (A,B) Plots showing the per-chromosome lengths of sequence collapse (A) and expansion (B) of the GRCh37 (green) and GRCh38 (blue) primary assembly units (from which alternate loci are excluded), based on their assembly–assembly alignment. (C) Browser view of KCNE1 on GRCh38 Chr 21. The lower panel shows a zoomed view of the top, illustrating a paralogous sequence alignment and paralogous variant (psv) overlapping SNP rs1805128 (red box), a putatively pathogenic ClinVar variant we observed remapping to multiple locations in GRCh38, due to the addition of paralogous sequence. Because previous assembly versions lack this paralog, reads may map incorrectly in this region, and the pathogenicity of the variant and associated diagnostic calls should not be based only on such analyses. (D) Plot showing the allele distribution in RP11 WGS reads for the set of GRCh37 bases located in RP11 assembly components that were flagged as putative errors because they were not observed in the 1000 Genomes phase 1 data set. (E) Ideogram showing the distribution of regions containing alternate loci scaffolds in GRCh38.
Figure 3.
NA24143 read alignments to GRCh38. (A) Schematic showing the alignment of a subset of reads unmapped on GRCh37 to GRCh38. Reads align to GRCh38 at the position of components that were added to span a GRCh37 assembly gap (orange). (B) Graph showing counts of reads uniquely mapped to unchanged regions of GRCh37 that uniquely map to nonequivalent locations in GRCh38. (C) Chart describing the GRCh38 distribution of reads from B, categorized by sequence location (same or different chromosome/scaffold) and sequence type (centromeric versus noncentromeric): (OFFCEN) movement to centromeric sequence on a different chromosome; (OFF) movement to noncentromeric sequence on a different chromosome; (ONCEN) movement to centromeric sequence on the same chromosome; (ON) movement to noncentromeric sequence on the same chromosome; (TOSCAF) movement to a noncentromeric unlocalized or unplaced scaffold; (UNCEN) movement to an unplaced scaffold containing centromere-associated sequence.
Figure 4.
Evaluation of CHM1 and CHM13 assemblies. (A) FRC error curve for CHM1 (left) and CHM13 (right) assemblies. CHM1_1.1 is provided for comparison with the CHM1 de novo assemblies. The x-axis is log-scaled. (B) FRC compression-expansion curve for CHM1 (left) and CHM13 (right) showing the distribution of mapped reads. Divergence from the center indicates compression (negative) and expansion (positive). (C) Heterozygous SNPs called on the CHM1 and CHM13 de novo assemblies, CHM1_1.1 and GRCh38 using NA12878 and CHM1 (left) and CHM13 (right) aligned FermiKit assemblies. The x-axis represents potential false positives, and the y-axis measures potential true positives; optimal assemblies appear in the upper left of the plot.
