Nat Methods. 2015 Aug;12(8):780-6.
doi: 10.1038/nmeth.3454. Epub 2015 Jun 29.

Assembly and diploid architecture of an individual human genome via single-molecule technologies

Matthew Pendleton et al. Nat Methods. 2015 Aug.

Abstract

We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
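The scaffold N50 quoted in the abstract is a length-weighted median: sort the scaffolds from longest to shortest, and the N50 is the length of the scaffold at which the running total first reaches half of the assembly size. The following is a minimal sketch, assuming the scaffold lengths are already available as a list of integers. The function name `n50` and the toy lengths are illustrative and do not come from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down; the length at which the running
    # sum first reaches half of the total is the N50.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy lengths (arbitrary units), not data from the paper.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 80 + 70 = 150 is half of 300, so this prints 70
```

Because the statistic is weighted by length, a handful of very long scaffolds can dominate it, which is why it is usually reported alongside the total assembly size.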

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

The authors declare competing financial interests: details are available in the online version of the paper.

Figures

Figure 1
De novo assembly and scaffold layout. PacBio sequence contigs, genome maps and scaffold V2 are shown in order from the top of each chromosome, with the hg19 reference at the bottom. Possible chimeras identified by comparison of sequence contigs and genome maps (but not those that persist in the V2 scaffold) are indicated in cyan (flagged assembly). Ideogram and Giemsa banding for hg19 are plotted at the bottom of each chromosome in grayscale, with centromeres highlighted in light red. ‘N’ gaps in hg19 are shaded with gray in the background of all assemblies and scaffolds.
Figure 2
Tandem-repeat detection from single molecules predicts a large divergence from reference. (a) Tandem-repeat span comparisons between predicted NA12878 alleles and hg19. (b) Length comparisons of each predicted heterozygous tandem-repeat locus in NA12878. (c) Copy-number difference at the LPA kringle domain (light red) between NA12878 (blue) and hg19 reference (green; chr6, chromosome 6). Spanning molecules (yellow) confirm that an expansion has occurred. In the molecule pileup view, dark blue represents mapped molecule labels, and red represents unmapped labels. Each tick on the scale represents a distance of 50 kb. (d) Left, a dot plot showing an expansion within a tandem repeat versus hg19. Right, a self-self dot plot of NA12878 indicates that the insertion contains repeated sequences that diverge from the original AAG repeat.
Figure 3
De novo maps identify large structural variants. (a,b) Alignment of genome maps (blue) to in silico maps of hg19 (green) for a 206-kb insertion at 5p13.2 (a) and a 577-kb inversion at 1q32.1 (b). Below each event, all of the individual long molecules spanning the region of interest are shown to confirm homozygosity of the predicted event. The insertion locus in a and the boundaries of the predicted inversion in b are highlighted in light red. The predicted inversion (and resolution of gapped sequences) is consistent with the updated hg38 assembly.
Figure 4
CLRs highlight multiple colocated SVs and complex SV structures. Dot plots of a single error-corrected read (y axis) versus the corresponding reference regions (x axis) for complex events in NA12878. Above each dot plot are gene annotations, known repeats (including short interspersed elements (SINEs), long interspersed elements (LINEs) and long terminal repeats (LTRs)) and other biologically relevant features. (a) Chromosome 1 (Chr1):44058631–44061135, inversion with a trailing insertion and deletion (supported by 17/31 spanning raw reads). (b) Chr5:147552243–147555736, inversion with preceding and trailing deletions (20/34). The larger deletion eliminates an exon in SPINK14. (c) Chr4:146613545–146616773, inversion with potential duplication (6/11). (d) Chr5:17711870–17715038, proximally duplicated substring (10/26). (e) Chr1:143664130–143668633, a complex region with multiple events (9/34), including deletion of neighboring AluSG and AluU elements, expansion of a small tandem repeat and insertion of an AluY element at a nearby location.
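The dot plots described in the captions above place a point wherever a stretch of the read matches a stretch of the reference. Collinear sequence then appears as a diagonal, an inversion as an anti-diagonal, and insertions, deletions or duplications as breaks or repeated segments. The following is a minimal sketch of one common way to build such a plot from exact k-mer matches. It is not the authors' pipeline, and the function names, the default `k` and the toy sequences are assumptions made for illustration.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_dots(reference, read, k=4):
    """Return (forward, reverse) lists of (ref_pos, read_pos) k-mer matches.

    Forward matches trace a diagonal for collinear sequence; reverse
    (reverse-complement) matches trace an anti-diagonal, as for an inversion.
    """
    # Index every k-mer start position in the reference.
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)

    forward, reverse = [], []
    for j in range(len(read) - k + 1):
        kmer = read[j:j + k]
        forward.extend((i, j) for i in index.get(kmer, ()))
        reverse.extend((i, j) for i in index.get(reverse_complement(kmer), ()))
    return forward, reverse


# Hypothetical toy sequences: the read carries an inverted middle segment.
ref = "ACGTTAGGCATCCGATTGACCA"
read = ref[:8] + reverse_complement(ref[8:14]) + ref[14:]
fwd, rev = kmer_dots(ref, read)
print(len(fwd), "forward dots;", len(rev), "reverse dots")
```

Plotting `fwd` and `rev` in two colors, with reference position on the x axis and read position on the y axis, gives the layout used in the figure panels. Real reads would need a larger `k` and some tolerance for sequencing errors.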
