Nat Biotechnol. 2018 Apr;36(4):338-345.
doi: 10.1038/nbt.4060. Epub 2018 Jan 29.

Nanopore sequencing and assembly of a human genome with ultra-long reads

Miten Jain et al. Nat Biotechnol. 2018 Apr.

Abstract

We report the sequencing and assembly of a reference genome for the human GM12878 Utah/Ceph cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing ∼30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
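
The N50 and NG50 figures quoted above can be illustrated with a minimal sketch. It assumes the usual definitions: N50 is the length at which sequences of that size or longer hold half of the total bases, and NG50 uses half of an assumed genome size instead of half of the assembly size. The function names, the toy contig lengths, and the 3.1 Gb genome size used for the coverage estimate are illustrative assumptions, not values or code from the paper.

```python
def ng50(lengths, genome_size):
    """Length at which the longest sequences reach half of genome_size."""
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the sequences cover less than half of the assumed genome size


def n50(lengths):
    """N50 is NG50 with the total sequence length used as the genome size."""
    return ng50(lengths, sum(lengths))


def theoretical_coverage(total_bases, genome_size):
    """Sequenced bases divided by an assumed genome size."""
    return total_bases / genome_size


contigs = [8, 6, 4, 2]  # toy contig lengths, not data from the paper
print(n50(contigs))       # half of 20 is reached by the second-longest contig
print(ng50(contigs, 40))  # half of 40 is only reached by the last contig
# 91.2 Gb is from the abstract; 3.1 Gb is an assumed human genome size.
print(round(theoretical_coverage(91.2e9, 3.1e9), 1))
```

Because NG50 is tied to a fixed genome size rather than to the assembly's own total length, it allows assemblies of different total size to be compared on the same scale.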

Conflict of interest statement

M.L., N.L., J.O.G., J.T.S., J.R.T., and T.P.S. were members of the MinION access program (MAP) and have received free-of-charge flow cells and kits for nanopore sequencing for this and other studies, and travel and accommodation expenses to speak at Oxford Nanopore Technologies conferences. N.J.L. has received an honorarium to speak at an Oxford Nanopore company meeting. S.K., A.T.D., J.Q., and T.A.S. have received travel and accommodation expenses to speak at Oxford Nanopore Technologies conferences. J.T.S., J.O.G., and M.L. receive research funding from Oxford Nanopore Technologies.

Figures

Figure 1. Summary of data set.
(a) Read length N50s by flow cell, colored by sequencing center. Cells: DNA extracted directly from cell culture. DNA: pre-extracted DNA purchased from Coriell. UoB, Univ. Birmingham; UEA, Univ. East Anglia; UoN, Univ. Nottingham; UBC, Univ. British Columbia; UCSC, Univ. California, Santa Cruz. (b) Total yield per flow cell grouped as in a. (c) Coverage (black line) of GRCh38 reference compared to a Poisson distribution. The depth of coverage of each reference position was tabulated using samtools depth and compared with a Poisson distribution with lambda = 27.4 (dashed red line). (d) Alignment identity compared to alignment length. No length bias was observed, with long alignments having the same identity as short ones. (e) Correlation between 5-mer counts in reads compared to expected counts in the chromosome 20 reference. (f) Chromosome 20 homopolymer length versus median homopolymer base-call length measured from individual Illumina and nanopore reads (Scrappie and Metrichor). Metrichor fails to produce homopolymer runs longer than ∼5 bp. Scrappie shows better correlation for longer homopolymer runs, but tends to overcall short homopolymers (between 5 and 15 bp) and undercall long homopolymers (>15 bp). Plot noise for longer homopolymers is due to fewer samples available at that length.
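
The comparison in panel (c) can be sketched as follows. This is a minimal illustration, assuming per-position depths arrive as tab-separated `chrom`, `pos`, `depth` lines (the layout `samtools depth` writes); the helper names and the toy input lines are hypothetical, and only lambda = 27.4 comes from the caption.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Poisson probability of observing depth k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def depth_histogram(lines):
    """Fraction of positions at each depth, from 'chrom<TAB>pos<TAB>depth' lines."""
    counts = Counter(int(line.split("\t")[2]) for line in lines if line.strip())
    total = sum(counts.values())
    return {depth: count / total for depth, count in counts.items()}


# Toy lines in the three-column layout; not data from the paper.
toy = ["chr20\t1\t27", "chr20\t2\t27", "chr20\t3\t30", "chr20\t4\t25"]
observed = depth_histogram(toy)
for depth in sorted(observed):
    # Observed fraction next to the Poisson expectation for lambda = 27.4.
    print(depth, observed[depth], round(poisson_pmf(depth, 27.4), 4))
```

On real data, an excess of positions at very low or very high depth relative to the Poisson curve would point to coverage bias or collapsed repeats.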
Figure 2. Structural variation and SNP genotyping.
(a) Structural variant genotyping sensitivity using Oxford Nanopore Technologies (ONT) reads. Genotypes (GTs) were inferred for a set of 2,414 SVs using both Oxford Nanopore and Platinum Genomes (Illumina) alignments. Using alignments randomly subsampled to a given sequencing depth (n = 3), sensitivity was calculated as the proportion of ONT-derived genotypes that were concordant with Illumina-derived genotypes. (b) Confusion matrix for genotype-calling evaluation. Each cell contains the number of 1000 Genomes sites for a particular nanopolish/platinum genotype combination.
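
The two quantities in this caption, the concordance proportion in (a) and the confusion matrix in (b), can be sketched in a few lines. The function names and the toy genotype lists are hypothetical; the sketch only assumes that both call sets are given in the same site order.

```python
from collections import Counter


def confusion_matrix(ont_calls, illumina_calls):
    """Count sites for every (ONT genotype, Illumina genotype) combination."""
    return Counter(zip(ont_calls, illumina_calls))


def concordance(ont_calls, illumina_calls):
    """Proportion of sites where the two genotype calls agree."""
    pairs = list(zip(ont_calls, illumina_calls))
    return sum(a == b for a, b in pairs) / len(pairs)


# Toy genotypes for four sites; not data from the paper.
ont = ["0/0", "0/1", "1/1", "0/1"]
illumina = ["0/0", "0/1", "0/1", "0/1"]

print(concordance(ont, illumina))
for (ont_gt, illumina_gt), n in sorted(confusion_matrix(ont, illumina).items()):
    print(ont_gt, illumina_gt, n)
```

Off-diagonal cells of the matrix show which genotype classes are confused with each other, which a single concordance number hides.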
Figure 3. Methylation detection using signal-based methods.
(a) SignalAlign methylation probabilities compared to bisulfite sequencing frequencies at all called sites. (b) Nanopolish methylation frequencies compared to bisulfite sequencing at all called sites. (c) SignalAlign methylation probabilities compared to bisulfite sequencing frequencies at sites covered by at least ten reads in the nanopore and bisulfite data sets; reads were not filtered for quality. (d) Nanopolish methylation frequencies compared to bisulfite sequencing at sites covered by at least ten reads in the nanopore and bisulfite data sets. A minimum log-likelihood threshold of 2.5 was applied to remove ambiguous reads. N = sample size, r = Pearson correlation coefficient.
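
The coverage filter and Pearson correlation described in panels (c) and (d) can be sketched as below. The record layout (dictionary keys such as `nanopore_reads`) and the toy values are assumptions made for illustration; only the ten-read minimum comes from the caption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def filter_sites(sites, min_reads=10):
    """Keep sites covered by at least min_reads reads in both data sets."""
    return [
        s for s in sites
        if s["nanopore_reads"] >= min_reads and s["bisulfite_reads"] >= min_reads
    ]


# Toy CpG sites with hypothetical field names; not data from the paper.
sites = [
    {"nanopore_reads": 12, "bisulfite_reads": 30, "nanopore_freq": 0.9, "bisulfite_freq": 0.95},
    {"nanopore_reads": 4, "bisulfite_reads": 25, "nanopore_freq": 0.5, "bisulfite_freq": 0.10},
    {"nanopore_reads": 15, "bisulfite_reads": 18, "nanopore_freq": 0.2, "bisulfite_freq": 0.10},
    {"nanopore_reads": 11, "bisulfite_reads": 40, "nanopore_freq": 0.6, "bisulfite_freq": 0.50},
]
kept = filter_sites(sites)
r = pearson_r([s["nanopore_freq"] for s in kept], [s["bisulfite_freq"] for s in kept])
print(len(kept), round(r, 3))
```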
Figure 4. Repeat modeling and assembly.
(a) A model of expected NG50 contig size when correctly resolving human repeats of a given length and identity. The y axis shows the expected NG50 contig size when repeats of a certain length (x axis) or sequence identity (colored lines) can be consistently resolved. Nanopore assembly contiguity (GM12878 20×, 30×, 35×) is currently limited by low coverage of long reads and a high error rate, making repeat resolution difficult. These assemblies approximately follow the predicted assembly contiguity. The projected assembly contiguity using 30× of ultra-long reads (GM12878 30× ultra) exceeds 30 Mbp. A recent assembly of 65× PacBio P6 data with an NG50 of 26 Mbp is shown for comparison (CHM1 P6). (b) Yield by read length (log10) for ligation, rapid and ultra-long rapid library preparations. (c) Chromosome plot illustrating the contiguity of the nanopore assembly boosted with ultra-long reads. Contig and alignment boundaries, not cytogenetic bands, are represented by a color switch, so regions of continuous color indicate regions of contiguous sequence. White areas indicate unmapped sequence, usually caused by N's in the reference genome. Regions of interest, including the 12 gaps of 50 kb or more in GRCh38 closed by our assembly as well as the MHC (16 Mbp), are outlined in red.
Figure 5. Ultra-long reads, assembly, and telomeres.
(a) A 16-Mbp ultra-long read contig and associated haplotigs are shown spanning the full MHC region. MHC Class I and II regions are annotated along with various HLA genes. Below this contig, the MHC region is enlarged, showing haplotype A and B coverage tracks for the phased nanopore reads. Nanopore reads were aligned back to the polished Canu contig, with colored lines indicating a high fraction of single-nucleotide discrepancies in the read pileups (as displayed by the IGV browser). The many disagreements indicate the contig is a mosaic of both haplotypes. The haplotig A and B tracks show the result of assembling each haplotype read set independently. Below this, the MHC class II region is enlarged, with haplotype A and B raw reads aligned to their corresponding, unpolished haplotigs. The few consensus disagreements between raw reads and haplotigs indicate successful partitioning of the reads into haplotypes. (b) An unresolved, 50-kb bridged scaffold gap on Xq24 remains in the GRCh38 assembly (adjacent to scaffolds AC008162.3 and AL670379.17, shown in green). This gap spans a ∼4.6-kb tandem repeat containing cancer/testis gene family 47 (CT47). This gap is closed by assembly (contig: tig00002632) and has eight tandem copies of the repeat, validated by alignment of 100 kb+ ultra-long reads also containing eight copies of the repeat (light blue with read name identifiers). One read has only six repeats, suggesting the tandem repeated units are variable between homologous chromosomes. (c) Ultra-long reads can predict telomere length. Two 100 kb+ reads map to the subtelomeric region of the chromosome 21 q-arm, each containing 4.9–9.1 kb of the telomeric (TTAGGG)n repeat. (d) Telomere length estimates showing variable lengths between non-homologous chromosomes.
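
The idea behind panel (c), reading telomere length off a single long read, can be illustrated with a deliberately simplified sketch: find the longest uninterrupted run of the TTAGGG unit in a read and report its length in bases. This is not the method used in the paper. Exact string matching is an assumption that real nanopore reads, which contain base-calling errors, would violate, and the function name and toy read are hypothetical.

```python
import re


def longest_repeat_run(read, unit="TTAGGG"):
    """Length in bases of the longest exact tandem run of unit in read."""
    runs = re.findall(f"(?:{re.escape(unit)})+", read)
    return max((len(run) for run in runs), default=0)


# Toy read: five tandem copies, a two-base interruption, then two more copies.
toy_read = "ACGT" + "TTAGGG" * 5 + "AC" + "TTAGGG" * 2
print(longest_repeat_run(toy_read))
```

A practical estimate would need to tolerate mismatches and indels within the repeat, for example by aligning the read to a repeat model instead of requiring exact copies.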
