BMC Genomics. 2019 Jan 9;20(1):23.
doi: 10.1186/s12864-018-5381-7.

Evaluation of strategies for the assembly of diverse bacterial genomes using MinION long-read sequencing

Sarah Goldstein et al. BMC Genomics. 2019.

Abstract

Background: Short-read sequencing technologies have made microbial genome sequencing cheap and accessible. However, closing genomes is often costly and assembling short reads from genomes that are repetitive and/or have extreme %GC content remains challenging. Long-read, single-molecule sequencing technologies such as the Oxford Nanopore MinION have the potential to overcome these difficulties, although the best approach for harnessing their potential remains poorly evaluated.

Results: We sequenced nine bacterial genomes spanning a wide range of GC contents using Illumina MiSeq and Oxford Nanopore MinION sequencing technologies to determine the advantages of each approach, both individually and combined. Assemblies using only MiSeq reads were highly accurate but lacked contiguity, a deficiency that was partially overcome by adding MinION reads to these assemblies. Even more contiguous genome assemblies were generated by using MinION reads for initial assembly, but these assemblies were more error-prone and required further polishing. This was especially pronounced when Illumina libraries were biased, as was the case for our strains with both high and low GC content. Increased genome contiguity dramatically improved the annotation of insertion sequences and secondary metabolite biosynthetic gene clusters, likely because long reads can disambiguate these highly repetitive but biologically important genomic regions.

Conclusions: Genome assembly using short reads is challenged by repetitive sequences and extreme GC contents. Our results indicate that these difficulties can be largely overcome by using single-molecule, long-read sequencing technologies such as the Oxford Nanopore MinION. Using MinION reads for assembly followed by polishing with Illumina reads generated the most contiguous genomes, with sufficient accuracy to enable the annotation of important but difficult-to-sequence genomic features such as insertion sequences and secondary metabolite biosynthetic gene clusters. The combination of Oxford Nanopore and Illumina sequencing can therefore cost-effectively advance studies of microbial evolution and genome-driven drug discovery.

Keywords: Genome assembly; Genome sequencing; Insertion sequences; Oxford Nanopore MinION; Secondary metabolites.

Conflict of interest statement

Ethics approval and consent to participate

N/A

Consent for publication

N/A

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
MinION reads improve assembly contiguity. The number of contigs (left), N50 (in Mbp, center), and assembly length (in Mbp, right) are shown for each of the MiSeq-based (SPAdes, Unicycler, SPAdes-hybrid, and Unicycler-hybrid) and MinION-based (Canu, Canu+Nanopolish, Canu+Pilon) genome assemblies. Results for Pseudonocardia, Aeromonas, and Flavobacterium are shown in blue, red, and green, respectively
Fig. 2
Comparison of Pseudonocardia assemblies generated during this study. (A): Heatmaps depicting Mash distances between the assemblies of each Pseudonocardia strain based on their shared k-mer content. Whiter colors indicate greater Mash distances between assemblies. (B): Mashtree analysis showing the relationships of all Pseudonocardia assemblies to each other, based on Mash distances. The scale bar represents a Mash distance of 0.003
Fig. 3
Quantification of insertions/deletions (indels, left) and single nucleotide polymorphisms (SNPs, right) in all strains sequenced during this study, as determined by aligning each assembly to the Canu+Pilon assembly for that strain as a reference
Fig. 4
Anvi’o analysis of annotation quality. Strains are grouped by species with Pseudonocardia shown in blue, Aeromonas shown in red, and Flavobacterium shown in green. Each heatmap row corresponds to an individual strain and each column corresponds to a unique assembly method
Fig. 5
The effect of coverage on Canu genome assembly contiguity. The number of contigs (top left), N50 (in Mbp, top center), assembly length (in Mbp, top right), SNPs per 1000 bp (bottom right), and indels per 1000 bp (bottom left) are shown for subsets of the Ps JKS002128 (blue), Av JG3 (red), and Fs ARS-166-14 (green) MinION reads used in Fig. 1
Fig. 6
Ps JKS002128 genome assembly quality affects secondary metabolite biosynthetic gene cluster (BGC) annotation. (A) Homologies between BGCs predicted for each Ps JKS002128 assembly, with each row representing a unique BGC in the Ps JKS002128 genome. Filled boxes indicate the BGCs found in each assembly, colored according to the type of secondary metabolite that each BGC is predicted to encode. White boxes indicate BGCs that were not found in that assembly. Some BGCs occur on multiple contigs or are separated into multiple gene clusters on the same assembly, indicated by either two or three polygons within a single box. BGCs may still be fragmented even if represented by a single box. (B) The total number of complete and fragmented BGCs predicted in each Ps JKS002128 genome assembly
Fig. 7
Fs ARS-166-14 genome assembly quality affects insertion sequence annotation. Both the total number of hits and the number of hits with > 70% amino acid identity to insertion sequences in the ISfinder database are shown. The former likely includes false-positive annotations while the latter is more conservative
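
Figs. 1 and 5 summarize contiguity with three numbers: contig count, N50, and assembly length. For readers unfamiliar with N50, the following is a minimal sketch of how these values could be computed from a list of contig lengths, using the common definition of N50 as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name `assembly_stats` and the toy lengths are hypothetical illustrations; they are not the authors' code or data, and the study may have used a dedicated tool for these metrics.

```python
from typing import Dict, Iterable


def assembly_stats(contig_lengths: Iterable[int]) -> Dict[str, int]:
    """Return contig count, total length (bp), and N50 (bp).

    Hypothetical helper for illustration only. N50 is taken here as the
    length of the contig at which the running sum of lengths, sorted
    longest first, reaches at least half of the total assembly length.
    """
    lengths = sorted((int(n) for n in contig_lengths), reverse=True)
    if not lengths or lengths[-1] <= 0:
        raise ValueError("need at least one contig, all with positive length")

    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # Integer comparison avoids rounding issues with total / 2.
        if 2 * running >= total:
            n50 = length
            break

    return {"contigs": len(lengths), "total_bp": total, "n50_bp": n50}


if __name__ == "__main__":
    # Toy contig lengths (bp), invented for this example.
    fragmented = [2, 2, 2, 3, 3, 4, 8, 8]
    closed = [32]
    print(assembly_stats(fragmented))
    print(assembly_stats(closed))
```

Both toy assemblies have the same total length, but the single-contig one has a higher N50 and a lower contig count, which is the pattern the captions describe as improved contiguity.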
