G3 (Bethesda). 2021 Jun 17;11(6):jkab083.
doi: 10.1093/g3journal/jkab083.

Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes


Eric S Tvedte et al. G3 (Bethesda).

Abstract

The newest generation of DNA sequencing technology is highlighted by the ability to generate sequence reads hundreds of kilobases in length. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have pioneered competitive long-read platforms, with more recent work focused on improving sequencing throughput and per-base accuracy. We used whole-genome sequencing data produced by three PacBio protocols (Sequel II CLR, Sequel II HiFi, RS II) and two ONT protocols (Rapid Sequencing and Ligation Sequencing) to compare assemblies of the bacterium Escherichia coli and the fruit fly Drosophila ananassae. In both organisms tested, Sequel II assemblies had the highest consensus accuracy, even after accounting for differences in sequencing throughput. ONT and PacBio CLR produced longer reads than PacBio RS II and HiFi, and genome contiguity was highest when assembling these datasets. ONT Rapid Sequencing libraries had the fewest chimeric reads and quantified E. coli plasmids better than ligation-based libraries. The quality of assemblies can be enhanced by adopting hybrid approaches using Illumina libraries for bacterial genome assembly or polishing eukaryotic genome assemblies, and an ONT-Illumina hybrid approach would be more cost-effective for many users. Genome-wide DNA methylation could be detected using both technologies; however, ONT libraries enabled the identification of a broader range of known E. coli methyltransferase recognition motifs in addition to undocumented D. ananassae motifs. The ideal choice of long-read technology may depend on several factors, including the question or hypothesis under examination. No single technology outperformed the others in all metrics examined.

Keywords: Drosophila ananassae; bacterial genomics; fly genomics; genomics; sequencing.


Figures

Figure 1
Read composition of E. coli long read libraries. Bases sequenced per read length were calculated for 1 kbp bins in each library. Sequenced bases are shown as (A) raw numbers and percentages for complete datasets and (B) percentages for random subsamples of 100X sequencing depth. Vertical dotted lines correspond to maximum read length for each library.
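The binning described in this caption can be sketched as follows. This is a minimal illustration, assuming read lengths have already been extracted into a list of integers; the function names `bases_per_length_bin` and `as_percentages` are hypothetical and are not taken from the study's code.

```python
from collections import defaultdict


def bases_per_length_bin(read_lengths, bin_size=1000):
    """Sum sequenced bases per read-length bin.

    Each bin is keyed by its lower bound, so key 1000 covers reads of
    1000 to 1999 bp when bin_size is 1000 (1 kbp bins).
    """
    totals = defaultdict(int)
    for length in read_lengths:
        bin_start = (length // bin_size) * bin_size
        totals[bin_start] += length  # a read contributes all of its bases
    return dict(sorted(totals.items()))


def as_percentages(binned):
    """Convert per-bin base counts to percentages of all sequenced bases."""
    total = sum(binned.values())
    return {start: 100.0 * bases / total for start, bases in binned.items()}


# Toy input (made-up lengths, not data from the paper)
lengths = [500, 1500, 1800, 12000]
binned = bases_per_length_bin(lengths)
percent = as_percentages(binned)
longest_read = max(lengths)  # position of the vertical dotted line
```

Keying bins by their lower bound keeps the raw-count and percentage views aligned, so the same keys can be plotted for both panels.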
Figure 2
Evidence of DNA methylation in E. coli E2348/69 using long-read sequencing. Methylation at (A) GATC and (B) CCWGG motifs is supported by ONT LIG sequencing. Top: an example motif is shown, with individual reads plotted to the region shown in red. The expected raw signal distribution using a canonical base model (i.e., unmethylated DNA) is shown in grey. The location of known methylation in E. coli is highlighted. Bottom: the fraction of reads supporting a modification event is reported for each position in the motif, and the distribution of these fractions is shown. Higher values indicate the motif is more ubiquitously methylated in the E. coli genome. Distributions are shown for 11,313 GATC motifs and 20,063 CCWGG motifs. (C) ROC curves for detection of methylation at known motifs. GATC and CCWGG motifs were considered ground truth, and modified base statistics at these sites were compared against statistics at other base modification sites. ROC curves for ONT RAPID (∼60X depth) and ONT LIG (∼3280X) are plotted, with the corresponding area under the curve (AUC) and average precision (AP) values shown for each condition. (D) Association of m6A modifications assessed using PacBio Sequel II CLR and ONT LIG sequencing. All m6A modifications with a PacBio modification QV >20 were cross-referenced for corresponding dampened fraction values in ONT LIG sequencing. A random sample of 5000 m6A modifications is plotted (total = 48,664). A linear regression was fitted to the data.
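The AUC reported in panel (C) can be illustrated with a small sketch. It assumes each site has a single per-site score (for example, the fraction of reads supporting a modification) and that sites are already split into known-motif sites and other sites. The function `roc_auc` is a hypothetical name, and the pairwise rank formulation used here is a standard equivalent of the area under the ROC curve, not necessarily how the authors computed it.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive site scores
    higher than a randomly chosen negative site, counting ties as half.
    This is quadratic in the number of sites, so it suits small examples.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores (made up for illustration, not data from the paper)
known_motif_sites = [0.9, 0.8, 0.4]
other_sites = [0.5, 0.3, 0.1]
auc = roc_auc(known_motif_sites, other_sites)  # 8 of 9 pairs ranked correctly
```

For genome-scale site counts, a sort-based or library implementation would be the practical choice; the quadratic version is used here only because it makes the definition explicit.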
Figure 3
Read composition of D. ananassae long read libraries. Bases sequenced per read length were calculated for 1 kbp bins in each library. Reads and sequenced bases are shown as raw numbers and as percentages for complete datasets. Vertical dotted lines correspond to maximum read length for each library.
Figure 4
Nx plot of D. ananassae assemblies. Plot of Nx values for D. ananassae assemblies produced in this study. Each Nx value is the length of the shortest contig in the smallest set of largest contigs that together make up at least X% of the total assembly size. Nx values were calculated in QUAST-LG. Assemblies produced in this study were compared to contigs (broken scaffolds) of two previous assemblies of D. ananassae (Drosophila 12 Genomes Consortium 2007; Miller et al. 2018).
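The Nx definition in this caption can be made concrete with a short sketch. The study computed Nx in QUAST-LG; the function `nx` below is a hypothetical standalone illustration of the same definition and does not reproduce QUAST-LG's code.

```python
def nx(contig_lengths, x):
    """Return the Nx contig length for a list of contig lengths.

    Contigs are taken from longest to shortest until their combined
    length reaches X% of the total assembly size; the length of the last
    contig added is the Nx value (x=50 gives the familiar N50).
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * x / 100.0
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= threshold:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy assembly of six contigs totalling 300 units (made-up values)
contigs = [100, 80, 50, 30, 20, 20]
n50 = nx(contigs, 50)  # 100 + 80 = 180 >= 150, so N50 is 80
n90 = nx(contigs, 90)  # the cumulative sum first reaches 270 at a 20-unit contig
```

Evaluating `nx` across a range of x values yields the curve drawn in an Nx plot: more contiguous assemblies keep higher values further along the x axis.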
Figure 5
Assembly of six D. ananassae chromosome arms. (A) Chromosome arm contigs from the Dana.UMIGS assembly are labeled with lines connecting polytene map coordinates with estimated locus positions generated with BLAST searches. Original images for polytene maps are from Tobari (1993). Permissions for the use of polytene map images were purchased from Karger Publishers. (B) Alignments between Dana.UMIGS chromosome arm contigs and two representative test assemblies in this study (ONT Canu, PB CLR Flye). Alignments >50 kbp were identified by minimap2, and dot plots were generated using NUCmer. Numbers in parentheses indicate the number of contigs (broken scaffolds) corresponding to chromosome arms in test assemblies.
Figure 6
Distributions of library sequencing depth across the D. ananassae genome. (A) Visualization of library sequencing depth in multiple D. ananassae genome regions. Reads were mapped to the Dana.UMIGS assembly using minimap2 (ONT/PacBio) and bwa mem (Illumina). After removing secondary alignments, sequencing depth for each library was quantified using the purge_haplotigs “hist” command. To estimate sequencing depth of chromosome Y, chromosome 4, and LGT contigs, the number of positions at each depth value was summed for all contigs assigned to those regions in the Dana.UMIGS assembly. To estimate sequencing depth of euchromatic (E) and heterochromatic (H) regions in chromosomes X, 2, and 3, BAM files were subsetted with SAMTools using user-defined contig coordinates. Euchromatic regions were approximated as contig regions containing genes from the D. ananassae polytene map (Supplementary Table S11). Heterochromatic regions were approximated as the contig coordinates outside euchromatic intervals. The purge_haplotigs “hist” command was run again on the subsetted BAM files. Because positions with a depth value of zero are counted across the entirety of a contig (e.g., the positions with depth=0 in the chrX euchromatic BAM file are the sum of euchromatic positions with zero depth plus all heterochromatic positions), the counts of positions with zero depth were omitted from this analysis in each dataset. (B) Representation of heterochromatic read depth relative to euchromatic read depth. (a) For chromosomes X, 2, and 3, relative representation of heterochromatic regions was calculated as mode_H/mode_E. (b) For chromosome 4, relative representation of LGT regions was calculated as mode_chr4LGT/mode_chr4nonLGT. Red and blue values indicate the lowest and highest ratios in each column, respectively.
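The ratio in panel (B) can be sketched as below. The sketch assumes each region's depth histogram is available as a mapping from depth value to number of positions; that representation, and the names `modal_depth` and `relative_representation`, are assumptions made for illustration and do not describe the output format of purge_haplotigs.

```python
def modal_depth(depth_counts):
    """Return the most common non-zero depth in a depth histogram.

    depth_counts maps a depth value to the number of positions observed
    at that depth. Zero-depth positions are ignored, mirroring the
    caption's exclusion of zero-depth counts. Ties go to the lower depth.
    """
    nonzero = {depth: n for depth, n in depth_counts.items() if depth > 0}
    if not nonzero:
        raise ValueError("histogram has no positions with non-zero depth")
    return max(nonzero, key=lambda depth: (nonzero[depth], -depth))


def relative_representation(hetero_counts, eu_counts):
    """Heterochromatic modal depth divided by euchromatic modal depth."""
    return modal_depth(hetero_counts) / modal_depth(eu_counts)


# Toy histograms (made-up counts, not data from the paper)
euchromatin = {0: 5000, 48: 10, 50: 40, 52: 12}
heterochromatin = {0: 9000, 20: 8, 25: 30, 30: 5}
ratio = relative_representation(heterochromatin, euchromatin)  # 25 / 50
```

A ratio below 1 in this sketch means the heterochromatic regions are covered at a lower typical depth than the euchromatic regions of the same chromosome. The same function applies to the chromosome 4 comparison by passing the LGT and non-LGT histograms.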

