G3 (Bethesda). 2021 Jun 17;11(6):jkab083. doi: 10.1093/g3journal/jkab083.

Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes

Eric S Tvedte¹, Mark Gasser¹, Benjamin C Sparklin¹, Jane Michalski¹,², Carl E Hjelmen³, J Spencer Johnston⁴, Xuechu Zhao¹, Robin Bromley¹, Luke J Tallon¹, Lisa Sadzewicz¹, David A Rasko¹,², Julie C Dunning Hotopp¹,²,⁵

Affiliations

¹ Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA.
² Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201, USA.
³ Department of Biology, Texas A&M University, College Station, TX 77843, USA.
⁴ Department of Entomology, Texas A&M University, College Station, TX 77843, USA.
⁵ Greenebaum Cancer Center, University of Maryland School of Medicine, Baltimore, MD 21201, USA.

PMID: 33768248
PMCID: PMC8495745
DOI: 10.1093/g3journal/jkab083

Eric S Tvedte et al. G3 (Bethesda). 2021.


Abstract

The newest generation of DNA sequencing technology is distinguished by its ability to generate sequence reads hundreds of kilobases in length. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have pioneered competing long-read platforms, with more recent work focused on improving sequencing throughput and per-base accuracy. We used whole-genome sequencing data produced by three PacBio protocols (Sequel II CLR, Sequel II HiFi, RS II) and two ONT protocols (Rapid Sequencing and Ligation Sequencing) to compare assemblies of the bacterium Escherichia coli and the fruit fly Drosophila ananassae. In both organisms, Sequel II assemblies had the highest consensus accuracy, even after accounting for differences in sequencing throughput. ONT and PacBio CLR produced longer reads than PacBio RS II and HiFi, and genome contiguity was highest when assembling these datasets. ONT Rapid Sequencing libraries had the fewest chimeric reads and gave better quantification of E. coli plasmids than ligation-based libraries. Assembly quality can be improved by hybrid approaches that use Illumina libraries, either for bacterial genome assembly or for polishing eukaryotic genome assemblies, and an ONT-Illumina hybrid approach would be more cost-effective for many users. Genome-wide DNA methylation could be detected using both technologies; however, ONT libraries enabled the identification of a broader range of known E. coli methyltransferase recognition motifs as well as undocumented D. ananassae motifs. The ideal choice of long-read technology may depend on several factors, including the question or hypothesis under examination. No single technology outperformed the others in all metrics examined.

Keywords: Drosophila ananassae; bacterial genomics; fly genomics; genomics; sequencing.

Figures

**Figure 1**
Read composition of *E. coli* long read libraries. Bases sequenced per read length were calculated for 1 kbp bins in each library. Sequenced bases are shown as (A) raw numbers and percentages for complete datasets and (B) percentages for random subsamples of 100X sequencing depth. Vertical dotted lines correspond to maximum read length for each library.
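
The binning described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `bases_per_length_bin` and the toy read lengths are hypothetical, and the caption does not say which tool produced the published counts.

```python
from collections import Counter


def bases_per_length_bin(read_lengths, bin_size=1000):
    """Sum sequenced bases per read-length bin (hypothetical helper).

    Returns {bin_start: (bases, percent_of_all_bases)}, ordered by bin start.
    A read of length L falls in the bin starting at (L // bin_size) * bin_size.
    """
    totals = Counter()
    for length in read_lengths:
        totals[(length // bin_size) * bin_size] += length
    grand_total = sum(totals.values())
    return {
        start: (bases, 100.0 * bases / grand_total)
        for start, bases in sorted(totals.items())
    }


# Toy read lengths in bp; illustrative values only, not data from the study.
toy_reads = [500, 1500, 1800, 12000]
binned = bases_per_length_bin(toy_reads)
```

Counting bases rather than reads per bin gives long reads a weight proportional to their length, which is why panels of this kind emphasize the long tail of a library.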

**Figure 2**
Evidence of DNA methylation in *E. coli* E2348/69 using long-read sequencing. Methylation at (A) GATC and (B) CCWGG motifs is supported by ONT LIG sequencing. Top: an example motif is shown, with individual reads plotted to the region shown in red. The expected raw signal distribution using a canonical base model (=unmethylated DNA) is shown in grey. The location of known methylation in *E. coli* is highlighted. Bottom: the fraction of reads supporting a modification event is reported for each position in the motif, and the distribution of these fractions is shown. Higher values indicate that the motif is more ubiquitously methylated in the *E. coli* genome. Distributions are shown for 11,313 GATC motifs and 20,063 CCWGG motifs. (C) ROC curves for detection of methylation at known motifs. GATC and CCWGG motifs were considered ground truth, and modified base statistics at these sites were compared against statistics at other base modification sites. ROC curves for ONT RAPID (∼60X depth) and ONT LIG (∼3280X depth) are plotted, with the corresponding area under the curve (AUC) and average precision (AP) values shown for each condition. (D) Association of m6A modifications assessed using PacBio Sequel II CLR and ONT LIG sequencing. All m6A modifications with a PacBio modification QV >20 were cross-referenced for corresponding dampened fraction values in ONT LIG sequencing. A random sample of 5000 m6A modifications is plotted (total = 48,664). A linear regression was fitted to the data.

**Figure 3**
Read composition of *D. ananassae* long read libraries. Bases sequenced per read length were calculated for 1 kbp bins in each library. Reads and sequenced bases are shown as raw numbers and as percentages for complete datasets. Vertical dotted lines correspond to maximum read length for each library.

**Figure 4**
Nx plot of *D. ananassae* assemblies. Nx values are plotted for the *D. ananassae* assemblies produced in this study. Each Nx value is the length of the shortest contig in the set of largest contigs that together total X% of the assembly size. Nx values were calculated in QUAST-LG. Assemblies produced in this study were compared to contigs (broken scaffolds) of two previous assemblies of *D. ananassae* (Drosophila 12 Genomes Consortium 2007; Miller *et al.* 2018).
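
As a worked illustration of the Nx definition in this caption, the following is a minimal Python sketch. The function name `nx` and the contig lengths are hypothetical; the published values were calculated in QUAST-LG, whose implementation may differ in detail.

```python
def nx(contig_lengths, x):
    """Return the Nx value for a list of contig lengths (hypothetical helper).

    Contigs are taken from longest to shortest until their cumulative length
    reaches x% of the total assembly size; the length of the last contig
    taken is the Nx value.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * x / 100
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length
    return lengths[-1]  # guards against floating-point rounding at x = 100


# Toy assembly with a total size of 300; illustrative values only.
toy_contigs = [100, 80, 50, 30, 20, 10, 10]
n50 = nx(toy_contigs, 50)
n90 = nx(toy_contigs, 90)
```

For the toy assembly, the two largest contigs (100 + 80) already exceed half of the total, so the N50 is 80; reaching 90% requires the five largest contigs, so the N90 is 20.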

**Figure 5**
Assembly of six *D. ananassae* chromosome arms. (A) Chromosome arm contigs from the Dana.UMIGS assembly are labeled with lines connecting polytene map coordinates with estimated locus positions generated with BLAST searches. Original images for polytene maps are from Tobari (1993). Permissions for the use of polytene map images were purchased from Karger Publishers. (B) Alignments between Dana.UMIGS chromosome arm contigs and two representative test assemblies in this study (ONT Canu, PB CLR Flye). Alignments >50 kbp were identified by minimap2 and dot plots were generated using NUCmer. Numbers in parentheses indicate the number of contigs (broken scaffolds) corresponding to chromosome arms in test assemblies.

**Figure 6**
Distributions of library sequencing depth across the *D. ananassae* genome. (A) Visualization of library sequencing depth in multiple *D. ananassae* genome regions. Reads were mapped to the Dana.UMIGS assembly using minimap2 (ONT/PacBio) and bwa mem (Illumina). After removing secondary alignments, sequencing depth for each library was quantified using the purge_haplotigs “hist” command. To estimate sequencing depth of chromosome Y, chromosome 4, and LGT contigs, the number of positions at each depth value was summed across all contigs assigned to those regions in the Dana.UMIGS assembly. To estimate sequencing depth of euchromatic (E) and heterochromatic (H) regions in chromosomes X, 2, and 3, BAM files were subset with SAMtools using user-defined contig coordinates. Euchromatic regions were approximated as contig regions containing genes from the *D. ananassae* polytene map (Supplementary Table S11). Heterochromatic regions were approximated as the contig coordinates outside euchromatic intervals. The purge_haplotigs “hist” command was run again on the subset BAM files. Because zero-depth counts are computed over the entirety of a contig (*e.g.*, the number of positions with depth=0 in the chrX euchromatic BAM file is the sum of euchromatic positions with zero depth plus all heterochromatic positions), the counts of zero-depth positions in each dataset were omitted from this analysis. (B) Representation of heterochromatic read depth relative to euchromatic read depth. ^aFor chromosomes X, 2, and 3, relative representation of heterochromatic regions was calculated as mode^H/mode^E. ^bFor chromosome 4, relative representation of LGT regions was calculated as mode^chr4LGT/mode^chr4nonLGT. Red and blue values indicate the lowest and highest ratios in each column, respectively.
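
The ratio in panel (B) can be illustrated with a short Python sketch. It assumes each depth histogram is available as a mapping from depth value to number of positions; the helper names and the toy histograms are hypothetical, and the parsing of purge_haplotigs output is not shown.

```python
def depth_mode(depth_histogram):
    """Return the most frequent nonzero depth (hypothetical helper).

    depth_histogram maps depth value -> number of positions at that depth.
    Zero-depth positions are ignored, mirroring the caption's omission of
    zero-depth counts.
    """
    nonzero = {d: n for d, n in depth_histogram.items() if d > 0}
    if not nonzero:
        raise ValueError("histogram has no positions with nonzero depth")
    return max(nonzero, key=nonzero.get)


def relative_representation(hetero_histogram, eu_histogram):
    """Return mode^H / mode^E for one library on one chromosome."""
    return depth_mode(hetero_histogram) / depth_mode(eu_histogram)


# Toy histograms; illustrative values only, not data from the study.
toy_hetero = {0: 1000, 10: 5, 20: 50, 30: 8}
toy_eu = {0: 900, 40: 70, 50: 10}
toy_ratio = relative_representation(toy_hetero, toy_eu)
```

A ratio below 1 would indicate that heterochromatic regions are underrepresented relative to euchromatic regions in that library; the same calculation applies to the chromosome 4 LGT comparison with the two histograms swapped in.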

References

1. Adams M, McBroome J, Maurer N, Pepper-Tunick E, Saremi Nedda F, et al. 2020. One fly–one genome: chromosome-scale genome assembly of a single outbred Drosophila melanogaster. Nucleic Acids Res. 48:e75.
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410.
3. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, et al. 2020. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21:30.
4. Ardui S, Ameur A, Vermeesch JR, Hestand MS. 2018. Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids Res. 46:2159–2168.
5. Bailey TL, Johnson J, Grant CE, Noble WS. 2015. The MEME suite. Nucleic Acids Res. 43:W39–W49.
