Brief Bioinform. 2023 Mar 19;24(2):bbad050.
doi: 10.1093/bib/bbad050.

Comparison of long- and short-read metagenomic assembly for low-abundance species and resistance genes

Sosie Yorki et al. Brief Bioinform.

Abstract

Recent technological and computational advances have made metagenomic assembly a viable approach to achieving high-resolution views of complex microbial communities. In previous benchmarking, short-read (SR) metagenomic assemblers had the highest accuracy, long-read (LR) assemblers generated the most contiguous sequences and hybrid (HY) assemblers balanced length and accuracy. However, no assessments have specifically compared the performance of these assemblers on low-abundance species, which include clinically relevant organisms in the gut. We generated semi-synthetic LR and SR datasets by spiking small and increasing amounts of Escherichia coli isolate reads into fecal metagenomes and, using different assemblers, examined E. coli contigs and the presence of antibiotic resistance genes (ARGs). For ARG assembly, although SR assemblers recovered more ARGs with high accuracy, even at low coverages, LR assemblies allowed for the placement of ARGs within longer, E. coli-specific contigs, thus pinpointing their taxonomic origin. HY assemblies identified resistance genes with high accuracy and had lower contiguity than LR assemblies. Each assembler type's strengths were maintained even when our isolate was spiked in with a competing strain, which fragmented and reduced the accuracy of all assemblies. For strain characterization and determining gene context, LR assembly is optimal, while for base-accurate gene identification, SR assemblers outperform other options. HY assembly offers contiguity and base accuracy, but requires generating data on multiple platforms, and may suffer high misassembly rates when strain diversity exists. Our results highlight the trade-offs associated with each approach for recovering low-abundance taxa, and that the optimal approach is goal-dependent.

Keywords: antibiotic resistance; assembly benchmarking; long reads; low abundance; metagenomic assembly; plasmid assembly.
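
As a rough illustration of the spike-in design described in the abstract, the number of isolate reads needed to reach a target coverage can be estimated as coverage × genome length / mean read length. The sketch below assumes only this expected-depth relation. The function name, the 5 Mb genome size, the read lengths and the listed coverage levels are hypothetical values chosen for illustration and are not taken from the article.

```python
import math


def reads_for_coverage(target_coverage, genome_length_bp, mean_read_length_bp):
    """Estimate how many reads give an expected depth of `target_coverage`.

    Uses the simple relation: depth = n_reads * mean_read_length / genome_length.
    """
    if target_coverage < 0:
        raise ValueError("target_coverage must be non-negative")
    if genome_length_bp <= 0 or mean_read_length_bp <= 0:
        raise ValueError("lengths must be positive")
    return math.ceil(target_coverage * genome_length_bp / mean_read_length_bp)


# Hypothetical inputs: a 5 Mb isolate genome, 150 bp short reads,
# and 10 kb long reads. None of these values come from the article.
GENOME_BP = 5_000_000
for coverage in (1, 5, 10, 50):
    n_short = reads_for_coverage(coverage, GENOME_BP, 150)
    n_long = reads_for_coverage(coverage, GENOME_BP, 10_000)
    print(f"{coverage}x: {n_short} short reads, {n_long} long reads")
```

Real read sets vary in length and quality, so an actual subsampling step would likely work from total bases rather than read counts.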


Figures

Figure 1
Semi-synthetic datasets for assessment of metagenomic assembly of low-abundance species. E. coli isolate reads were computationally added to metagenomic sequencing reads from a human fecal sample at 1x–50x spike-in coverage. SR, LR and HY assemblers were used to assemble the semi-synthetic datasets. We assessed the ability of each assembler to recover the ARGs within the well-characterized, spiked-in isolate.
Figure 2
LR and HY assemblers generated more contiguous assemblies of E. coli in metagenomes. Assembly statistics for each coverage level, averaged across all three isolates individually spiked into the B1 background. (A) Percent of the E. coli genome present in the assembly, calculated by tabulating contigs >500 bp with alignment identity >95% to the spiked-in isolate’s reference genome (default parameters from metaQUAST). (B) NGA50 of E. coli isolate contigs in each assembly. Data are only shown at ≥3x coverage, where the sum of the reference contig alignments exceeded 50% of the reference genome length. (C) Percent identity of the longest alignment within a single contig to the E. coli isolate genome. (D) Number of misassemblies (including translocations and relocations) identified by metaQUAST. Error bars show +/- 1 S.D.
Figure 3
metaFlye (with and without Pilon polishing) generated the most contiguous plasmid assemblies. Circular depictions of E. coli contigs in the single 150 kb plasmid present in the I3 isolate, assembled from spike-ins into the B1 background. Areas of correct assembly (blue) and misassemblies (red), as determined by metaQUAST, are shown for increasing spike-in coverages, ranging from 0x (innermost band) to 50x (outermost blue/red band). The outermost yellow and purple band indicates locations of ARGs (purple) and transposons (yellow) in the isolate genome. Assemblies are shown for (A) MEGAHIT (SR), (B) metaSPAdes (SR), (C) OPERA-MS (HY-SL), (D) metaFlye with Pilon polishing (HY-LS), (E) metaFlye (LR). Similar trends were seen across all 15 plasmids (SI Figure S2).
Figure 4
SR assemblers identified more E. coli isolate ARGs than LR and HY at coverages <5x. For each metagenomic assembly in the B1 background, the completeness of ARGs present in the E. coli isolate sequence is shown, averaged across spike-in experiments using the I1, I2 and I3 isolates, which contained 66, 70 and 75 ARGs, respectively. ARGs identified as ‘Strict’ or ‘Perfect’ hits using RGI are shown (Methods). (A) MEGAHIT; (B) metaSPAdes; (C) OPERA-MS; (D) metaFlye with Pilon polishing; (E) metaFlye.
Figure 5
HY and LR assemblies contain long, species-specific contigs. Comparison of contig lengths for E. coli-specific versus non-E. coli-specific chromosome contigs for the I3 isolate spiked into the B1 background. Each E. coli contig >1 kb that contained chromosomal ARGs from the I3 isolate (>99% assembly length) is represented by a dot (blue = E. coli-specific; red = non-E. coli-specific). Contigs were considered species-specific if E. coli was the only top BLAST hit when searched against the Refseq database. The red dashed line shows the length of the I3 isolate’s chromosome. Data shown are for (A) MEGAHIT, (B) metaSPAdes, (C) OPERA-MS, (D) metaFlye with Pilon polishing, (E) metaFlye. Results for the I1 and I2 isolates spiked into the B1 background are shown in SI Figure S4.
Figure 6
For spike-ins with an equal abundance of two competing strains, assemblies were more fragmented and less accurate. Each column displays metrics for a different combination of isolate(s) spiked into background B1. (A) Isolate I1 only; (B) isolates I1 and I2 (99.7% ANI) at equal abundance; (C) isolates I1 and I3 (97% ANI) at equal abundance. Each row represents an assembly metric: (i) Percent isolate I1 assembled. The text on each graph reflects the maximum percent of the I1 genome assembled at 50x coverage out of all assemblers to show how strain multiplicity reduces genome completeness. (ii) Target E. coli (isolate I1) NGA50 (kb). (iii) Percent identity of the strain I1. (iv) The number of misassemblies in isolate I1’s assembly. Bottom right panel: Text indicates OPERA-MS misassembly values out-of-bounds.
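
The NGA50 values reported in Figures 2 and 6 can be read with the following simplified sketch in mind. It assumes the common definition of NGA50: sort the aligned blocks (contigs split at misassembly breakpoints) from longest to shortest, and report the length of the block at which the running total first reaches 50% of the reference genome length. This is not metaQUAST's implementation. The function name and the input values are hypothetical.

```python
def nga50(aligned_block_lengths, reference_length):
    """Return the NGA50 of a set of aligned block lengths, or None.

    None is returned when the aligned blocks together cover less than
    half of the reference, in which case NGA50 is undefined.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    threshold = reference_length / 2
    running_total = 0
    # Walk the blocks from longest to shortest, accumulating length.
    for length in sorted(aligned_block_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return None


# Hypothetical block lengths (bp) against a hypothetical 1,000 bp reference.
print(nga50([400, 300, 200, 100], 1000))
print(nga50([100, 50], 1000))
```

The `None` case corresponds to the note in the Figure 2 caption that NGA50 is shown only where the summed alignments exceed 50% of the reference genome length.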


