[Preprint]. 2023 Jul 8:2023.07.07.548136.
doi: 10.1101/2023.07.07.548136.

Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs

Gaëtan Benoit et al. bioRxiv.

Abstract

We introduce a novel metagenomics assembler for high-accuracy long reads. Our approach, implemented as metaMDBG, combines highly efficient de Bruijn graph assembly in minimizer space with both a multi-k′ approach for dealing with variations in genome coverage depth and an abundance-based filtering strategy for simplifying strain complexity. The resulting algorithm is more efficient than the state of the art while producing better assembly results. Across a range of data sets, metaMDBG was 1.5 to 12 times faster than competing assemblers and required between one-tenth and one-thirtieth of the memory. We obtained up to twice as many high-quality circularised prokaryotic metagenome-assembled genomes (MAGs) on the most complex communities, and a better recovery of viruses and plasmids. metaMDBG performs particularly well for abundant organisms whilst being robust to the presence of strain diversity. As a result, it is possible for the first time to efficiently reconstruct the majority of complex communities by abundance as near-complete MAGs.
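To make the phrase "de Bruijn graph assembly in minimizer space" more concrete, the following is a minimal sketch of how a read could be reduced to an ordered list of minimizers and then cut into k′-min-mers (windows of k′ consecutive minimizers), which would play the role of graph nodes. It is an illustration under stated assumptions, not metaMDBG's implementation: the hash function, the `density` parameter and all function names are hypothetical, and strand handling (canonical k-mers) is omitted.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_minimizers(read: str, k: int, density: float) -> list[str]:
    """Keep, in read order, the k-mers whose hash falls below a fixed threshold.

    On average a fraction `density` of all k-mers is selected, and the same
    k-mer is always selected (or not) regardless of which read it appears in.
    """
    threshold = int(density * 2**64)
    return [
        read[i:i + k]
        for i in range(len(read) - k + 1)
        if kmer_hash(read[i:i + k]) < threshold
    ]


def min_mers(minimizers: list[str], k_prime: int) -> list[tuple[str, ...]]:
    """Slide a window of k_prime consecutive minimizers over the list."""
    return [
        tuple(minimizers[i:i + k_prime])
        for i in range(len(minimizers) - k_prime + 1)
    ]


# Toy usage: with density 1.0 every k-mer is kept, so the result is easy to check.
toy_minimizers = select_minimizers("ACGTAC", k=3, density=1.0)
toy_min_mers = min_mers(toy_minimizers, k_prime=2)
```

Because each read is represented by far fewer minimizers than nucleotides, a graph built over k′-min-mers is much smaller than a nucleotide-level de Bruijn graph, which is consistent with the speed and memory figures reported in the abstract.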

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1: Overview of the algorithmic steps of metaMDBG.
(A) Overview of the multi-k′ assembly strategy. Processes in blue are performed at the level of nucleotide sequences, while those in green are performed at the level of minimizers only. (B) Components for estimating and refining k′-min-mer abundance as k′ is increased, and for filtering errors prior to graph construction. (C) Illustration of the ‘local progressive abundance filter’ algorithm, which simplifies complex graph regions generated by errors, inter-genomic repeats and strain variability. Each node represents a unitig (unitigs in green and blue belong to two distinct species; unitigs in red represent errors). The long unitig in the top-left part of the graph is chosen as the seed (step C.1). Its abundance (4) is used as the reference to apply a ‘local progressive abundance filter’ from 1× to half its abundance (steps C.2 and C.3). At each step, unitigs with abundance equal to the cutoff value are removed, then the graph is re-compacted to simplify fragmented unitigs. Note that the fragmented green unitigs with abundance 2 would have been removed without the intermediate step C.2.
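The procedure in panel (C) can be sketched on a toy graph as follows. This is a rough illustration only: the data layout, the function names, the `start` parameter, and in particular the rule that a re-compacted unitig takes the length-weighted mean abundance of its parts are assumptions made for the example, not details given in the caption. The sketch removes unitigs with abundance at or below the current cutoff, which matches "equal to the cutoff" when abundances are integers and cutoffs increase one at a time.

```python
def compact(nodes, edges):
    """Merge linear chains u -> v (u has one successor, v has one predecessor)."""
    merged = True
    while merged:
        merged = False
        for u, v in sorted(edges):
            outs = [e for e in edges if e[0] == u]
            ins = [e for e in edges if e[1] == v]
            if u != v and len(outs) == 1 and len(ins) == 1:
                lu, lv = nodes[u]["length"], nodes[v]["length"]
                # Assumption: merged abundance is the length-weighted mean.
                nodes[u]["abundance"] = (
                    nodes[u]["abundance"] * lu + nodes[v]["abundance"] * lv
                ) / (lu + lv)
                nodes[u]["length"] = lu + lv
                del nodes[v]
                # Drop the merged edge; successors of v become successors of u.
                edges = {
                    (u if a == v else a, b) for a, b in edges if (a, b) != (u, v)
                }
                merged = True
                break
    return nodes, edges


def progressive_filter(nodes, edges, seed, start=1):
    """Apply cutoffs start..(seed abundance // 2), re-compacting after each."""
    nodes = {name: dict(attrs) for name, attrs in nodes.items()}  # work on a copy
    edges = set(edges)
    max_cutoff = int(nodes[seed]["abundance"] // 2)
    for cutoff in range(start, max_cutoff + 1):
        drop = {n for n, a in nodes.items() if a["abundance"] <= cutoff}
        nodes = {n: a for n, a in nodes.items() if n not in drop}
        edges = {(a, b) for a, b in edges if a not in drop and b not in drop}
        nodes, edges = compact(nodes, edges)
    return nodes, edges


# Toy graph: seed S, a short low-abundance fragment g1, an error branch e.
toy_nodes = {
    "S": {"abundance": 4, "length": 100},
    "g1": {"abundance": 2, "length": 10},
    "e": {"abundance": 1, "length": 5},
    "G2": {"abundance": 4, "length": 90},
}
toy_edges = {("S", "g1"), ("S", "e"), ("g1", "G2"), ("e", "G2")}

progressive_nodes, _ = progressive_filter(toy_nodes, toy_edges, seed="S")
one_step_nodes, _ = progressive_filter(toy_nodes, toy_edges, seed="S", start=2)
```

In this toy case the progressive run removes the error branch at cutoff 1, re-compacts the remaining path into a single unitig, and therefore keeps the fragment `g1` at cutoff 2. Jumping straight to cutoff 2 removes `g1` as well and leaves `S` and `G2` disconnected, which mirrors the caption's note about the intermediate step C.2.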
Figure 2: Assembly results on three metagenomic projects.
‘Human Gut’ represents the co-assembly of the four human gut samples; ‘Anaerobic Digester’ is the co-assembly of the three AD2 time-series samples. (A) CheckM evaluation. A MAG is ‘near-complete’ if its completeness is ≥ 90% and its contamination is ≤ 5%, ‘high-quality’ if its completeness is ≥ 70% and its contamination is ≤ 10%, and ‘medium quality’ if its completeness is ≥ 50% and its contamination is ≤ 10%. (B) The percentage of HiFi reads mapped to MAGs. (C-D) The distribution of SNV density (%) and coverage depths for near-complete circular contigs generated by each assembler on all datasets (the y-axes have been sqrt-scaled); here the bars are overlaid and not stacked.
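The quality tiers in panel (A) amount to a simple threshold rule, sketched below. The thresholds are taken from the caption; the function name and the "unclassified" label for MAGs that meet none of the tiers are assumptions introduced for the example. Tiers are tested from strictest to loosest so that each MAG receives the best label it qualifies for.

```python
def classify_mag(completeness: float, contamination: float) -> str:
    """Assign a quality tier from CheckM-style completeness and contamination (%)."""
    if completeness >= 90 and contamination <= 5:
        return "near-complete"
    if completeness >= 70 and contamination <= 10:
        return "high-quality"
    if completeness >= 50 and contamination <= 10:
        return "medium quality"
    # Hypothetical label: the caption does not name a tier below medium quality.
    return "unclassified"


# Toy usage with made-up values.
example_labels = [
    classify_mag(95.0, 2.0),
    classify_mag(95.0, 8.0),
    classify_mag(60.0, 10.0),
    classify_mag(40.0, 1.0),
]
```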
Figure 3: Phylogenetic tree of genera recovered from the AD dataset for all assemblers combined.
(A) For the near-complete bacterial MAGs, we generated a de novo phylogenetic tree based on GTDB-Tk marker genes and display it at the genus level. The outer bar charts give the number of MAGs found in each genus. The coloured symbols denote genera recovered by only one of the assemblers. The grayscale heat map denotes the aggregate abundance of dereplicated MAGs in a genus. (B) Number of taxa at different levels that are unique to each assembler.
