Nat Methods. 2020 Nov;17(11):1103-1110.
doi: 10.1038/s41592-020-00971-x. Epub 2020 Oct 5.

metaFlye: scalable long-read metagenome assembly using repeat graphs

Mikhail Kolmogorov et al. Nat Methods. 2020 Nov.

Abstract

Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.

Conflict of interest statement

Competing interests. The authors declare no competing interests.

Figures

Extended Data Fig. 1. Information about metaFlye, Flye, Canu, miniasm, and wtdbg2 assemblies of the individual genomes in the SYNTH64 dataset.
NGA50 (in megabases) and reference coverage (in percent) are reported for all genomes from the SYNTH64 dataset. Genomes are ordered by increasing mean NGA50 across all assemblers. Challenging genomes that have closely related species or strains in the metagenome are marked with (!). Grey bars on the NGA50 plot represent the length of the longest chromosome in the reference sequence for each genome (a theoretical upper bound for NGA50). NGA50 is shown on a logarithmic scale (not shown for values lower than 100 kb or if the reference coverage is below 50%). The full metaQUAST report for the SYNTH64 dataset is provided in Supplementary Table 1.
Extended Data Fig. 2. NGAx plots for the mock community datasets (HMP mock, ZymoEven GridION, ZymoLog GridION).
NGA(x) is the statistic computed for contigs that are broken at their misassembly breakpoints (if any). NGA(x) is the highest possible number L such that all broken contigs that are longer than L cover at least X% of the reference. Plots were generated by metaQUAST using all available references for each dataset. Flye failed to assemble the ZymoLog datasets due to poor k-mer indexing (Methods).
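The NGA(x) definition in this caption can be illustrated with a minimal Python sketch. It assumes the contigs have already been broken at misassembly breakpoints and aligned, so that only the aligned block lengths and the reference length are needed. The function name nga_x and its inputs are hypothetical and are not taken from metaQUAST; the sketch follows the common convention of counting blocks of length at least L.

```python
from typing import Optional, Sequence


def nga_x(block_lengths: Sequence[int], reference_length: int, x: float) -> Optional[int]:
    """Return an NGA(x)-style value for a list of aligned block lengths.

    block_lengths: lengths of contigs after breaking at misassembly
        breakpoints (hypothetical input, assumed to be precomputed).
    reference_length: total length of the reference.
    x: percentage of the reference that must be covered (0 < x <= 100).

    Returns the length of the block at which the cumulative sum of blocks,
    taken from longest to shortest, first reaches x% of the reference,
    or None if the blocks never cover that much.
    """
    if reference_length <= 0 or not 0 < x <= 100:
        raise ValueError("reference_length must be positive and 0 < x <= 100")

    target = reference_length * x / 100.0
    covered = 0
    # Walk the blocks from longest to shortest, accumulating covered length.
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    # The assembly covers less than x% of the reference.
    return None
```

Returning None when the threshold is never reached mirrors the captions' practice of not reporting NGA50 when reference coverage is too low, although the exact cut-offs used in the figures are applied separately.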
Extended Data Fig. 3. Base-pair accuracy analysis for assemblies of the mock community datasets (HMP, ZymoEven GridION, and ZymoLog GridION).
Heatmaps showing the number of mismatches and short indels per 100 kbp for each species reference, computed using metaQUAST. Blue and red colors correspond to the values higher and lower than the median, respectively. Statistics were not computed for genomes with no assembled sequence (“-” symbol). Flye failed to assemble the ZymoLog datasets due to poor k-mer indexing (Methods).
Extended Data Fig. 4. The ORF length distribution and the GC content distribution of the metaFlye and Canu assemblies of the sheep microbiome.
The ORF length distribution suggests similar base-level accuracy for both assemblies.
Extended Data Fig. 5. Taxonomic assignments of sheep microbiome assemblies.
(a) Assignment of metaFlye contigs at the phylum level, visualized with BlobTools. (b) Length distributions of metaFlye and Canu contigs within each assigned superkingdom.
Extended Data Fig. 6. Statistics of simple bubbles for the metaFlye assemblies of the human gut and cow rumen datasets.
(Left) the human gut dataset with 615 bubbles, and (right) the cow rumen dataset with 1510 bubbles. Bubble counts exclude loops, and include roundabouts with two edges.
Extended Data Fig. 7. Analysis of sequence overlap between 19 human gut samples.
Multi-way sequence alignments were computed using SibeliaZ. (left) The proportions of unique and shared sequences in each sample. An assembled segment within a sample is called unique if it has no alignments against sequences from any other sample. Otherwise, the segment is shared. (right) The total amount of sequence for each multiplicity bin. A sequence fragment belongs to multiplicity bin X if it is shared by exactly X samples.
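The unique/shared and multiplicity-bin definitions in this caption can be sketched in Python as follows. This is only an illustration under stated assumptions: each segment is represented as a hypothetical (length, set of sample names) pair that would have to be derived from the multi-way alignment beforehand, and each segment is counted once per bin, which may differ from how the figure totals were computed.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

# Hypothetical representation: (segment length, names of samples containing it).
Segment = Tuple[int, Set[str]]


def multiplicity_bins(segments: Sequence[Segment]) -> Dict[int, int]:
    """Total sequence length per multiplicity bin.

    A segment falls into bin X if it is present in exactly X samples.
    """
    bins: Dict[int, int] = defaultdict(int)
    for length, samples in segments:
        if samples:  # ignore segments not assigned to any sample
            bins[len(samples)] += length
    return dict(bins)


def unique_fraction(segments: Sequence[Segment], sample: str) -> float:
    """Fraction of a sample's sequence that is unique to that sample.

    A segment is unique if the sample is the only one containing it;
    otherwise it is shared.
    """
    unique = 0
    shared = 0
    for length, samples in segments:
        if sample not in samples:
            continue
        if len(samples) == 1:
            unique += length
        else:
            shared += length
    total = unique + shared
    return unique / total if total else 0.0
```

For example, with segments [(100, {"A"}), (50, {"A", "B"})], sample "A" has 100 of its 150 bases unique, while sample "B" has none.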
Figure 1. metaFlye repeat annotation and examples of simple bubbles, superbubbles, and roundabouts.
(a) The subgraph of an assembly graph formed by four distinct genome sub-paths. Repeat and unique edges are shown in color and black, respectively. metaFlye identifies edges X, Y, and Z as repetitive by analyzing the distinct read-paths through the sub-graph. (b) A simple bubble formed by two strains. (c) A superbubble formed by three strains. (d) A roundabout formed by two strains, one of which shares a repeat with a different region of the metagenome.
Figure 2. Information about Canu, Flye, metaFlye, miniasm, and wtdbg2 assemblies of the individual genomes in the SYNTH181 dataset.
Assembled fraction and NGA50 are reported for all 181 reference genomes from the simulated dataset. Genomes are ordered by decreasing mean assembled fraction (left) and NGA50 (right) across the five assemblers. NGA50 is the statistic computed for contigs that are broken at their misassembly breakpoints (if any). NGA50 is the highest possible number L such that all broken contigs that are longer than L cover at least 50% of the reference. NGA50 is not shown for values lower than 10 kbp or if the reference coverage is below 50%. 77 (metaFlye), 141 (Flye), 109 (Canu), 106 (miniasm) and 109 (wtdbg2) NGA50 values were filtered this way. The full metaQUAST report is provided in Supplementary Table 2.
Figure 3. Per-species reference coverage and NGA50 statistics for the mock community datasets (HMP, ZymoEven GridION, ZymoLog GridION) computed using metaQUAST.
The read coverage for each species is given in the brackets after the species name. NGA50 values are not reported for assemblies with reference coverage below 50%. Blue and red colors correspond to the values higher and lower than the median, respectively. Flye failed to assemble the ZymoLog datasets due to poor k-mer indexing (Methods). Extended Data Figure 3 provides the base-pair quality analysis for the same datasets.
Figure 4. Information about strains in the sheep microbiome revealed by metaFlye.
(a) An assembly graph of a single connected component in the sheep microbiome dataset before strain collapsing (visualized using Bandage). The component represents a bacterial genome of the Clostridia class with 92% conserved marker completion (computed using CheckM). There are 20 simple bubbles (shown in green) and 10 superbubbles (shown in yellow) that account for 1.2 Mbp of the 2.4 Mbp genome. (b) Distribution of length and branch sequence identities of 1141 bubbles (excluding loops and including roundabouts with only two edges) in the sheep microbiome assembly. The length is defined as the length of the longest branch in a simple bubble.
