Nat Commun. 2021 Jul 23;12(1):4485.
doi: 10.1038/s41467-021-24515-9.

Strainberry: automated strain separation in low-complexity metagenomes using long reads

Riccardo Vicedomini et al. Nat Commun. 2021.

Abstract

High-throughput short-read metagenomics has enabled large-scale species-level analysis and functional characterization of microbial communities. Microbiomes often contain multiple strains of the same species, and different strains have been shown to have important differences in their functional roles. Recent advances in long-read-based methods have enabled accurate assembly of bacterial genomes from complex microbiomes and offer an as-yet-unrealized opportunity to resolve strains. Here we present Strainberry, a metagenome assembly pipeline that performs strain separation in single-sample low-complexity metagenomes and relies exclusively on long-read data. We benchmarked Strainberry on mock communities, for which it produces strain-resolved assemblies with near-complete reference coverage and 99.9% base accuracy. We also applied Strainberry to real datasets, for which it improved assemblies, generating 20-118% more genomic material than conventional metagenome assemblies on individual strain genomes. We show that Strainberry is also able to refine microbial diversity in a complex microbiome, with complete separation of strain genomes. We anticipate this work to be a starting point for further methodological improvements in strain-resolved metagenome assembly in environments of higher complexity.

Conflict of interest statement

A.E.D. holds equity in and is cofounder and CSO of Longas Technologies Pty Ltd, which is a commercial entity developing synthetic long-read sequencing technologies. The authors declare no additional competing interest.

Figures

Fig. 1. Strainberry pipeline.
The pipeline starts from a strain-oblivious assembly and the corresponding set of reads. It then performs haplotype phasing on the strain-oblivious assembly to separate reads into groups that likely correspond to strains. Each group is assembled separately. A final scaffolding step is used to connect sequences likely corresponding to the same strain. The pipeline performs n − 1 iterations, where n is the maximal number of detected conspecific strains.
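The control flow described in this caption can be sketched roughly as follows. This is only an illustrative outline, not Strainberry's actual code: the function name `strain_separation` and the three callables `phase`, `assemble`, and `scaffold` are hypothetical placeholders for the phasing, per-group assembly, and scaffolding steps, and the loop assumes the iteration count is n − 1 for n detected conspecific strains.

```python
from typing import Callable, Dict, List

Reads = List[str]      # hypothetical: reads as plain strings
Assembly = List[str]   # hypothetical: an assembly as a list of contig strings


def strain_separation(
    assembly: Assembly,
    reads: Reads,
    max_strains: int,
    phase: Callable[[Assembly, Reads], Dict[str, Reads]],
    assemble: Callable[[Reads], Assembly],
    scaffold: Callable[[Assembly, Reads], Assembly],
) -> Assembly:
    """Iterate phase -> per-group assembly -> scaffolding, n - 1 times."""
    current = assembly
    for _ in range(max(0, max_strains - 1)):
        # Haplotype phasing: split reads into groups likely matching strains.
        groups = phase(current, reads)
        # Assemble each read group on its own.
        contigs: Assembly = []
        for group_reads in groups.values():
            contigs.extend(assemble(group_reads))
        # Connect sequences that likely belong to the same strain.
        current = scaffold(contigs, reads)
    return current
```

Passing the three steps in as callables keeps the sketch independent of any particular phasing or assembly tool; with a single detected strain the loop body never runs and the input assembly is returned unchanged.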
Fig. 2. Mock3 dataset assembly statistics.
a Circos graph displaying the coverage and SNV-rich regions (single-nucleotide errors) of the strain-oblivious assemblies (Flye and Canu) and their strain-separated counterparts obtained with Strainberry (ssFlye and ssCanu, respectively) compared to the reference sequences of the two E. coli strains present. The external graduated scales reflect the genomic positions of the corresponding reference genomes. b Average nucleotide identity (ANI) and c duplication ratio of assemblies.
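The caption does not define the duplication ratio. A commonly used definition (the one used by QUAST, taken here as an assumption) is the total number of aligned assembly bases divided by the number of reference bases covered by at least one alignment. The sketch below approximates this from alignment intervals on a single reference; the function name and the half-open interval representation are illustrative choices, and indels inside alignments are ignored.

```python
def duplication_ratio(alignments):
    """Approximate duplication ratio from (ref_start, ref_end) half-open
    intervals of assembly alignments on one reference sequence."""
    # Total aligned bases, counting overlapping alignments multiple times.
    total = sum(end - start for start, end in alignments)

    # Reference bases covered at least once: length of the interval union.
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(alignments):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start

    return total / covered if covered else 0.0
```

Under this definition a value near 1 means each covered reference base is represented about once in the assembly, while larger values indicate that some regions are assembled more than once.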
Fig. 3. Mock9 dataset assembly statistics.
a Circos graph depicting reference coverage and SNV-rich regions of the strain-oblivious assemblies (Flye and Canu) and their strain-separated counterparts obtained with Strainberry (ssFlye and ssCanu, respectively) compared to the reference sequences of K. pneumoniae and the two S. aureus strains. The external graduated scales represent the genomic positions of the corresponding reference genomes. On the right-hand side, b the average nucleotide identity and c the duplication ratio of each assembly are reported. The ssFlye and ssCanu assemblies were obtained as a result of a Strainberry separation of the Flye and Canu assemblies, respectively. d Circos graph depicting reference coverage and SNV-rich regions of the strain-oblivious assemblies (Flye and Canu) and the strain-separated ones (ssFlye and ssCanu) with respect to the reference sequences of S. sonnei and the two E. coli strains. On the right-hand side, e the average nucleotide identity and f the duplication ratio of each assembly are reported.
Fig. 4. Evaluation of strain-separated assemblies with respect to strain coverage, divergence, and number of strains.
Average reference coverages and nucleotide identities of the strain-oblivious Flye assemblies and the Strainberry separated assemblies (ssFlye) on simple mock communities characterized by variable strain coverage (first two columns), divergence (third column), and number of strains (fourth column). The variable coverage communities are downsampled versions of the Mock3 dataset (B. cereus, E. coli strain K-12, and E. coli strain W). We kept the same depth of coverage among the strains (uniform downsampling) or a constant 50× coverage for B. cereus and E. coli strain K-12 while downsampling exclusively E. coli strain W (uneven downsampling). The variable divergence datasets consist of simulated reads from two strains where one is E. coli strain K-12 and the other one is listed on the x-axis (with the divergence percentage shown in parentheses). The datasets with variable number of strains contain 2, 3, 4, and 5 conspecific strains of E. coli with pairwise divergences ranging from 0.7% to 1.4%.
Fig. 5. Assembly-level coverage of L. helveticus NWC_2_3 and L. delbrueckii NWC_2_2.
Comparison between the reference coverage of the Flye and Canu assemblies and their strain-separated counterparts generated by Strainberry (ssFlye and ssCanu, respectively) for the following references and datasets: a L. helveticus NWC_2_3 reference and PacBio dataset; b L. delbrueckii NWC_2_2 reference and PacBio dataset; c L. helveticus NWC_2_3 reference and ONT dataset; d L. delbrueckii NWC_2_2 reference and ONT dataset. Orange regions highlight a higher coverage of the strain-separated assembly (ssFlye or ssCanu), blue regions highlight a higher coverage of the strain-oblivious assembly (Flye or Canu), and gray regions represent the common coverage level shared by both assemblies.
Fig. 6. Assembly size and sequence classification before and after strain separation.
Bins (x-axis) are named at the species level according to the most dominant Kraken2 classification. The value between parentheses represents the average depth of coverage of the bin before the strain separation. Bins highlighted with a bold font have a moderate post-separation completeness (>70%), while those highlighted in red have either poor completeness (<50%) or low read coverage (<30×). Colored bars represent the number of bases classified as a specific species/strain in a bin before and after the strain separation (left and right-hand bars, respectively). Classified sequences whose size accounts for less than 10% of the bin size have been grouped as “others”.
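The labeling and grouping rules in this caption can be written out as a small sketch. The function names are hypothetical, and two points are assumptions rather than statements from the source: the red rule is taken to override the bold rule when both apply, and "bin size" is taken to be the sum of all classified bases in the bin.

```python
def bin_style(completeness_pct, coverage_x):
    """Label style for a bin, from post-separation completeness (%) and
    average read depth (x). Assumption: 'red' takes precedence over 'bold'."""
    if completeness_pct < 50 or coverage_x < 30:
        return "red"
    if completeness_pct > 70:
        return "bold"
    return "plain"


def group_minor_classes(bases_by_taxon, min_fraction=0.10):
    """Merge classifications below min_fraction of the bin size into 'others'.
    Assumption: bin size is the sum of the per-taxon base counts."""
    total = sum(bases_by_taxon.values())
    grouped = {}
    others = 0
    for taxon, bases in bases_by_taxon.items():
        if total and bases / total < min_fraction:
            others += bases
        else:
            grouped[taxon] = bases
    if others:
        grouped["others"] = others
    return grouped
```

For instance, a bin with 900, 60, and 40 bases assigned to three taxa would keep the first taxon and merge the remaining 100 bases into "others".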
