Nat Commun. 2021 Jul 23;12(1):4485.

doi: 10.1038/s41467-021-24515-9.

Strainberry: automated strain separation in low-complexity metagenomes using long reads

Riccardo Vicedomini¹, Christopher Quince²,³,⁴, Aaron E Darling⁵, Rayan Chikhi⁶

Affiliations

¹ Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France. riccardo.vicedomini@pasteur.fr.
² Organisms and Ecosystems, Earlham Institute, Norwich, United Kingdom.
³ Gut Microbes and Health, Quadram Institute, Norwich, United Kingdom.
⁴ Warwick Medical School, University of Warwick, Coventry, United Kingdom.
⁵ The iThree Institute, University of Technology Sydney, Ultimo, NSW, Australia.
⁶ Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France.

PMID: 34301928
PMCID: PMC8302730
DOI: 10.1038/s41467-021-24515-9

Riccardo Vicedomini et al. Nat Commun. 2021.

Abstract

High-throughput short-read metagenomics has enabled large-scale species-level analysis and functional characterization of microbial communities. Microbiomes often contain multiple strains of the same species, and different strains have been shown to have important differences in their functional roles. Recent advances in long-read-based methods have enabled accurate assembly of bacterial genomes from complex microbiomes and offer an as-yet-unrealized opportunity to resolve strains. Here we present Strainberry, a metagenome assembly pipeline that performs strain separation in single-sample low-complexity metagenomes and relies solely on long-read data. We benchmarked Strainberry on mock communities, for which it produces strain-resolved assemblies with near-complete reference coverage and 99.9% base accuracy. We also applied Strainberry to real datasets, for which it improved assemblies, generating 20-118% more genomic material than conventional metagenome assemblies on individual strain genomes. We show that Strainberry can also refine microbial diversity in a complex microbiome, with complete separation of strain genomes. We anticipate that this work will be a starting point for further methodological improvements in strain-resolved metagenome assembly in environments of higher complexity.

Conflict of interest statement

A.E.D. holds equity in and is cofounder and CSO of Longas Technologies Pty Ltd, which is a commercial entity developing synthetic long-read sequencing technologies. The authors declare no additional competing interest.

Figures

**Fig. 1. Strainberry pipeline.**
The pipeline starts from a strain-oblivious assembly and the corresponding set of reads. It then performs haplotype phasing on the strain-oblivious assembly to separate reads into groups that likely correspond to strains. Each group is assembled separately. A final scaffolding step is used to connect sequences likely corresponding to the same strain. The pipeline performs $n - 1$ iterations, where $n$ is the maximal number of detected conspecific strains.
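
The control flow described in this caption can be sketched as a short loop. This is only an illustrative outline, not Strainberry's actual code: the function `separate_strains`, its parameters, and the `toy_*` stand-ins are hypothetical names introduced here, and the three stages are passed in as callables because the caption does not say which tools implement them. The sketch also takes the strain count as an input, whereas the pipeline detects it.

```python
from typing import Callable, Dict, List

Reads = List[str]
Assembly = List[str]  # one string per contig


def separate_strains(
    assembly: Assembly,
    reads: Reads,
    phase: Callable[[Assembly, Reads], Dict[str, Reads]],
    assemble: Callable[[Reads], Assembly],
    scaffold: Callable[[Assembly, Reads], Assembly],
    max_strains: int,
) -> Assembly:
    """Run max_strains - 1 rounds of phasing, per-group assembly, scaffolding."""
    current = assembly
    for _ in range(max_strains - 1):
        # Phasing splits the reads into groups that likely match strains.
        groups = phase(current, reads)
        contigs: Assembly = []
        for name in sorted(groups):
            # Each read group is assembled on its own.
            contigs.extend(assemble(groups[name]))
        # Scaffolding connects sequences that likely share a strain.
        current = scaffold(contigs, reads)
    return current


# Toy stand-ins (hypothetical): the first character of a read is its strain tag.
def toy_phase(assembly: Assembly, reads: Reads) -> Dict[str, Reads]:
    groups: Dict[str, Reads] = {}
    for read in reads:
        groups.setdefault(read[0], []).append(read)
    return groups


def toy_assemble(reads: Reads) -> Assembly:
    return ["".join(read[1:] for read in reads)]


def toy_scaffold(contigs: Assembly, reads: Reads) -> Assembly:
    return contigs


toy_reads = ["aACG", "bACT", "aTTG"]
two_strains = separate_strains(
    ["consensus"], toy_reads, toy_phase, toy_assemble, toy_scaffold, max_strains=2
)
one_strain = separate_strains(
    ["consensus"], toy_reads, toy_phase, toy_assemble, toy_scaffold, max_strains=1
)
```

With a single detected strain the loop body never runs, so the strain-oblivious assembly is returned unchanged, which matches the $n - 1$ iteration count in the caption.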

**Fig. 2. Mock3 dataset assembly statistics.**
a Circos graph displaying the coverage and SNV-rich regions (single-nucleotide errors) of the strain-oblivious assemblies (Flye and Canu) and their strain-separated counterparts obtained with Strainberry (ssFlye and ssCanu, respectively) compared to the reference sequences of the two *E. coli* strains present. The external graduated scales reflect the genomic positions of the corresponding reference genomes. b Average nucleotide identity (ANI) and c duplication ratio of assemblies.
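
For readers unfamiliar with the metrics named in this caption, the sketch below shows one common way to compute reference coverage and duplication ratio from assembly-to-reference alignments. The definitions are an assumption, since the text here does not spell them out: coverage is taken as the fraction of reference positions hit by at least one alignment, and duplication ratio as total aligned bases divided by covered reference bases. The function names and the interval representation are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference


def covered_bases(intervals: List[Interval]) -> int:
    """Count reference positions covered by at least one interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running block: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def reference_coverage(intervals: List[Interval], ref_len: int) -> float:
    """Fraction of the reference covered by at least one alignment."""
    return covered_bases(intervals) / ref_len


def duplication_ratio(intervals: List[Interval]) -> float:
    """Total aligned bases divided by covered reference bases."""
    covered = covered_bases(intervals)
    if covered == 0:
        raise ValueError("no aligned bases")
    aligned = sum(end - start for start, end in intervals)
    return aligned / covered


# Hypothetical alignments: the first 400 bases of a 1000-base reference
# are represented twice in the assembly, the next 400 once.
alignments = [(0, 400), (0, 400), (400, 800)]
coverage = reference_coverage(alignments, ref_len=1000)
duplication = duplication_ratio(alignments)
```

Under these assumed definitions, a duplication ratio close to 1 means each covered reference region appears about once in the assembly.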

**Fig. 3. Mock9 dataset assembly statistics.**
a Circos graph depicting reference coverage and SNV-rich regions of the strain-oblivious assemblies (Flye and Canu) and their strain-separated counterparts obtained with Strainberry (ssFlye and ssCanu, respectively) compared to the reference sequences of *K. pneumoniae* and the two *S. aureus* strains. The external graduated scales represent the genomic positions of the corresponding reference genomes. On the right-hand side, b the average nucleotide identity and c the duplication ratio of each assembly are reported. The ssFlye and ssCanu assemblies were obtained as a result of a Strainberry separation of the Flye and Canu assemblies, respectively. d Circos graph depicting reference coverage and SNV-rich regions of the strain-oblivious assemblies (Flye and Canu) and the strain-separated ones (ssFlye and ssCanu) with respect to the reference sequences of *S. sonnei* and the two *E. coli* strains. On the right-hand side, e the average nucleotide identity and f the duplication ratio of each assembly are reported.

**Fig. 4. Evaluation of strain-separated assemblies with respect to strain coverage, divergence, and number of strains.**
Average reference coverages and nucleotide identities of the strain-oblivious Flye assemblies and the Strainberry separated assemblies (ssFlye) on simple mock communities characterized by variable strain coverage (first two columns), divergence (third column), and number of strains (fourth column). The variable coverage communities are downsampled versions of the Mock3 dataset (*B. cereus*, *E. coli* strain K-12, and *E. coli* strain W). We kept the same depth of coverage among the strains (uniform downsampling) or a constant 50× coverage for *B. cereus* and *E. coli* strain K-12 while downsampling exclusively *E. coli* strain W (uneven downsampling). The variable divergence datasets consist of simulated reads from two strains where one is *E. coli* strain K-12 and the other one is listed on the x-axis (with the divergence percentage shown in parentheses). The datasets with variable number of strains contain 2, 3, 4, and 5 conspecific strains of *E. coli* with pairwise divergences ranging from 0.7% to 1.4%.

**Fig. 5. Assembly-level coverage of *L. helveticus* NWC_2_3 and *L. delbrueckii* NWC_2_2.**
Comparison between the reference coverage of the Flye and Canu assemblies and their strain-separated counterparts generated by Strainberry (ssFlye and ssCanu, respectively) for the following references and datasets: a *L. helveticus* NWC_2_3 reference and PacBio dataset; b *L. delbrueckii* NWC_2_2 reference and PacBio dataset; c *L. helveticus* NWC_2_3 reference and ONT dataset; d *L. delbrueckii* NWC_2_2 reference and ONT dataset. Orange regions highlight a higher coverage of the strain-separated assembly (ssFlye or ssCanu), blue regions highlight a higher coverage of the strain-oblivious assembly (Flye or Canu), and gray regions represent the common coverage level shared by both assemblies.

**Fig. 6. Assembly size and sequence classification before and after strain separation.**
Bins (x-axis) are named at the species level according to the most dominant Kraken2 classification. The value between parentheses represents the average depth of coverage of the bin before the strain separation. Bins highlighted with a bold font have a moderate post-separation completeness (>70%), while those highlighted in red have either poor completeness (<50%) or low read coverage (<30×). Colored bars represent the number of bases classified as a specific species/strain in a bin before and after the strain separation (left and right-hand bars, respectively). Classified sequences whose size accounts for less than 10% of the bin size have been grouped as “others”.
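
The highlighting rule in this caption can be written as a small function. This is a sketch under stated assumptions: the name `bin_highlight` and its return labels are hypothetical, the red condition is assumed to take precedence when a bin meets both criteria, and bins with completeness between 50% and 70% are assumed to carry no highlight, since the caption does not cover either case.

```python
def bin_highlight(completeness_pct: float, coverage_x: float) -> str:
    """Return 'red', 'bold', or 'plain' for one bin.

    completeness_pct: post-separation completeness, in percent.
    coverage_x: average read depth of the bin, in fold coverage.
    """
    # Assumption: poor completeness or low coverage overrides the bold rule.
    if completeness_pct < 50 or coverage_x < 30:
        return "red"
    if completeness_pct > 70:
        return "bold"
    # Assumption: anything else is shown without highlighting.
    return "plain"


labels = [
    bin_highlight(85, 120),
    bin_highlight(40, 120),
    bin_highlight(85, 20),
    bin_highlight(60, 50),
]
```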
