BiosyntheticSPAdes: reconstructing biosynthetic gene clusters from assembly graphs

Dmitry Meleshko¹,², Hosein Mohimani³,⁴, Vittorio Tracanna⁵, Iman Hajirasouliha⁶,⁷, Marnix H Medema⁵, Anton Korobeynikov¹,⁸, Pavel A Pevzner¹,³

Affiliations

¹ Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199004.
² Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York 10021, USA.
³ Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, USA.
⁴ Computational Biology Department, School of Computer Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA.
⁵ Bioinformatics Group, Wageningen University, 6708 PB Wageningen, The Netherlands.
⁶ Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, New York 10021, USA.
⁷ Englander Institute for Precision Medicine, Meyer Cancer Center, Weill Cornell Medicine, New York, New York 10021, USA.
⁸ Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia, 198504.

PMID: 31160374
PMCID: PMC6673720
DOI: 10.1101/gr.243477.118

Dmitry Meleshko et al. Genome Res. 2019 Aug;29(8):1352-1362. doi: 10.1101/gr.243477.118. Epub 2019 Jun 3.

Abstract

Predicting biosynthetic gene clusters (BGCs) is critically important for the discovery of antibiotics and other natural products. While BGC prediction from complete genomes is a well-studied problem, predicting BGCs in fragmented genomic assemblies remains challenging. The existing BGC prediction tools often assume that each BGC is encoded within a single contig in the genome assembly, a condition that is violated for most sequenced microbial genomes, where BGCs are often scattered across several contigs, making it difficult to reconstruct them. The situation is even more severe in shotgun metagenomics, where the contigs are often short and the existing tools fail to predict a large fraction of long BGCs. While it is difficult to assemble BGCs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding long BGCs. We describe biosyntheticSPAdes, a tool for predicting BGCs in assembly graphs, and demonstrate that it greatly improves the reconstruction of BGCs from genomic and metagenomic data sets.

Figures

**Figure 1.**
Subgraph of the assembly graph of *S. coelicolor* corresponding to the CALC NRP BGC. (*Top*) Edges of the assembly graph traversed by the CALC BGC. Nodes of the assembly graph are shown as white circles. After applying exSPAnder, the CALC BGC remains scattered over 10 scaffolds. Three of them are shown as red, blue, and green paths through the assembly graph; the remaining seven consist of a single edge each (shown in black and marked with letters a through g). The positions of eleven A-domains (with their indices) along the CALC BGC are shown by violet boxes. Edges with low and high coverage by reads are shown as thin and thick edges, respectively. The edge harboring three A-domains 4, 5, and 7 has approximately triple coverage by reads as compared to other domain-harboring edges. The 11 A-domains in CALC are split over three NRP synthetases with 6, 3, and 2 A-domains, respectively. (*Middle*) A simplified representation of the graph with all short edges (shorter than 300 bp) contracted into single vertices. The two contracted subgraphs of the assembly graph (formed by short edges) are represented by yellow vertices. The brown dashed path illustrates how the CALC NRP synthetase traverses the contracted assembly graph. (*Bottom*) The bubble restoration procedure described below transforms the collapsed edge harboring three A-domains (A-domains 4, 5, and 7) into three edges, each of them harboring a single A-domain. Applying exSPAnder to the modified assembly graph results in seven scaffolds that differ from scaffolds before bubble restoration (shown as red, blue, green, and orange paths as well as three black edges). Gray squares show the starting and ending positions of the CALC BGC.
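The contraction step in the middle panel can be illustrated with a minimal sketch. It assumes the assembly graph is given as a list of `(u, v, length)` edges and merges the endpoints of every edge shorter than 300 bp with a union-find structure. The function name, the edge representation, and the toy graph are hypothetical and are not taken from biosyntheticSPAdes.

```python
from collections import defaultdict


def contract_short_edges(edges, min_len=300):
    """Contract every edge shorter than min_len into a single vertex.

    edges: list of (u, v, length) tuples (an assumed, simplified representation).
    Returns the remaining long edges, relabelled by contracted vertex, and a
    mapping from each contracted vertex to the original nodes it absorbed.
    """
    parent = {}

    def find(x):
        # Register unseen nodes, then walk to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as representative for reproducible output.
            parent[max(ra, rb)] = min(ra, rb)

    for u, v, length in edges:
        find(u)
        find(v)
        if length < min_len:
            union(u, v)

    long_edges = [(find(u), find(v), length)
                  for u, v, length in edges if length >= min_len]
    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return long_edges, dict(groups)


# Toy graph (made-up lengths): the two short edges b-c and c-d collapse b, c, d.
toy_edges = [("a", "b", 1200), ("b", "c", 120), ("c", "d", 250),
             ("d", "e", 900), ("b", "e", 5000)]
contracted, groups = contract_short_edges(toy_edges)
print(contracted)
print(groups)
```

In this toy graph, nodes `b`, `c`, and `d` end up in one contracted vertex, which plays the role of a yellow vertex in the figure.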

**Figure 2.**
The biosyntheticSPAdes pipeline. Six steps of the biosyntheticSPAdes pipeline: (1) assembling genomic/metagenomic reads with SPAdes/metaSPAdes; (2) searching for edges harboring biosynthetic domains in the assembly graph; (3) extracting biosynthetic gene cluster subgraphs from the assembly graph; (4) restoring the collapsed domains in the BGC-subgraphs; (5) constructing the scaffolding graph; and (6) generating putative BGCs by solving the Rural Postman Problem in the scaffolding graph.

**Figure 3.**
Subgraph of the assembly graph of *Pseudomonas protegens* Pf-5 corresponding to the pyoverdine NRP BGC. (*Top left*) The pyoverdine BGC is scattered over four scaffolds in the SPAdes assembly. Two scaffolds traversing single edges are shown by black color, and two scaffolds traversing multiple edges are shown by red and green colors. The repeat edges traversed by both red and green scaffolds are shown by brown color. Edges with low and high depth of coverage by reads are shown as thin and thick edges, respectively. Some A-domains span multiple edges (starting and ending positions of such domains are shown with dashed lines). (*Top right*) The domain restoration procedure restored two A-domains (5 and 6) in the assembly (SPAdes collapsed these domains into a single edge). Four scaffolds in the assembly graph are shown by red, green, blue, and black colors. (*Bottom*) The scaffolding graph of the pyoverdine BGC with a single rural postman route (dashed edges in this route are shown in blue).

**Figure 4.**
biosyntheticSPAdes assembly of the hectochlorin BGCs (the CYANO data set). (*Top*) The subgraph of the assembly graph corresponding to the hectochlorin BGC. metaSPAdes assembly results in four scaffolds shown by a red path, a green path, and two black edges. The repeat edges traversed by both red and green scaffolds are shown by the brown color. The domain restoration procedure had no effect on this graph. (*Bottom*) The scaffolding graph of the hectochlorin BGC has only one rural postman route that revealed the correct domain order.

**Figure 5.**
The BGC subgraph and the scaffolding graph for the supragingival plaque metagenome (SRS013723) in the HMP data set. (1,2) The BGC subgraph and the scaffolding graph. (3,4) Two rural postman routes in the scaffolding graph. The duplicated C-domain is highlighted with a red border and is traversed twice in the rural postman routes. The numbers labeling the dashed edges indicate their order in the resulting tour. (5) Since biosyntheticSPAdes and antiSMASH use different thresholds and filtering options, antiSMASH identified only five (rather than six) A-domains in the NRP BGC predicted by biosyntheticSPAdes. The three most likely amino acids for each A-domain are shown along with their NRPSpredictor2 (Röttig et al. 2011) scores for the first of two rural postman routes.

**Figure 6.**
Effect of bubble restoration on the reconstruction of the CALC BGC. Schematic representation of repeat collapsing and consensus deterioration in the case of the CALC BGC assembly. While SPAdes outputs a single (and incorrect) consensus sequence of all three collapsed A-domains, these three sequences are not identical. In contrast, biosyntheticSPAdes utilized restored domains and reconstructed their distinct sequences with 100% accuracy (as compared to 99.6% accuracy for SPAdes). Numbers near dashed vertical lines represent the column numbers in the multiple alignment of three A-domains.

**Figure 7.**
The scaffolding graph of the CALC BGC. (*Left*) Five solid edges in the scaffolding graph correspond to five contigs shown in Figure 4 (*bottom*) that contain A-domains. These contigs are shown as a red edge (A-domains 1, 2, 3, 4, and 5), a green edge (A-domain 6), a pink edge (A-domain 7), a blue edge (A-domains 8, 9, and 10), and a black edge (A-domain 11). Eight dashed edges in the scaffolding graph connect solid edges that contain closely located domains in the BGC subgraph. (*Right*) Two rural postman routes in the CALC scaffolding graph. The first tour contains all violet dashed edges and results in the (1, 2, 3, 4, 5, **6, 7**, 8, 9, 10, 11) arrangement of A-domains, while the second tour contains all brown dashed edges and results in the (1, 2, 3, 4, 5, **7, 6**, 8, 9, 10, 11) arrangement of A-domains.
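A toy sketch of how two candidate domain orders can arise from a scaffolding graph follows. It is a deliberate simplification, not the Rural Postman solver used by biosyntheticSPAdes: each contig (solid edge) is visited exactly once in a fixed orientation, routes are open paths rather than closed tours, and candidates are found by brute force over permutations. The contig names follow the colors in the caption, but the set of directed links standing in for dashed edges is hypothetical and was chosen only to reproduce the two arrangements described above.

```python
from itertools import permutations

# A-domains carried by each contig, as listed in the caption.
contig_domains = {
    "red": [1, 2, 3, 4, 5],
    "green": [6],
    "pink": [7],
    "blue": [8, 9, 10],
    "black": [11],
}

# Hypothetical directed links (end of one contig -> start of the next),
# standing in for the dashed edges of the scaffolding graph.
links = {
    ("red", "green"), ("green", "pink"), ("pink", "blue"),   # first route
    ("red", "pink"), ("pink", "green"), ("green", "blue"),   # second route
    ("blue", "black"),                                       # shared by both
}


def enumerate_routes(contigs, links):
    """Return every ordering of contigs whose consecutive pairs are all linked."""
    routes = []
    for order in permutations(contigs):
        if all((a, b) in links for a, b in zip(order, order[1:])):
            routes.append(order)
    return routes


routes = enumerate_routes(list(contig_domains), links)
# Concatenate the A-domains along each route to get the domain arrangement.
arrangements = [[d for contig in route for d in contig_domains[contig]]
                for route in routes]
for arrangement in arrangements:
    print(arrangement)
```

Brute force is only workable here because the toy graph has five contigs; the two resulting arrangements differ solely in the order of A-domains 6 and 7.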
