Nat Chem Biol. 2020 Jan;16(1):60-68.

doi: 10.1038/s41589-019-0400-9. Epub 2019 Nov 25.

A computational framework to explore large-scale biosynthetic diversity

Jorge C Navarro-Muñoz^#¹,², Nelly Selem-Mojica^#³, Michael W Mullowney^#⁴, Satria A Kautsar¹, James H Tryon⁴, Elizabeth I Parkinson⁵,⁶, Emmanuel L C De Los Santos⁷, Marley Yeong¹, Pablo Cruz-Morales³, Sahar Abubucker⁸,⁹, Arne Roeters¹, Wouter Lokhorst¹, Antonio Fernandez-Guerra¹⁰,¹¹,¹², Luciana Teresa Dias Cappelini⁴, Anthony W Goering⁴, Regan J Thomson⁴, William W Metcalf⁵, Neil L Kelleher¹³, Francisco Barona-Gomez¹⁴, Marnix H Medema¹⁵

Affiliations

¹ Bioinformatics Group, Wageningen University, Wageningen, the Netherlands.
² Fungal Natural Products Group, Westerdijk Fungal Biodiversity Institute, Utrecht, the Netherlands.
³ Evolution of Metabolic Diversity Laboratory, Unidad de Genómica Avanzada (Langebio), Cinvestav-IPN, Irapuato, Mexico.
⁴ Department of Chemistry, Northwestern University, Evanston, IL, USA.
⁵ Carl R. Woese Institute for Genomic Biology and Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
⁶ Department of Chemistry, Purdue University, West Lafayette, IN, USA.
⁷ Warwick Integrative Synthetic Biology Centre, University of Warwick, Coventry, UK.
⁸ Novartis Institutes for BioMedical Research, Cambridge, MA, USA.
⁹ Sanofi, Cambridge, MA, USA.
¹⁰ Microbial Genomics and Bioinformatics, Max Planck Institute for Marine Microbiology, Bremen, Germany.
¹¹ Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark.
¹² Center for Marine Environmental Sciences, University of Bremen, Bremen, Germany.
¹³ Department of Chemistry, Northwestern University, Evanston, IL, USA. n-kelleher@northwestern.edu.
¹⁴ Evolution of Metabolic Diversity Laboratory, Unidad de Genómica Avanzada (Langebio), Cinvestav-IPN, Irapuato, Mexico. francisco.barona@cinvestav.mx.
¹⁵ Bioinformatics Group, Wageningen University, Wageningen, the Netherlands. marnix.medema@wur.nl.

^# Contributed equally.

PMID: 31768033
PMCID: PMC6917865
DOI: 10.1038/s41589-019-0400-9

Jorge C Navarro-Muñoz et al. Nat Chem Biol. 2020 Jan.

Abstract

Genome mining has become a key technology to exploit natural product diversity. Although initially performed on a single-genome basis, the process is now being scaled up to mine entire genera, strain collections and microbiomes. However, no bioinformatic framework is currently available for effectively analyzing datasets of this size and complexity. In the present study, a streamlined computational workflow is provided, consisting of two new software tools: the 'biosynthetic gene similarity clustering and prospecting engine' (BiG-SCAPE), which facilitates fast and interactive sequence similarity network analysis of biosynthetic gene clusters and gene cluster families; and the 'core analysis of syntenic orthologues to prioritize natural product gene clusters' (CORASON), which elucidates phylogenetic relationships within and across these families. BiG-SCAPE is validated by correlating its output to metabolomic data across 363 actinobacterial strains and the discovery potential of CORASON is demonstrated by comprehensively mapping biosynthetic diversity across a range of detoxin/rimosamide-related gene cluster families, culminating in the characterization of seven detoxin analogues.

Conflict of interest statement

MHM is on the scientific advisory board of Hexagon Bio and is a co-founder of Design Pharmaceuticals. NLK, WWM and RJT are on the board of directors of MicroMGx.

Figures

**Fig. 1 | The BiG-SCAPE/CORASON workflow.**
a, The BiG-SCAPE approach analyzes a set of antiSMASH-detected BGCs to construct a similarity network and groups them into GCFs together with MIBiG reference BGCs (indicated in blue). b, Subsequently, CORASON-based multi-locus phylogenetic analysis is employed to illuminate evolutionary relationships of BGCs within each GCF.

**Fig. 2 | Main concepts in the BiG-SCAPE algorithm.**
a, Input data consists of BGC sequences directly imported from antiSMASH runs and/or from MIBiG. Nucleotide sequences are translated and represented as strings of Pfam domains. b, The three metrics that are combined in a single distance include the Jaccard Index (JI), which measures the percentage of shared types of domains; the Adjacency Index (AI), which measures the percentage of pairs of adjacent domains; and the Domain Sequence Similarity (DSS), which is a measure of sequence identity between protein domains encoded in BGC sequences. Weights of these indices have been optimized separately for different BGC classes. For simplicity, only four classes are shown. c, In “glocal” mode, BiG-SCAPE starts with the longest common subcluster of genes between a pair of BGCs and attempts to extend the selection of genes for comparison.
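The two set-based metrics in panel b can be illustrated with a minimal Python sketch. It assumes each BGC is already represented as an ordered list of Pfam accessions. It treats adjacent pairs as unordered, takes the DSS as a precomputed number, and combines the three indices as a plain weighted sum. All three choices, along with the weights and the example domain lists, are illustrative assumptions rather than details given in the caption, and the functions are not BiG-SCAPE's own code.

```python
from typing import Sequence, Set, FrozenSet, Tuple


def jaccard_index(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of distinct domain types shared by two BGCs."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def adjacent_pairs(domains: Sequence[str]) -> Set[FrozenSet[str]]:
    """Pairs of neighbouring domains, ignoring order (an assumption)."""
    return {frozenset(pair) for pair in zip(domains, domains[1:])}


def adjacency_index(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of adjacent-domain pairs shared by two BGCs."""
    pairs_a, pairs_b = adjacent_pairs(a), adjacent_pairs(b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def combined_distance(
    ji: float,
    ai: float,
    dss: float,
    weights: Tuple[float, float, float] = (0.2, 0.05, 0.75),  # hypothetical weights
) -> float:
    """Distance as 1 minus a weighted sum of the three similarity indices."""
    w_ji, w_ai, w_dss = weights
    return 1.0 - (w_ji * ji + w_ai * ai + w_dss * dss)


# Two made-up BGCs written as strings of Pfam domains.
bgc_a = ["PF00109", "PF02801", "PF00698", "PF00550"]
bgc_b = ["PF00109", "PF02801", "PF00550", "PF00975"]

ji = jaccard_index(bgc_a, bgc_b)    # 3 shared types out of 5 distinct types
ai = adjacency_index(bgc_a, bgc_b)  # 1 shared adjacent pair out of 5 distinct pairs
dist = combined_distance(ji, ai, dss=0.8)  # dss value is a placeholder
```

Keeping the weights as a parameter mirrors the caption's point that they are tuned separately per BGC class: a caller would pass a different weight tuple for each class.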

**Fig. 3 | Sequence similarity and molecular networks.**
a, Detail of a BiG-SCAPE network containing validated detoxin and rimosamide BGCs, filtered for the presence of the taurine dioxygenase (TauD) domain. BiG-SCAPE gene cluster family classifications include the rimosamide (turquoise shades) and detoxin (orange shades) families, as well as the ‘*Amycolatopsis*/P450’ (violet shades), ‘P450/enoyl’ (pink), and ‘supercluster’ (light green shades) families explored in this study. b, Validated BGCs are represented by bold-outlined nodes. c, The detoxin and rimosamide molecular family, based on tandem MS data of a 363-strain actinomycete library, is colored by BiG-SCAPE family. Known detoxin (squares) and rimosamide (diamonds) nodes have solid bold outlines, while putative detoxins are circular nodes and novel analogs from this study are indicated by bold, dotted outlines. d, Histogram of all ion-GCF correlation scores resulting from the metabologenomics round run with a 0.30 glocal distance cutoff. Known ion-GCF pair correlation scores are overlaid; 6 out of 9 appear in the ‘tail’ of the distribution, which would be indicative of a true connection. The low score for benarthin is due to the complicated fragmentation pattern of its BGCs (Supplementary Figure 3).

**Fig. 4 | CORASON workflow.**
a, Given a query gene in a reference cluster and a custom genome database, CORASON (i) searches for query gene homologues, (ii) creates a Cluster Variation Database (CVD) by filtering out all genomic loci not related to the reference BGC while keeping fragmented clusters, and (iii) identifies the CVD gene core based on multi-directional best hits. b, Then, CORASON infers a phylogenetic tree by curation and concatenation of the CVD gene core and calculates the frequency of occurrence for each gene family from the reference BGC. The tree will reveal clades of BGCs that may correspond to GCFs from BiG-SCAPE, and which may be responsible for the production of different structural analogues of a natural product family. c, With the same reference BGC, if a new query gene is selected from accessory enzymes instead of the current CVD core, CORASON will visualize a new phylogeny. This tree may contain clades that correspond to GCFs with diverse biosynthetic cores (of scaffold biosynthesis enzymes) that encode the same molecular modifications in different contexts.
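The frequency-of-occurrence step in panel b can be sketched as follows. The sketch assumes every gene in the CVD has already been assigned a gene-family label. That is a simplification: the caption says CORASON derives the core from multi-directional best hits on sequences, which is not reproduced here. The function names, strain names and all gene labels other than `tauD` are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def family_frequencies(
    reference: Set[str], clusters: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Fraction of clusters that contain each gene family of the reference BGC."""
    if not clusters:
        return {}
    counts: Counter = Counter()
    for families in clusters.values():
        # Only count families that also occur in the reference cluster.
        counts.update(families & reference)
    total = len(clusters)
    return {family: counts[family] / total for family in sorted(reference)}


def core_families(freqs: Dict[str, float], threshold: float = 1.0) -> Set[str]:
    """Families present in at least `threshold` of all clusters (default: all)."""
    return {family for family, freq in freqs.items() if freq >= threshold}


# Made-up reference BGC and cluster variation database.
reference_bgc = {"tauD", "nrps", "p450"}
cvd = {
    "strain_1": {"tauD", "nrps", "p450"},
    "strain_2": {"tauD", "nrps"},
    "strain_3": {"tauD", "nrps", "enoyl"},
    "strain_4": {"tauD", "p450"},
}

freqs = family_frequencies(reference_bgc, cvd)
core = core_families(freqs)
```

In this toy database only `tauD` occurs in every cluster, so it alone forms the core. Lowering `threshold` would admit near-universal families, which may be useful when fragmented clusters are kept in the CVD.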

**Fig. 5 | CORASON phylogeny of detoxin/rimosamide-related BGCs.**
CORASON phylogenetic reconstruction with *tauD* as query gene and the *Streptomyces* sp. NRRL B-1347 BGC as query cluster, rooted with a *tauD* from *Streptomyces* sp. NC1. Branches of redundant and highly divergent BGCs were compressed for readability (see uncompressed tree in Supplementary Figure 9). Strain names are followed by their GenBank accession number when available. Genes not found in the reference cluster are colored based on BLAST analysis. Highlighted sections on the tree correspond to BiG-SCAPE-defined families. Bolded strain/BGC names are those investigated in this study, with dotted lines indicating BGCs and detoxins discovered just outside the BiG-SCAPE-defined families. The representative structures for each clade illustrate the correspondence between molecular and genomic variations.
