Nat Chem Biol. 2020 Jan;16(1):60-68.
doi: 10.1038/s41589-019-0400-9. Epub 2019 Nov 25.

A computational framework to explore large-scale biosynthetic diversity

Jorge C Navarro-Muñoz et al. Nat Chem Biol. 2020 Jan.

Abstract

Genome mining has become a key technology to exploit natural product diversity. Although initially performed on a single-genome basis, the process is now being scaled up to mine entire genera, strain collections and microbiomes. However, no bioinformatic framework is currently available for effectively analyzing datasets of this size and complexity. In the present study, a streamlined computational workflow is provided, consisting of two new software tools: the 'biosynthetic gene similarity clustering and prospecting engine' (BiG-SCAPE), which facilitates fast and interactive sequence similarity network analysis of biosynthetic gene clusters and gene cluster families; and the 'core analysis of syntenic orthologues to prioritize natural product gene clusters' (CORASON), which elucidates phylogenetic relationships within and across these families. BiG-SCAPE is validated by correlating its output to metabolomic data across 363 actinobacterial strains and the discovery potential of CORASON is demonstrated by comprehensively mapping biosynthetic diversity across a range of detoxin/rimosamide-related gene cluster families, culminating in the characterization of seven detoxin analogues.

Conflict of interest statement

MHM is on the scientific advisory board of Hexagon Bio and co-founder of Design Pharmaceuticals. NLK, WWM and RJT are on the board of directors of MicroMGx.

Figures

Fig. 1 ∣ The BiG-SCAPE/CORASON workflow.
a, The BiG-SCAPE approach analyzes a set of antiSMASH-detected BGCs to construct a similarity network and groups them into GCFs together with MIBiG reference BGCs (indicated in blue). b, Subsequently, CORASON-based multi-locus phylogenetic analysis is employed to illuminate evolutionary relationships of BGCs within each GCF.
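The grouping step in panel a can be illustrated with a minimal sketch. It assumes pairwise BGC distances are already available, keeps an edge whenever the distance is at or below a cutoff, and treats each connected component of the resulting network as one family. Connected components are a simple stand-in chosen for illustration; the caption does not specify the clustering procedure, and the function name, the BGC labels and the distances are hypothetical. The default cutoff of 0.30 is borrowed from the value quoted in the Fig. 3 caption.

```python
from collections import defaultdict


def group_into_families(distances, cutoff=0.30):
    """Group BGCs into families from pairwise distances.

    distances: dict mapping (bgc_a, bgc_b) -> distance in [0, 1].
    Assumption: a family is a connected component of the network that
    keeps only edges with distance <= cutoff (illustrative stand-in).
    """
    neighbours = defaultdict(set)
    nodes = set()
    for (a, b), dist in distances.items():
        nodes.update((a, b))
        if dist <= cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    families = []
    seen = set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        families.append(sorted(component))
    return families


# Hypothetical distances between four BGCs.
example_distances = {
    ("bgc1", "bgc2"): 0.10,
    ("bgc2", "bgc3"): 0.25,
    ("bgc3", "bgc4"): 0.80,
    ("bgc1", "bgc4"): 0.90,
}
families = group_into_families(example_distances)
print(families)
```

With these made-up distances, bgc1, bgc2 and bgc3 are linked through edges below the cutoff and form one family, while bgc4 remains a singleton.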
Fig. 2 ∣ Main concepts in the BiG-SCAPE algorithm.
a, Input data consists of BGC sequences directly imported from antiSMASH runs and/or from MIBiG. Nucleotide sequences are translated and represented as strings of Pfam domains. b, The three metrics that are combined in a single distance include the Jaccard Index (JI), which measures the percentage of shared types of domains; the Adjacency Index (AI), which measures the percentage of pairs of adjacent domains; and the Domain Sequence Similarity (DSS), which is a measure of sequence identity between protein domains encoded in BGC sequences. Weights of these indices have been optimized separately for different BGC classes. For simplicity, only four classes are shown. c, In “glocal” mode, BiG-SCAPE starts with the longest common subcluster of genes between a pair of BGCs and attempts to extend the selection of genes for comparison.
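The indices in panel b can be sketched on BGCs represented as ordered lists of Pfam domain names. This is a minimal illustration under stated assumptions, not the BiG-SCAPE implementation: the Adjacency Index is computed here as a Jaccard ratio over unordered pairs of neighbouring domains, the DSS value is passed in as a precomputed number because it depends on sequence alignment, and the weights are hypothetical placeholders rather than the optimized class-specific weights mentioned in the caption.

```python
def jaccard_index(domains_a, domains_b):
    """Fraction of distinct domain types shared by two BGCs."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def adjacent_pairs(domains):
    """Unordered pairs of neighbouring domains in a domain string."""
    return {frozenset(pair) for pair in zip(domains, domains[1:])}


def adjacency_index(domains_a, domains_b):
    """Fraction of adjacent-domain pairs shared by two BGCs (assumed form)."""
    pairs_a, pairs_b = adjacent_pairs(domains_a), adjacent_pairs(domains_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def combined_distance(ji, ai, dss, weights=(0.4, 0.2, 0.4)):
    """Weighted combination of the three similarities, turned into a distance.

    The weights are hypothetical placeholders, not published values.
    """
    w_ji, w_ai, w_dss = weights
    return 1.0 - (w_ji * ji + w_ai * ai + w_dss * dss)


# Two illustrative domain strings.
bgc_a = ["ketoacyl-synt", "Acyl_transf_1", "KR", "PP-binding"]
bgc_b = ["ketoacyl-synt", "Acyl_transf_1", "PP-binding", "Thioesterase"]

ji = jaccard_index(bgc_a, bgc_b)
ai = adjacency_index(bgc_a, bgc_b)
dss = 0.5  # assumed precomputed domain sequence similarity
print(ji, ai, combined_distance(ji, ai, dss))
```

In this example the two BGCs share three of five domain types but only one of five adjacent pairs, so the adjacency term penalizes the rearranged domain order even though domain content is largely the same.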
Fig. 3 ∣ Sequence similarity and molecular networks.
a, Detail of a BiG-SCAPE network containing validated detoxin and rimosamide BGCs, filtered for the presence of the taurine dioxygenase (TauD) domain. BiG-SCAPE gene cluster family classifications include the rimosamide (turquoise shades) and detoxin (orange shades) families, as well as the ‘Amycolatopsis/P450’ (violet shades), ‘P450/enoyl’ (pink), and ‘supercluster’ (light green shades) families explored in this study. b, Validated BGCs represented by bold-outlined nodes. c, The detoxin and rimosamide molecular family based on tandem MS data of a 363-strain actinomycete library is colored by BiG-SCAPE family. Known detoxin (squares) and rimosamide (diamonds) nodes have solid bold outlines while putative detoxins are circular nodes and novel analogs from this study are indicated by bold, dotted outlines. d, Histogram of all ion-GCF correlation scores resulting from the metabologenomics round run with 0.30 glocal distance cutoff. Known ion-GCF pair correlation scores are overlaid; 6 out of 9 appear in the ‘tail’ of the distribution, which would be indicative of a true connection. The low scoring for benarthin is due to the complicated fragmentation pattern of its BGCs (Supplementary Figure 3).
Fig. 4 ∣ CORASON Workflow.
a, Given a query gene in a reference cluster and a custom genome database, CORASON i) searches for query gene homologues, ii) creates a Cluster Variation Database (CVD) by filtering out all genomic loci not related to the reference BGC while keeping fragmented clusters, and iii) identifies the CVD gene core based on multi-directional best hits. b, Then, CORASON infers a phylogenetic tree by curation and concatenation of the CVD gene core and calculates the frequency of occurrence for each gene family from the reference BGC. The tree will reveal clades of BGCs that may correspond to GCFs from BiG-SCAPE, and which may be responsible for the production of different structural analogues of a natural product family. c, With the same reference BGC, if a new query gene is selected from accessory enzymes instead of the current CVD core, CORASON will visualize a new phylogeny. This tree may contain clades that correspond to GCFs with diverse biosynthetic cores (of scaffold biosynthesis enzymes) that encode the same molecular modifications in different contexts.
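The frequency-of-occurrence calculation in panel b, and the idea of a gene core shared by every cluster, can be sketched as follows. The sketch assumes each cluster has already been reduced to a list of gene-family labels; in CORASON that assignment comes from best-hit searches, which are not reproduced here. The function names, cluster names and gene labels other than tauD are hypothetical.

```python
from collections import Counter


def gene_family_frequencies(reference, clusters):
    """Fraction of clusters containing each gene family of the reference BGC.

    reference: list of gene-family labels in the reference cluster.
    clusters: dict mapping cluster name -> list of gene-family labels.
    """
    if not clusters:
        return {family: 0.0 for family in reference}
    reference_set = set(reference)
    counts = Counter()
    for families in clusters.values():
        # Count each family at most once per cluster.
        counts.update(set(families) & reference_set)
    total = len(clusters)
    return {family: counts[family] / total for family in reference}


def core_families(reference, clusters):
    """Reference gene families present in every cluster (simplified core)."""
    frequencies = gene_family_frequencies(reference, clusters)
    return [family for family in reference if frequencies[family] == 1.0]


# Hypothetical reference cluster and cluster variation database.
reference_bgc = ["tauD", "nrps", "p450"]
cvd = {
    "cluster_A": ["tauD", "nrps", "p450"],
    "cluster_B": ["tauD", "nrps"],
    "cluster_C": ["tauD", "nrps", "transporter"],
    "cluster_D": ["tauD", "nrps", "p450"],
}
print(gene_family_frequencies(reference_bgc, cvd))
print(core_families(reference_bgc, cvd))
```

Here tauD and nrps occur in all four clusters and form the simplified core, while p450 occurs in half of them and would count as an accessory gene family.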
Fig. 5 ∣ CORASON phylogeny of detoxin/rimosamide-related BGCs.
CORASON phylogenetic reconstruction with tauD as query gene and the Streptomyces sp. NRRL B-1347 BGC as query cluster and rooted with a tauD from Streptomyces sp. NC1. Branches of redundant and highly divergent BGCs were compressed for readability (see uncompressed tree in Supplementary Figure 9). Strain names are followed by their Genbank accession number when available. Genes not found in the reference cluster are colored based on BLAST analysis. Highlighted sections on the tree correspond to BiG-SCAPE-defined families. Bolded strain/BGC names were those investigated in this study with dotted lines indicating BGCs and detoxins discovered just outside the BiG-SCAPE-defined families. The representative structures for each clade illustrate the correspondence between molecular and genomic variations.
