PLoS Comput Biol. 2021 May 4;17(5):e1008920.

doi: 10.1371/journal.pcbi.1008920. eCollection 2021 May.

Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions

Grímur Hjörleifsson Eldjárn¹, Andrew Ramsay¹, Justin J J van der Hooft², Katherine R Duncan³, Sylvia Soldatou⁴, Juho Rousu⁵, Rónán Daly⁶, Joe Wandy⁶, Simon Rogers¹

Affiliations

¹ School of Computing Science, University of Glasgow, Glasgow, United Kingdom.
² Bioinformatics Group, Wageningen University, Wageningen, The Netherlands.
³ Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow, United Kingdom.
⁴ School of Pharmacy and Life Sciences, Robert Gordon University, Aberdeen, United Kingdom.
⁵ Department of Computer Science, Aalto University, Espoo, Finland.
⁶ Glasgow Polyomics, University of Glasgow, Glasgow, United Kingdom.

PMID: 33945539
PMCID: PMC8130963
DOI: 10.1371/journal.pcbi.1008920

Grímur Hjörleifsson Eldjárn et al. PLoS Comput Biol. 2021.

Abstract

Specialised metabolites from microbial sources are well-known for their wide range of biomedical applications, particularly as antibiotics. When mining paired genomic and metabolomic data sets for novel specialised metabolites, establishing links between Biosynthetic Gene Clusters (BGCs) and metabolites represents a promising way of finding such novel chemistry. However, due to the lack of detailed biosynthetic knowledge for the majority of predicted BGCs, and the large number of possible combinations, this is not a simple task. This problem is becoming ever more pressing with the increased availability of paired omics data sets. Current tools are not effective at identifying valid links automatically, and manual verification is a considerable bottleneck in natural product research. We demonstrate that using multiple link-scoring functions together makes it easier to prioritise true links relative to others. Based on standardising a commonly used score, we introduce a new, more effective score, and introduce a novel score using an Input-Output Kernel Regression approach. Finally, we present NPLinker, a software framework to link genomic and metabolomic data. Results are verified using publicly available data sets that include validated links.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Diagram showing the relationship between the various metabolomic and genomic objects.**
On the genomics side, BGCs are detected from microbial genomes, colour-coded by strain. These are clustered into GCFs, where each GCF contains BGCs from one or more strains. GCFs can thus also be considered as sets of strains, where each strain contributes at least one BGC to the GCF. On the metabolomics side, MS2 spectra measured in microbial cultures are grouped across strains, so that identical spectra are assigned one or more strains in which they appear. These are further grouped into MFs in a process called Molecular Networking, where each MF consists of one or more related spectra. Both spectra and MFs can likewise be considered as sets of strains where the spectrum, or a spectrum in the MF, is present in the sample for the strain. Feature-based approaches can be used to link BGCs to individual spectra, while correlation-based approaches can be used to link GCFs to either MFs or spectra, based on the pattern of strain contents.

**Fig 2. The effect of size on strain correlation scoring.**
*(A)* Size discrepancy in the strain correlation score for GCFs of varying sizes. Each box represents a strain, with filled boxes denoting that the strain is a member of the GCF or MF, and blank boxes that it is not. The top GCF-MF pair outscores the bottom pair by 30 to 26, despite the bottom pair having arguably stronger correspondence. *(B)* Expected value and variance of the strain correlation score for a population of 100 strains, as a function of GCF and MF sizes. Both the expected value and the variance have a considerable range, rendering comparison between links involving different sizes of GCFs and MFs difficult. For instance, a GCF and MF of size 80 could easily get a score of 500 or higher by chance, while for a GCF and MF of size 20, a score this high would be highly significant.
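
The size effect described in this caption can be reproduced with a short sketch. The caption does not state the score weights, so the values in `WEIGHTS` are assumptions chosen for illustration, as are all function names. The null model is also an assumption: the overlap between a GCF and an MF of fixed sizes is taken to be hypergeometric, which makes the raw score linear in the overlap and gives closed-form moments.

```python
from math import sqrt

# Assumed weights (not stated in the caption): reward shared strains,
# penalise strains that have the MF but lack the GCF.
WEIGHTS = {"both": 10, "gcf_only": 0, "mf_only": -10, "neither": 1}


def raw_score(gcf_strains, mf_strains, all_strains, w=WEIGHTS):
    """Raw strain correlation score from the strain sets of a GCF and an MF.

    Both strain sets are assumed to be subsets of all_strains.
    """
    gcf, mf, universe = set(gcf_strains), set(mf_strains), set(all_strains)
    both = len(gcf & mf)
    gcf_only = len(gcf - mf)
    mf_only = len(mf - gcf)
    neither = len(universe - gcf - mf)
    return (w["both"] * both + w["gcf_only"] * gcf_only
            + w["mf_only"] * mf_only + w["neither"] * neither)


def null_moments(n, g, m, w=WEIGHTS):
    """Mean and variance of the raw score for a hypergeometric overlap."""
    # The score is linear in the overlap k: score = a + b * k.
    a = w["gcf_only"] * g + w["mf_only"] * m + w["neither"] * (n - g - m)
    b = w["both"] - w["gcf_only"] - w["mf_only"] + w["neither"]
    mean_k = g * m / n
    var_k = g * m * (n - g) * (n - m) / (n * n * (n - 1)) if n > 1 else 0.0
    return a + b * mean_k, b * b * var_k


def standardised_score(gcf_strains, mf_strains, all_strains, w=WEIGHTS):
    """Raw score minus its null mean, divided by its null standard deviation."""
    n = len(set(all_strains))
    g, m = len(set(gcf_strains)), len(set(mf_strains))
    mean, var = null_moments(n, g, m, w)
    if var == 0:
        return 0.0  # degenerate sizes (empty or full sets) carry no information
    return (raw_score(gcf_strains, mf_strains, all_strains, w) - mean) / sqrt(var)
```

Under these assumed weights, a GCF and an MF of size 80 that overlap in 64 of 100 strains (the chance level) get a raw score of 484 and a standardised score of 0. A perfectly matching pair of size 20 gets a lower raw score of 280 and a standardised score close to 10.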

**Fig 3. Arrow diagram of the Input-Output Kernel Regression (IOKR) framework.**
$X$ denotes the space of MS2 spectra, $Y$ is the space of metabolites, and $F$ is the shared space of molecular fingerprints. $\hat{h}$ is the (learned) mapping from MS2 spectra to fingerprints, while $\phi$ is the (exact) mapping from metabolites to molecular fingerprints.

**Fig 4. Diagram of the NPLinker module.**
The NPLinker module helps with automatically linking GCFs and MFs. It integrates metabolomic and genomic data sets, using either external sources, user-provided data, or a mixture of both, and ranks potential links between metabolomic and genomic objects by given scoring functions, either built-in or user-defined.

**Fig 5. Distribution of validated links among scores.**
Distribution of the raw and standardised strain correlation scores, as well as the distribution of the scores for validated links (in black) relative to the distribution of scores for all links, in the Crüsemann data set. The standardised score has a more pronounced tail at the top end, which includes 13 out of 15 validated links, whereas many of the validated links score relatively low on the distribution of the raw scores. Figures for other data sets can be found in S1 Fig.

**Fig 6. Correlation of IOKR- and strain correlation scores.**
IOKR- and strain correlation scores for all potential links in the Crüsemann data set, with histograms of the scores. Validated links are indicated in red on the joint plot, and with black lines on the distribution histograms. Validated links are concentrated in the upper-right quadrant, i.e. score relatively high on both axes. Figures for the two further data sets can be found in S3 Fig.

**Fig 7. Scores starting from a particular GCF.**
Position of the score for the validated GCF-MF pair (red) within the distribution of the scores of the links between that particular GCF and all MFs, for a selection of validated links in the Crüsemann data set (rows). The first three columns show histograms of the raw and standardised versions of the strain correlation score, as well as the IOKR score, for all links including a given GCF, with the score of the correct link indicated. The last column shows the standardised correlation score (x-axis) and IOKR score (y-axis) for the same links, again with the correct link indicated. Both the IOKR and the standardised correlation scores tend to place validated links higher in the distribution of scores for the GCF under consideration than the raw correlation score does. Furthermore, some of the validated links score relatively higher on IOKR than on the standardised strain correlation score, and vice versa, suggesting that the two scores complement one another. For full results, as well as for other data sets, please refer to S4 Fig.

**Fig 8. Combining scores.**
The set of points $(x, y)$ such that $\ell_p(x, y) = 1$, for three different values of $p$. This shows the form of the iso-lines of scores using the $\ell_p$ function for different values of $p$ to combine the scores.
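
The caption does not define the $\ell_p$ function. A minimal sketch, assuming it is the standard $p$-norm form $\ell_p(x, y) = (x^p + y^p)^{1/p}$ applied to two scores that have already been rescaled to be non-negative, could look as follows. The function name `lp_combine` is hypothetical.

```python
from math import inf


def lp_combine(x, y, p):
    """Combine two non-negative scores as (x**p + y**p) ** (1 / p)."""
    if x < 0 or y < 0:
        raise ValueError("scores are assumed to be rescaled to be non-negative")
    if p == inf:
        return max(x, y)  # limiting case: the larger score dominates
    if p <= 0:
        raise ValueError("p must be positive")
    return (x ** p + y ** p) ** (1.0 / p)
```

With `p = 1` the iso-lines are straight lines, so a low score on one axis can be fully offset by a high score on the other. As `p` grows, the combined value approaches the larger of the two scores.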

Cited by

Enhanced correlation-based linking of biosynthetic gene clusters to their metabolic products through chemical class matching.
Louwen JJR, Medema MH, van der Hooft JJJ. Microbiome. 2023 Jan 23;11(1):13. doi: 10.1186/s40168-022-01444-3. PMID: 36691088. Free PMC article.
Primed for Discovery.
Walker AS, Clardy J. Biochemistry. 2024 Nov 5;63(21):2705-2713. doi: 10.1021/acs.biochem.4c00464. Epub 2024 Oct 15. PMID: 39497571. Free PMC article. Review.
antiSMASH 8.0: extended gene cluster detection capabilities and analyses of chemistry, enzymology, and regulation.
Blin K, Shaw S, Vader L, Szenei J, Reitz ZL, Augustijn HE, Cediel-Becerra JDD, de Crécy-Lagard V, Koetsier RA, Williams SE, Cruz-Morales P, Wongwas S, Segurado Luchsinger AE, Biermann F, Korenskaia A, Zdouc MM, Meijer D, Terlouw BR, van der Hooft JJJ, Ziemert N, Helfrich EJN, Masschelein J, Corre C, Chevrette MG, van Wezel GP, Medema MH, Weber T. Nucleic Acids Res. 2025 Jul 7;53(W1):W32-W38. doi: 10.1093/nar/gkaf334. PMID: 40276974. Free PMC article.
Correlative metabologenomics of 110 fungi reveals metabolite-gene cluster pairs.
Caesar LK, Butun FA, Robey MT, Ayon NJ, Gupta R, Dainko D, Bok JW, Nickles G, Stankey RJ, Johnson D, Mead D, Cank KB, Earp CE, Raja HA, Oberlies NH, Keller NP, Kelleher NL. Nat Chem Biol. 2023 Jul;19(7):846-854. doi: 10.1038/s41589-023-01276-8. Epub 2023 Mar 6. PMID: 36879060. Free PMC article.
Mining genomes to illuminate the specialized chemistry of life.
Medema MH, de Rond T, Moore BS. Nat Rev Genet. 2021 Sep;22(9):553-571. doi: 10.1038/s41576-021-00363-7. Epub 2021 Jun 3. PMID: 34083778. Free PMC article. Review.

Grants and funding

BB/R022054/1/BB_/Biotechnology and Biological Sciences Research Council/United Kingdom
