Fast metabolite identification with Input Output Kernel Regression

Céline Brouard¹, Huibin Shen¹, Kai Dührkop², Florence d'Alché-Buc³, Sebastian Böcker², Juho Rousu¹

Affiliations

¹ Department of Computer Science, Aalto University, Espoo, Finland; Helsinki Institute for Information Technology, Espoo, Finland.
² Chair for Bioinformatics, Friedrich-Schiller University, Jena, Germany.
³ LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, Paris, France.

PMID: 27307628
PMCID: PMC4908330
DOI: 10.1093/bioinformatics/btw246

Bioinformatics. 2016 Jun 15;32(12):i28-i36.

Abstract

Motivation: An important problem in metabolomics is the identification of metabolites from tandem mass spectrometry data. Machine learning methods have recently been proposed to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to vector output spaces and can handle structured output spaces such as the molecule space.

Results: We use the Input Output Kernel Regression (IOKR) method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode similarities in the input (spectra) space and in the output (molecule) space using two kernel functions. The method approximates the spectra-molecule mapping in two phases. The first phase is a regression problem from the input space to the feature space associated with the output kernel. The second phase is a preimage problem, which consists of mapping the predicted output feature vectors back to the molecule space. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method decreases the running times of the training and test steps by several orders of magnitude compared with preceding methods.
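The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes kernel ridge regression for the first phase, a Gaussian input kernel on made-up feature vectors standing in for spectra, and a cosine (normalized linear) output kernel on toy binary fingerprints. The function names, the regularization value `lam` and all data are hypothetical. Because the output kernel is normalized, the candidate closest to the predicted output feature vector is the one with the largest inner product, so the preimage step reduces to ranking candidates by that score.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def cosine_kernel(A, B):
    """Linear kernel on unit-normalized rows, so that k(y, y) = 1 for every y."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def iokr_rank(K_x, k_x_test, K_y_cand, lam=1e-3):
    """Rank the candidates of one test input (hypothetical helper).

    K_x:       (n, n) input kernel matrix on the training inputs
    k_x_test:  (n,)   input kernel values between training inputs and the test input
    K_y_cand:  (m, n) output kernel values between candidates and training outputs
    """
    n = K_x.shape[0]
    # Phase 1 (regression): the predicted output feature vector is a weighted
    # sum of the training output feature vectors, with ridge-regression weights.
    coeffs = np.linalg.solve(K_x + lam * np.eye(n), k_x_test)
    # Phase 2 (preimage): score each candidate by its inner product with the
    # predicted feature vector, evaluated through the output kernel only.
    scores = K_y_cand @ coeffs
    return np.argsort(-scores), scores

# Toy data (made up): 20 well-separated "spectra" in 5 dimensions.
rng = np.random.default_rng(0)
n, d = 20, 5
X_train = 2.0 * np.arange(n)[:, None] + rng.uniform(-0.1, 0.1, size=(n, d))
# Toy fingerprints: a constant bit followed by the 5-bit binary code of i + 1.
Y_train = np.array(
    [[1] + [((i + 1) >> b) & 1 for b in range(5)] for i in range(n)], dtype=float
)

# A test input close to training input 7, with four candidate fingerprints.
x_test = X_train[7:8] + 0.05
candidates = Y_train[[2, 7, 11, 15]]

K_x = gaussian_kernel(X_train, X_train)
k_x_test = gaussian_kernel(X_train, x_test)[:, 0]
K_y_cand = cosine_kernel(candidates, Y_train)

ranking, scores = iokr_rank(K_x, k_x_test, K_y_cand)
print(ranking)  # candidate indices, best first
```

In this toy setup the candidate at index 1, which is the fingerprint of the training example nearest to the test input, is ranked first. Note that only kernel values are needed to score candidates; no explicit feature vectors are computed, and training amounts to solving one linear system.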

Contact: celine.brouard@aalto.fi

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Overview of the IOKR framework for solving the metabolite identification problem. The mapping f between MS/MS spectra and 2D molecular structures is learnt by approximating the output feature map $ϕ_{y}$ with a function h and solving a preimage problem

**Fig. 2.**
An example of MS/MS spectrum and its fragmentation tree. Each node of the fragmentation tree corresponds to a peak and is labeled by the molecular formula of the corresponding fragment. The root of the tree is labeled with the molecular formula of the unfragmented molecule. Edges represent the losses. Two nodes and one edge are colored to show the correspondence between the MS/MS spectrum and the fragmentation tree

**Fig. 3.**
Difference, in percentage points, relative to the percentage of metabolites ranked lower than k by CSI:FingerID with the modified Platt scoring function

**Fig. 4.**
Heatmap of the percentage of correctly identified metabolites (Top 1) with IOKR. The rows correspond to the different output kernels built on fingerprints (linear, polynomial and Gaussian) and the columns to the 24 input kernels derived from spectra and fragmentation trees, as well as the two multiple kernel combination schemes ALIGNF and UNIMKL

**Fig. 5.**
Heatmap of kernel weights learned by ALIGNF for all pairs of input and output kernels on the GNPS dataset. The weights have been averaged over the 10 CV folds

**Fig. 6.**
Metabolites identified with IOKR as a function of the size of the candidate sets. We considered the candidate sets of size smaller than 8000, which correspond to 98.8% of the sets, and divided them into 30 bins according to their sizes. **(a)** indicates the number of test metabolites whose candidate set size falls in the corresponding size bin. **(b)** shows the percentage of metabolites that are ranked in the top 1 position, or in the top 10 or above, for the test metabolites falling in each size bin

**Fig. 7.**
Scatter plot of classes in the ChEBI ontology with shortest paths of length 7 from the class "chemical entity". The x-axis corresponds to the median number of candidates associated with the compounds in each class and the y-axis to the proportion of correct compounds with rank less than or equal to 10 for each class. The size of each point is proportional to the number of compounds in the GNPS dataset that belong to that class, and we only show classes with at least 10 compounds. The classes we can identify well are shown in red and the classes we cannot are shown in blue, with ChEBI id and name next to them
