Bioinformatics. 2014 Oct 15;30(20):2941-8.

doi: 10.1093/bioinformatics/btu430. Epub 2014 Jul 7.

Improving peak detection in high-resolution LC/MS metabolomics data using preexisting knowledge and machine learning approach

Tianwei Yu¹, Dean P Jones¹

Affiliation

¹ Department of Biostatistics and Bioinformatics, Rollins School of Public Health and Department of Medicine, School of Medicine, Emory University, Atlanta, GA 30322, USA.

PMID: 25005748
PMCID: PMC4184266
DOI: 10.1093/bioinformatics/btu430

Tianwei Yu et al. Bioinformatics. 2014.

Abstract

Motivation: Peak detection is a key step in the preprocessing of untargeted metabolomics data generated from high-resolution liquid chromatography-mass spectrometry (LC/MS). The common practice is to use filters with predetermined parameters to select peaks in the LC/MS profile. This rigid approach can cause suboptimal performance when the chosen peak model and parameters do not suit the data characteristics.

Results: Here we present a method that learns directly from various data features of the extracted ion chromatograms (EICs) to differentiate true peak regions from noise regions in the LC/MS profile. It uses the knowledge of known metabolites, as well as robust machine learning approaches. Unlike currently available methods, this new approach does not assume a parametric peak shape model and allows maximum flexibility. We demonstrate the superiority of the new approach using real data. Because matching to known metabolites entails uncertainties and cannot be considered a gold standard, we also developed a probabilistic receiver-operating characteristic (pROC) approach that can incorporate uncertainties.
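One way to picture an ROC analysis that tolerates uncertain labels is sketched below. It is an illustrative assumption, not the pROC definition from the paper: each EIC carries a classifier score and a hypothetical probability `p_true` of being a real peak, and the expected true-positive and false-positive counts are accumulated as the threshold is lowered. The function names `soft_roc` and `auc` are made up for this sketch.

```python
def soft_roc(scores, p_true):
    """Return (FPR, TPR) points of an ROC curve built from soft labels.

    scores: classifier scores, higher means more peak-like.
    p_true: assumed probability that each item is a true peak (0..1).
    """
    if len(scores) != len(p_true):
        raise ValueError("scores and p_true must have the same length")
    total_pos = sum(p_true)                    # expected number of true peaks
    total_neg = sum(1.0 - p for p in p_true)   # expected number of noise items
    if total_pos <= 0 or total_neg <= 0:
        raise ValueError("need non-zero expected positives and negatives")

    # Visit items from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    exp_tp = exp_fp = 0.0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Items with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            p = p_true[order[i]]
            exp_tp += p          # contributes p to the expected true positives
            exp_fp += 1.0 - p    # and 1 - p to the expected false positives
            i += 1
        points.append((exp_fp / total_neg, exp_tp / total_pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# With hard labels (0 or 1) this reduces to an ordinary ROC curve.
hard_curve = soft_roc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
# With fully uninformative labels the curve follows the diagonal.
soft_curve = soft_roc([0.9, 0.8, 0.3, 0.1], [0.5, 0.5, 0.5, 0.5])
```

In this sketch the soft labels stand in for the uncertainty of database matching: a matched EIC would get a `p_true` below 1 and an unmatched EIC a `p_true` above 0, with the actual values being an assumption the analyst has to supply.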

Availability and implementation: The new peak detection approach is implemented as part of the apLCMS package available at http://web1.sph.emory.edu/apLCMS/

Contact: tyu8@emory.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
The workflow of the machine learning-based peak detection approach

**Fig. 2.**
Illustration of the general idea of using matched/unmatched status as a proxy for true peak/noise status to construct predictive models. (a) Proportion of true peaks is drastically different for matched/unmatched EICs. (b) The goal of the scoring system is to allow real peaks to be called from unmatched EICs

**Fig. 3.**
Comparison of the percentage of peaks matched to known metabolite derivatives for the new machine learning approach, the existing run filter of apLCMS and the matched filter of XCMS. All m/z values used in the training of the machine learning approach were removed. Orbitrap data generated from the NIST SRM 1950 samples was used. Each of the three methods was run with a number of parameter combinations; each point represents one parameter combination. Matching was based on m/z value at the 5 ppm tolerance level. (a) Percent of newly detected features matched to the [M + H]⁺ ion forms of the half of the HMDB metabolites held back from the methods. (b) Percent of newly detected peaks matched to [M + H]⁺, [M + K]⁺, [M + Na]⁺ or [M + NH₄]⁺ ion forms in the MMCD. Arrows: data used in further analysis shown in Figure 4

**Fig. 4.**
Overlap between the unique m/z values found by the new machine learning approach, apLCMS and XCMS. All m/z values used in the training of the machine learning approach were removed. Numbers in parentheses are the percentage of the peaks matched to [M + H]⁺, [M + K]⁺, [M + Na]⁺ or [M + NH₄]⁺ ion forms in MMCD. Matching between the methods and to the database was based on m/z value at the 5 ppm tolerance level

**Fig. 5.**
Comparison of the percentage of peaks matched to known metabolite derivatives for the new machine learning approach, the existing run filter of apLCMS and the matched filter of XCMS. All m/z values used in the training of the machine learning approach were removed. The data was generated from human plasma samples using LC-Fourier Transform MS, as described in Johnson *et al.* (2010). Each of the three methods was run with a number of parameter combinations; each point represents one parameter combination. Matching was based on m/z value at the 5 ppm tolerance level. (a) Percent of newly detected features matched to the [M + H]⁺ ion forms of the half of the HMDB metabolites held back from the methods. (b) Percent of newly detected peaks matched to [M + H]⁺, [M + K]⁺, [M + Na]⁺ or [M + NH₄]⁺ ion forms in the MMCD
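The m/z matching at a 5 ppm tolerance described in these captions can be sketched roughly as follows. This is a minimal illustration, not the apLCMS or XCMS implementation: the function names, the tiny one-compound table and the feature list are hypothetical, and the adduct mass shifts are commonly tabulated monoisotopic values for singly charged positive ions.

```python
# Mass added to a neutral monoisotopic mass M for each singly charged
# positive-mode ion form (Da); commonly tabulated values.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_feature(observed_mz, neutral_masses, tol_ppm=5.0, adducts=ADDUCT_SHIFTS):
    """List (compound, adduct) pairs whose theoretical m/z lies within tol_ppm."""
    hits = []
    for name, mass in neutral_masses.items():
        for adduct, shift in adducts.items():
            theoretical = mass + shift
            if abs(ppm_error(observed_mz, theoretical)) <= tol_ppm:
                hits.append((name, adduct))
    return hits


def percent_matched(feature_mzs, neutral_masses, tol_ppm=5.0):
    """Percentage of detected features with at least one database match."""
    if not feature_mzs:
        return 0.0
    matched = sum(1 for mz in feature_mzs
                  if match_feature(mz, neutral_masses, tol_ppm))
    return 100.0 * matched / len(feature_mzs)


# Hypothetical one-compound database (glucose, C6H12O6) and made-up features.
database = {"glucose": 180.063388}
features = [181.0707, 203.0526, 150.0]
protonated_hits = match_feature(181.0707, database)
share = percent_matched(features, database)
```

A percentage of this kind only indicates how many detected features are consistent with a known ion form; several compounds can share an m/z within 5 ppm, so a match is a proxy rather than an identification.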

References

1. Aberg KM, et al. Feature detection and alignment of hyphenated chromatographic-mass spectrometric data. Extraction of pure ion chromatograms using Kalman tracking. J. Chromatogr. A. 2008;1192:139–146.
2. Cui Q, et al. Metabolite identification via the Madison Metabolomics Consortium Database. Nat. Biotechnol. 2008;26:162–164.
3. Fawcett T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006;27:861–874.
4. Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, NY: 2009.
5. Issaq HJ, et al. Analytical and statistical approaches to metabolomics research. J. Sep. Sci. 2009;32:2183–2199.
