Bioinformatics. 2014 Oct 15;30(20):2941-8.
doi: 10.1093/bioinformatics/btu430. Epub 2014 Jul 7.

Improving peak detection in high-resolution LC/MS metabolomics data using preexisting knowledge and machine learning approach

Tianwei Yu et al. Bioinformatics.

Abstract

Motivation: Peak detection is a key step in preprocessing untargeted metabolomics data generated by high-resolution liquid chromatography-mass spectrometry (LC/MS). The common practice is to select peaks in the LC/MS profile using filters with predetermined parameters. This rigid approach can perform suboptimally when the chosen peak model and parameters do not suit the characteristics of the data.

Results: Here we present a method that learns directly from various data features of the extracted ion chromatograms (EICs) to differentiate true peak regions from noise regions in the LC/MS profile. It draws on knowledge of known metabolites and on robust machine learning approaches. Unlike currently available methods, the new approach does not assume a parametric peak shape model, which allows maximum flexibility. We demonstrate the superiority of the new approach using real data. Because matching to known metabolites entails uncertainties and cannot be considered a gold standard, we also developed a probabilistic receiver-operating characteristic (pROC) approach that can incorporate uncertainties.
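The abstract does not spell out how the pROC curve is computed, so the following is only a minimal sketch of one plausible reading, not the authors' definition. It assumes each candidate peak carries a classifier score and a probability `p_true` of being a real peak, and lets each candidate contribute `p_true` to the true-positive count and `1 - p_true` to the false-positive count. The function names `soft_roc_points` and `trapezoid_auc` are hypothetical, and tied scores are not given special treatment.

```python
from typing import List, Sequence, Tuple


def soft_roc_points(scores: Sequence[float],
                    p_true: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points for labels known only as probabilities.

    Assumption (not from the paper): item i counts as p_true[i] of a
    positive and 1 - p_true[i] of a negative.
    """
    if len(scores) != len(p_true):
        raise ValueError("scores and p_true must have the same length")
    total_pos = sum(p_true)
    total_neg = len(p_true) - total_pos
    if total_pos <= 0 or total_neg <= 0:
        raise ValueError("need non-zero expected positives and negatives")

    # Sweep the threshold from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0.0
    points = [(0.0, 0.0)]
    for i in order:
        tp += p_true[i]          # expected true positives gained
        fp += 1.0 - p_true[i]    # expected false positives gained
        points.append((fp / total_neg, tp / total_pos))
    return points


def trapezoid_auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

With hard labels (every `p_true` equal to 0 or 1) and no tied scores, this reduces to an ordinary empirical ROC curve. With every `p_true` equal to 0.5 the curve lies on the diagonal, reflecting labels that carry no information.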

Availability and implementation: The new peak detection approach is implemented as part of the apLCMS package available at http://web1.sph.emory.edu/apLCMS/

Contact: tyu8@emory.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
The workflow of the machine learning-based peak detection approach
Fig. 2.
Illustration of the general idea of using matched/unmatched status as a proxy of true peaks/noise status to construct predictive models. (a) Proportion of true peaks is drastically different for matched/unmatched EICs. (b) The goal of the scoring system is to allow real peaks to be called from unmatched EICs
Fig. 3.
Comparison of the percentage of peaks matched to known metabolite derivatives among the new machine learning approach, the existing run filter of apLCMS and the matched filter of XCMS. All m/z values used in training the machine learning approach were removed. Orbitrap data generated from the NIST SRM 1950 samples were used. Each of the three methods was run with a number of parameter combinations; each point represents one combination. Matching was based on m/z value at the 5 ppm tolerance level. (a) Percent of newly detected features matched to the [M + H]+ ion forms of the half of the HMDB metabolites held back from the methods. (b) Percent of newly detected peaks matched to [M + H]+, [M + K]+, [M + Na]+ or [M + NH4]+ ion forms in the MMCD. Arrows: data used in further analysis shown in Figure 4
Fig. 4.
Overlap between the unique m/z values found by the new machine learning approach, apLCMS and XCMS. All m/z values used in training the machine learning approach were removed. Numbers in parentheses are the percentages of peaks matched to [M + H]+, [M + K]+, [M + Na]+ or [M + NH4]+ ion forms in the MMCD. Matching between the methods and to the database was based on m/z value at the 5 ppm tolerance level
Fig. 5.
Comparison of the percentage of peaks matched to known metabolite derivatives among the new machine learning approach, the existing run filter of apLCMS and the matched filter of XCMS. All m/z values used in training the machine learning approach were removed. The data were generated from human plasma samples using LC-Fourier Transform MS, as described in Johnson et al. (2010). Each of the three methods was run with a number of parameter combinations; each point represents one combination. Matching was based on m/z value at the 5 ppm tolerance level. (a) Percent of newly detected features matched to the [M + H]+ ion forms of the half of the HMDB metabolites held back from the methods. (b) Percent of newly detected peaks matched to [M + H]+, [M + K]+, [M + Na]+ or [M + NH4]+ ion forms in the MMCD
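The figure captions above rely on matching detected m/z values to database ion forms at a 5 ppm tolerance. As a rough illustration of that kind of matching (not the apLCMS implementation), the sketch below adds commonly tabulated monoisotopic mass shifts for the four positive-ion adducts to a neutral mass and compares the result with an observed m/z. The function names, the choice of the theoretical m/z as the ppm denominator, and the example masses are assumptions made for this example.

```python
from typing import Dict, List, Sequence

# Commonly tabulated monoisotopic mass shifts (Da) for singly charged
# positive adducts; treat them as illustrative constants.
ADDUCT_SHIFTS: Dict[str, float] = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Absolute mass error in parts per million, relative to the theoretical m/z."""
    return abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_adducts(observed_mz: float, neutral_mass: float,
                  tol_ppm: float = 5.0) -> List[str]:
    """Names of adduct forms of neutral_mass that lie within tol_ppm of observed_mz."""
    return [name for name, shift in ADDUCT_SHIFTS.items()
            if ppm_error(observed_mz, neutral_mass + shift) <= tol_ppm]


def percent_matched(observed_mzs: Sequence[float],
                    neutral_masses: Sequence[float],
                    tol_ppm: float = 5.0) -> float:
    """Percentage of observed m/z values that match any adduct of any database mass."""
    if not observed_mzs:
        return 0.0
    hits = sum(
        1 for mz in observed_mzs
        if any(match_adducts(mz, mass, tol_ppm) for mass in neutral_masses)
    )
    return 100.0 * hits / len(observed_mzs)


# Illustrative input: monoisotopic mass of glucose (C6H12O6).
example_neutral_mass = 180.063388
print(match_adducts(181.0707, example_neutral_mass))
print(percent_matched([181.0707, 203.0526, 500.0], [example_neutral_mass]))
```

The nested loop is quadratic in the number of peaks and database entries; for a full database, sorting the theoretical m/z values and using a binary search would be the more practical design.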
