Nat Commun. 2017 Nov 14;8(1):1494.
doi: 10.1038/s41467-017-01318-5.

Significance estimation for large scale metabolomics annotations by spectral matching

Kerstin Scheubert et al. Nat Commun. 2017.

Abstract

The annotation of small molecules in untargeted mass spectrometry relies on the matching of fragment spectra to reference library spectra. While various spectrum-spectrum match scores exist, the field lacks statistical methods for estimating the false discovery rate (FDR) of these annotations. We present empirical Bayes and target-decoy based methods to estimate the FDR for 70 public metabolomics data sets. We show that the spectral matching settings need to be adjusted for each project. By adjusting the scoring parameters and thresholds, the number of annotations rose, on average, by 139% (ranging from -92% to +5705%) when compared with a default parameter set available at GNPS. The FDR estimation methods presented will enable a user to assess the scoring criteria for large-scale analysis of mass spectrometry-based metabolomics data, a capability that has been essential in the advancement of proteomics, transcriptomics, and genomics science.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Fig. 1
False discovery rate estimation. a Overview. The empirical Bayes approach estimates FDRs from a two-component mixture of distributions representing true and false hits (positive identifications). In the target-decoy approach, query spectra are searched against a target and a decoy spectral library, and FDRs are estimated from the merged and sorted list of spectrum matches. b–d To construct a decoy spectral library, we implemented three methods. b Naive method: fragment ions from the reference library are randomly added to the decoy spectrum. c Spectrum-based method: fragment ions are iteratively added to the decoy spectrum, conditional on fragment ions that have previously been added. d Fragmentation tree-based method: a fragmentation tree is computed from the target spectrum and its root is relocated. New formulas of fragments are calculated according to the losses in the tree. Fragments with invalid formulas are relocated.
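The target-decoy step in panel a can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `target_decoy_qvalues` is hypothetical, and the estimator used here (decoy hits divided by target hits at each score threshold, with the q-value as the minimum FDR at that or any lower threshold) is a common convention that the paper may refine or replace.

```python
def target_decoy_qvalues(target_scores, decoy_scores):
    """Estimate q-values for target hits from a merged, score-sorted hit list.

    Assumption: FDR at a threshold is approximated as
    (#decoy hits above threshold) / (#target hits above threshold).
    """
    # Merge both searches and label each hit as decoy (True) or target (False).
    merged = [(s, False) for s in target_scores]
    merged += [(s, True) for s in decoy_scores]
    # Sort by descending score; on ties, decoys come first (conservative).
    merged.sort(key=lambda hit: (-hit[0], not hit[1]))

    # Walk down the list and record the estimated FDR at each target hit.
    fdrs = []
    n_target = 0
    n_decoy = 0
    for score, is_decoy in merged:
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
            fdrs.append((score, min(1.0, n_decoy / n_target)))

    # q-value: smallest FDR reachable at this score or any lower score.
    qvalues = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append((score, running_min))
    qvalues.reverse()
    return qvalues


if __name__ == "__main__":
    for score, q in target_decoy_qvalues([0.9, 0.8, 0.7, 0.6], [0.75, 0.5]):
        print(f"score={score:.2f}  q={q:.2f}")
```

Annotations would then be accepted down to the lowest score whose q-value is still at or below the chosen level, for example 0.01 for a 1% FDR.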
Fig. 2
Quality assessment of FDR estimation for Agilent query spectra searched against the GNPS library using the MassBank scoring function. a–h Distribution of p-values. For searching in the unfiltered target spectral library (a–d), p-values are estimated using the empirical Bayes approach. For searching the noise-filtered target spectral library (e–h), p-values are estimated using the fragmentation tree-based target-decoy approach. Distributions contain p-values from ten decoy spectral libraries. p-value distribution for both true and false hits (a, e), for true hits only (b, f), and for false hits only (c, g). By definition, the distribution of p-values for false hits has to be uniform, corresponding to the main diagonal in the p-value quantile-quantile (qq) plots (d, h). The qq plots for the other methods are provided as Supplementary Fig. 1. i, j q-value plots for Agilent data (q-value plots for MassBank are provided as Supplementary Fig. 2). Estimated (y-axis) vs. true q-values (x-axis) in the unfiltered (i) and noise-filtered (j) versions of the GNPS library. The small red line indicates a cosine score of 0.7. For the fragmentation tree-based method, we searched against the noise-filtered GNPS library only, since this approach applies noise filtering by design. The naive target-decoy approach can be seen as a baseline method for comparison. For target-decoy methods, results were averaged over ten decoy spectral libraries (Supplementary Fig. 4).
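The uniformity requirement for false-hit p-values can be checked numerically as well as visually. The sketch below is an assumption-laden illustration rather than the procedure used in the paper: `empirical_pvalues` derives p-values from a set of null (for example decoy) scores with a +1 pseudocount, and `max_qq_deviation` measures how far the sorted p-values stray from the main diagonal of a qq plot. Both function names are hypothetical.

```python
import bisect


def empirical_pvalues(scores, null_scores):
    """p-value of each score: share of null scores that are at least as high.

    Assumption: a +1 pseudocount is used so that no p-value is exactly zero.
    """
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    pvalues = []
    for s in scores:
        # Count null scores >= s via binary search on the sorted list.
        n_greater_equal = n - bisect.bisect_left(null_sorted, s)
        pvalues.append((n_greater_equal + 1) / (n + 1))
    return pvalues


def max_qq_deviation(pvalues):
    """Largest vertical distance between sorted p-values and uniform quantiles.

    A value near 0 means the points lie close to the main diagonal.
    """
    ordered = sorted(pvalues)
    n = len(ordered)
    return max(abs(p - (i + 1) / (n + 1)) for i, p in enumerate(ordered))


if __name__ == "__main__":
    null = [0.1, 0.2, 0.3, 0.4]
    print(empirical_pvalues([0.35, 0.5, 0.05], null))
    print(max_qq_deviation([0.25, 0.5, 0.75]))
```

A large deviation for known false hits would suggest that the null model, or the decoy library behind it, does not represent false matches well.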
Fig. 3
FDR-based annotations for 70 metabolomics projects. These projects comprise metabolomics data derived from humans, microbes, plants, marine organisms, and other sources. The plot shows the percent gain in annotations for each of the data sets in GNPS-MassIVE at 1% and 5% FDR in relation to the mass spectrometer used. A plot sorted by sample characteristic is provided as Supplementary Fig. 7. The impact on identification rates with the MassBank and Agilent data sets is shown on the right.
Fig. 4
The impact of the number of matching fragment ions in a spectrum and the cosine score at 1% FDR. a Frequency of data sets in relation to the number of MMP to match and the cosine score at 1% FDR. b The number of MS/MS matches in relation to the minimum number of matched fragment ions and the cosine score. An alternative 3D plot of these figures can be found in Supplementary Fig. 9.
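The two matching criteria in this figure, a cosine score and a minimum number of matched fragment ions, can be illustrated with a simplified sketch. This is not the GNPS or MassBank scoring function: the greedy peak pairing, the m/z tolerance of 0.02, and the default of 6 matched peaks are illustrative assumptions, and the names `cosine_score` and `accept_match` are hypothetical. Only the cosine value of 0.7 appears in the source (the red line in Fig. 2).

```python
import math


def cosine_score(query, reference, tol=0.02):
    """Return (cosine, n_matched) for two peak lists of (mz, intensity) pairs.

    Assumption: peaks are paired greedily by largest intensity product,
    and each peak is used at most once.
    """
    # Collect all candidate peak pairs whose m/z values agree within tol.
    pairs = []
    for i, (mz_q, int_q) in enumerate(query):
        for j, (mz_r, int_r) in enumerate(reference):
            if abs(mz_q - mz_r) <= tol:
                pairs.append((int_q * int_r, i, j))
    pairs.sort(reverse=True)

    used_q = set()
    used_r = set()
    dot = 0.0
    for product, i, j in pairs:
        if i not in used_q and j not in used_r:
            used_q.add(i)
            used_r.add(j)
            dot += product

    norm_q = math.sqrt(sum(inten ** 2 for _, inten in query))
    norm_r = math.sqrt(sum(inten ** 2 for _, inten in reference))
    if norm_q == 0 or norm_r == 0:
        return 0.0, 0
    return dot / (norm_q * norm_r), len(used_q)


def accept_match(query, reference, min_cosine=0.7, min_matched_peaks=6, tol=0.02):
    """Accept a spectrum match only if both thresholds are met."""
    cosine, n_matched = cosine_score(query, reference, tol)
    return cosine >= min_cosine and n_matched >= min_matched_peaks


if __name__ == "__main__":
    spectrum = [(100.0, 3.0), (200.0, 4.0)]
    print(cosine_score(spectrum, spectrum))
    print(accept_match(spectrum, spectrum, min_matched_peaks=2))
```

Scanning a grid of `min_cosine` and `min_matched_peaks` values, and estimating the FDR for each combination, is one way to explore the trade-off the figure describes.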
