Bioinformatics. 2024 Jun 28;40(Suppl 1):i410-i417.
doi: 10.1093/bioinformatics/btae218.

A learned score function improves the power of mass spectrometry database search

Varun Ananth et al. Bioinformatics.

Abstract

Motivation: One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools.

Results: To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.
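
One way to picture re-purposing a de novo model as a score function is sketched below. It assumes a hypothetical model that, given a spectrum and a candidate peptide, returns a probability distribution over amino acids at each position of that candidate; the candidate's score is then the mean log-probability of its own residues. The function names, the input format, and the averaging are illustrative assumptions, not Casanovo's actual API or the scoring used in the paper.

```python
import math


def score_peptide(peptide, residue_probs, floor=1e-12):
    """Mean log-probability of a candidate peptide's residues.

    residue_probs[i] is a dict mapping amino acid -> probability at
    position i, as produced by a hypothetical de novo model conditioned
    on the spectrum and on the preceding residues of `peptide`.
    """
    if not peptide or len(peptide) != len(residue_probs):
        raise ValueError("need one probability distribution per residue")
    total = 0.0
    for residue, probs in zip(peptide, residue_probs):
        # Floor the probability so an unseen residue does not yield log(0).
        total += math.log(max(probs.get(residue, 0.0), floor))
    return total / len(peptide)


def best_candidate(candidates, predict_probs):
    """Return the (peptide, score) pair with the highest score.

    `predict_probs` is a hypothetical callable standing in for the model:
    it takes a candidate peptide and returns its per-position distributions.
    """
    scored = [(pep, score_peptide(pep, predict_probs(pep))) for pep in candidates]
    return max(scored, key=lambda pair: pair[1])
```

In a database search, `candidates` would be the database peptides whose mass matches the precursor, so the model only ranks a fixed set of sequences instead of generating a sequence from scratch.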

Conflict of interest statement

None declared.

Figures

Figure 1.
Panels (a) through (c) plot the number of peptides detected as a function of FDR threshold for the (a) E. coli, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. (d) Similar to panel (c), but generated using our in-house search engine.
Figure 2.
Each panel plots the number of peptides detected after Percolator post-processing as a function of FDR threshold for the (a) E. coli, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line.
Figure 3.
An upset plot showing the overlap in peptide detections at 1% FDR between Casanovo-DB, Tide, SAGE, and MaxQuant on the human dataset.
Figure 4.
Each panel plots the number of peptides detected as a function of FDR threshold for the E. coli dataset, broken down by whether the precursor had an m/z in (a) the bottom quartile range of 350–467 m/z or (b) the top quartile range of 704–1389 m/z. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. Casanovo-DB performs much worse on low m/z precursors than on high m/z precursors, illustrating the calibration problems that Percolator resolves.
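
The detection curves in these figures count peptides accepted at a given FDR threshold. The sketch below shows one way such a count could be computed, assuming a simple target-decoy estimate (decoys divided by targets above a score cutoff) and one best score per peptide. The abstract does not specify the exact procedure, and `count_at_fdr` is a hypothetical helper, not part of any tool named here.

```python
def count_at_fdr(peptides, threshold=0.01):
    """Count target peptides accepted at an estimated FDR threshold.

    `peptides` is a list of (score, is_decoy) pairs, one per peptide,
    where a higher score means a better match. The FDR at each score
    cutoff is estimated as (#decoys) / (#targets) above that cutoff.
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(peptides, key=lambda item: item[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the largest target count whose estimated FDR is acceptable.
        if targets > 0 and decoys / targets <= threshold:
            accepted = targets
    return accepted
```

Sweeping `threshold` over a range of values and plotting the returned counts would give a curve of the same kind as those in the figures, with one series per score function.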
