Bioinformatics. 2024 Jun 28;40(Suppl 1):i410-i417.

doi: 10.1093/bioinformatics/btae218.

A learned score function improves the power of mass spectrometry database search

Varun Ananth¹, Justin Sanders¹, Melih Yilmaz¹, Bo Wen², Sewoong Oh¹, William Stafford Noble¹,²

Affiliations

¹ Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA.
² Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.

PMID: 38940129
PMCID: PMC11211853
DOI: 10.1093/bioinformatics/btae218

Varun Ananth et al. Bioinformatics. 2024.

Abstract

Motivation: One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art de novo sequencing methods use machine learning, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained on massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools.

Results: To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.
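
As a rough illustration of how an autoregressive de novo model could be reused to score a given peptide-spectrum pair, the sketch below feeds a candidate peptide through a stand-in model one residue at a time and averages the log-probabilities the model assigns to the candidate's own residues. Everything here is an assumption made for illustration: `toy_next_residue_probs` is a hypothetical placeholder rather than Casanovo's API, and the abstract does not specify how Casanovo-DB aggregates per-residue probabilities into a score.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical model interface (not Casanovo's API): given a spectrum and the
# residues decoded so far, return a probability for each possible next residue.
NextResidueModel = Callable[[Sequence[float], str], Dict[str, float]]

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def score_peptide(model: NextResidueModel, spectrum: Sequence[float], peptide: str) -> float:
    """Mean per-residue log-probability of `peptide` given `spectrum`.

    The model is conditioned on the candidate's own prefix at every step,
    so it scores a fixed sequence instead of decoding a new one.
    """
    if not peptide:
        raise ValueError("peptide must be non-empty")
    total = 0.0
    for i, residue in enumerate(peptide):
        probs = model(spectrum, peptide[:i])
        # Floor the probability so an unseen residue does not produce log(0).
        total += math.log(max(probs.get(residue, 0.0), 1e-12))
    return total / len(peptide)


def best_candidate(model: NextResidueModel, spectrum: Sequence[float], candidates: List[str]) -> str:
    """Return the database candidate with the highest score for this spectrum."""
    return max(candidates, key=lambda p: score_peptide(model, spectrum, p))


def toy_next_residue_probs(spectrum: Sequence[float], prefix: str) -> Dict[str, float]:
    """Toy stand-in that ignores the spectrum and favors the sequence PEPTIDE."""
    target = "PEPTIDE"
    if len(prefix) >= len(target):
        return {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    probs = {aa: 0.01 for aa in ALPHABET}
    probs[target[len(prefix)]] = 0.81  # 19 * 0.01 + 0.81 = 1.0
    return probs


if __name__ == "__main__":
    candidates = ["PEPTIDE", "PEPTIDA", "AAAAAAA"]
    for p in candidates:
        print(p, round(score_peptide(toy_next_residue_probs, [], p), 3))
    print("best:", best_candidate(toy_next_residue_probs, [], candidates))
```

In a database search setting, `candidates` would be the database peptides whose mass falls within the precursor tolerance of the spectrum, which is what distinguishes this use from free de novo decoding.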

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Each panel plots the number of peptides detected as a function of FDR threshold for the (a) *E. coli*, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. (d) Similar to panel (c), but generated using our in-house search engine.

**Figure 2.**
Each panel plots the number of peptides detected after Percolator post-processing as a function of FDR threshold for the (a) *E. coli*, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line.

**Figure 3.**
An upset plot showing the overlap in peptide detections at 1% FDR between Casanovo-DB, Tide, SAGE, and MaxQuant on the human dataset.

**Figure 4.**
Each panel plots the number of peptides detected as a function of FDR threshold for the *E. coli* dataset, broken down by whether the precursor had an *m/z* in (a) the bottom quartile range of 350–467 *m/z* or (b) the top quartile range of 704–1389 *m/z*. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. Casanovo-DB performs much worse on low *m/z* precursors than on those with high *m/z*, exemplifying the calibration problems that are resolved by Percolator.
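
Curves like those described in the captions above can be computed from a list of scored peptides once each one is labeled as a target or a decoy. The sketch below assumes a simple target-decoy estimate, FDR ≈ (decoys + 1) / targets among all matches at or above a score cutoff, converted to monotone q-values. The abstract does not state which estimator was used, so the formula, the function names, and the example scores are assumptions for illustration only.

```python
from typing import List, Tuple

Match = Tuple[float, bool]  # (score, is_decoy); a higher score is a better match


def q_values(matches: List[Match]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score."""
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: decoys (plus one) over targets accepted so far.
        fdrs.append(min(1.0, (decoys + 1) / max(targets, 1)))
    # The q-value is the smallest FDR at this rank or at any more permissive cutoff.
    qs = fdrs[:]
    for i in range(len(qs) - 2, -1, -1):
        qs[i] = min(qs[i], qs[i + 1])
    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]


def detections_at(matches: List[Match], threshold: float) -> int:
    """Count target peptides accepted at the given FDR threshold."""
    return sum(1 for _, is_decoy, q in q_values(matches) if not is_decoy and q <= threshold)


if __name__ == "__main__":
    # Made-up scores: four targets, one decoy, then one more target.
    example = [(10.0, False), (9.0, False), (8.0, False), (7.0, False),
               (6.0, True), (5.0, False)]
    for threshold in (0.1, 0.25, 0.4):
        print(threshold, detections_at(example, threshold))
```

Evaluating `detections_at` over a grid of thresholds gives one series of such a plot. With the +1 correction in this estimator, at least 100 targets must be accepted before anything can pass a 1% threshold, so the toy example uses much looser thresholds.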

Grants and funding

2245300/National Science Foundation
