J Proteome Res. 2009 Jul;8(7):3737-45.

Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets

Marina Spivak¹, Jason Weston, Léon Bottou, Lukas Käll, William Stafford Noble

Affiliation

¹ NEC Labs America, Princeton, New Jersey 08540, USA.

PMID: 19385687
PMCID: PMC2710313
DOI: 10.1021/pr801109k

Abstract

Shotgun proteomics coupled with database search software allows the identification of a large number of peptides in a single experiment. However, some existing search algorithms, such as SEQUEST, use score functions that are designed primarily to identify the best peptide for a given spectrum. Consequently, when comparing identifications across spectra, the SEQUEST score function Xcorr fails to discriminate accurately between correct and incorrect peptide identifications. Several machine learning methods have been proposed to address the resulting classification task of distinguishing between correct and incorrect peptide-spectrum matches (PSMs). A recent example is Percolator, which uses semisupervised learning and a decoy database search strategy to learn to distinguish between correct and incorrect PSMs identified by a database search algorithm. The current work describes three improvements to Percolator. (1) Percolator's heuristic optimization is replaced with a clear objective function, with intuitive reasons behind its choice. (2) Tractable nonlinear models are used instead of linear models, leading to improved accuracy over the original Percolator. (3) A method, Q-ranker, for directly optimizing the number of identified spectra at a specified q value is proposed, which achieves further gains.

Figures

**Figure 1. Three types of loss function**
Each panel plots the loss as a function of the difference between the true and predicted labels. The squared loss L(f(x), y) = (f(x) − y)² is often used in regression problems, but also in classification [22]. The hinge loss L(f(x), y) = max(0, 1 − yf(x)) is used as a convex approximation to the zero-one loss in support vector machines [8]. The sigmoid loss L(f(x), y) = 1/(1 + exp(yf(x))) is perhaps less commonly used, but is discussed in, e.g., [23, 27].
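The three losses in this caption can be written out as a minimal sketch. It assumes labels y in {−1, +1} and a real-valued score f(x), and it takes the sigmoid loss in its standard form 1/(1 + exp(y f(x))); the function names are illustrative and do not come from the paper.

```python
import math


def squared_loss(score, label):
    """Squared loss (f(x) - y)^2."""
    return (score - label) ** 2


def hinge_loss(score, label):
    """Hinge loss max(0, 1 - y f(x)); zero once the margin y f(x) reaches 1."""
    return max(0.0, 1.0 - label * score)


def sigmoid_loss(score, label):
    """Sigmoid loss 1 / (1 + exp(y f(x))), assumed standard form.

    Both branches compute the same value; the split only avoids
    overflow in math.exp for large positive margins.
    """
    z = label * score
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


if __name__ == "__main__":
    for margin in (-2.0, 0.0, 0.5, 2.0):
        print(margin,
              squared_loss(margin, 1),
              hinge_loss(margin, 1),
              sigmoid_loss(margin, 1))
```

Unlike the squared and hinge losses, the sigmoid loss is bounded, so a single badly misclassified example cannot dominate the objective.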

**Figure 2. Comparison of loss functions**
Each panel plots the number of accepted PSMs for the yeast (A) training set and (B) test set as a function of the q value threshold. Each series corresponds to one of the three loss functions shown in Figure 1, with series for Percolator and SEQUEST included for comparison.
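The quantity on the vertical axis, the number of accepted PSMs at a given q value threshold, can be estimated from a target-decoy search roughly as follows. This is a simplified sketch, not the paper's exact procedure: it assumes equal-sized target and decoy databases, estimates the false discovery rate as decoys/targets above each score threshold, and applies no correction for the proportion of incorrect target PSMs. The function names are hypothetical.

```python
def target_q_values(target_scores, decoy_scores):
    """Return (score, q_value) pairs for target PSMs, best score first."""
    labelled = [(s, True) for s in target_scores]
    labelled += [(s, False) for s in decoy_scores]
    # Descending score; on ties decoys sort first, which is conservative.
    labelled.sort(key=lambda pair: (-pair[0], pair[1]))

    targets = decoys = 0
    fdrs = []
    for score, is_target in labelled:
        if is_target:
            targets += 1
            # Estimated FDR if the threshold is placed at this target PSM.
            fdrs.append((score, decoys / targets))
        else:
            decoys += 1

    # q value: the minimum FDR over this threshold and all looser ones.
    q_values = []
    running_min = float("inf")
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append((score, running_min))
    q_values.reverse()
    return q_values


def count_accepted(target_scores, decoy_scores, q_threshold):
    """Number of target PSMs whose q value is at or below the threshold."""
    pairs = target_q_values(target_scores, decoy_scores)
    return sum(1 for _, q in pairs if q <= q_threshold)


if __name__ == "__main__":
    targets = [9.0, 8.0, 7.0, 5.0, 3.0]
    decoys = [6.0, 4.0, 2.0]
    for threshold in (0.01, 0.25, 0.5):
        print(threshold, count_accepted(targets, decoys, threshold))
```

Sweeping `q_threshold` over a grid and plotting `count_accepted` gives one series of the kind described in the caption.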

**Figure 3. “Cutting” the hinge loss makes a sigmoid-like loss called the *ramp loss***
Making the hinge loss have zero gradient when z = y_i f(x_i) < s for some chosen value s effectively makes a piece-wise linear version of a sigmoid function.
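One common way to express this construction is as the difference of two hinge functions, which is sketched below. The default s = −1 and the function names are assumptions made for illustration; the caption only says that s is a chosen value.

```python
def hinge(z, margin=1.0):
    """Hinge function max(0, margin - z) on the signed margin z = y f(x)."""
    return max(0.0, margin - z)


def ramp_loss(z, s=-1.0):
    """Ramp loss: hinge(z, 1) - hinge(z, s), requiring s < 1.

    Equals 0 for z >= 1, equals 1 - z for s <= z < 1, and is flat
    at 1 - s for z < s.
    """
    if s >= 1.0:
        raise ValueError("s must be smaller than 1")
    return hinge(z, 1.0) - hinge(z, s)


def ramp_subgradient(z, s=-1.0):
    """A subgradient of the ramp loss with respect to z.

    Only margins in [s, 1) contribute; examples with z < s sit on the
    flat part and are ignored by a gradient step.
    """
    return -1.0 if s <= z < 1.0 else 0.0


if __name__ == "__main__":
    for z in (-5.0, -1.0, 0.0, 0.5, 2.0):
        print(z, ramp_loss(z), ramp_subgradient(z))
```

Because the loss is flat below s, badly misranked examples stop influencing the update, which is the sigmoid-like behaviour the caption describes.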

**Figure 4. Comparison of Percolator, direct classification and Q-ranker**
The figure plots the number of accepted PSMs as a function of q value threshold for the yeast data set. Each series corresponds to a different ranking algorithm, including Percolator as well as linear and nonlinear versions of the direct classification algorithm and Q-ranker. The nonlinear methods use 5 hidden units.
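To make "nonlinear with 5 hidden units" concrete, the sketch below scores a PSM feature vector with a one-hidden-layer network. The tanh activation, the initialization range, and all names are assumptions for illustration; the caption states only the number of hidden units.

```python
import math
import random


def init_params(n_features, n_hidden=5, seed=0):
    """Small random weights for a one-hidden-layer scoring network."""
    rng = random.Random(seed)
    return {
        # Hidden layer: one weight row and one bias per hidden unit.
        "W": [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
              for _ in range(n_hidden)],
        "b": [0.0] * n_hidden,
        # Output layer: one weight per hidden unit plus a bias.
        "v": [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)],
        "c": 0.0,
    }


def score(params, features):
    """f(x) = sum_k v_k * tanh(w_k . x + b_k) + c (assumed form)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(params["W"], params["b"])
    ]
    return sum(v * h for v, h in zip(params["v"], hidden)) + params["c"]


if __name__ == "__main__":
    params = init_params(n_features=17)
    example = [0.5] * 17  # placeholder feature vector, not real PSM data
    print(score(params, example))
```

Setting `n_hidden` aside and using a single weighted sum of the features recovers the linear case that the nonlinear variants are compared against.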

**Figure 5. Comparison of training optimization methods (iteration vs. error rate)**
The Q-ranker optimization starts from the best result of direct optimization achieved during the course of training and continues for a further 300 iterations. These results are on the training set. Note that for each q value choice, Q-ranker improves the training error over the best result from the classification algorithm.

**Figure 6. Comparison of PeptideProphet, Percolator and Q-ranker on four data sets**
Each panel plots the number of accepted target PSMs as a function of q value. The series correspond to the three different algorithms, including two variants of Q-ranker that use 17 features and 37 features.

References

1. Anderson DC, Li W, Payan DG, Noble WS. A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. Journal of Proteome Research. 2003;2(2):137–146.
2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B. 1995;57:289–300.
3. Brosch M, Yu L, Hubbard T, Choudhary J. Accurate and sensitive peptide identification with Mascot Percolator. 2008 Submitted.
4. Choi H, Ghosh D, Nesvizhskii A. Statistical validation of peptide identifications in large-scale proteomics using target-decoy database search strategy and flexible mixture modeling. Journal of Proteome Research. 2008;7(1):286–292.
5. Choi H, Nesvizhskii AI. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. Journal of Proteome Research. 2008;7(1):254–265.

Grants and funding

R01 EB007057/EB/NIBIB NIH HHS/United States
