Learning score function parameters for improved spectrum identification in tandem mass spectrometry experiments

Marina Spivak¹, Michael S Bereman, Michael J Maccoss, William Stafford Noble

PMID: 22866926
PMCID: PMC3436966
DOI: 10.1021/pr300234m

Marina Spivak et al. J Proteome Res. 2012 Sep 7;11(9):4499-508. doi: 10.1021/pr300234m. Epub 2012 Aug 15.

Affiliation

¹ Department of Genome Sciences, University of Washington, Seattle, Washington, USA.

Abstract

The identification of proteins from spectra derived from a tandem mass spectrometry experiment involves several challenges: matching each observed spectrum to a peptide sequence, ranking the resulting collection of peptide-spectrum matches, assigning statistical confidence estimates to the matches, and identifying the proteins. The present work addresses algorithms to rank peptide-spectrum matches. Many of these algorithms, such as PeptideProphet, IDPicker, or Q-ranker, follow a similar methodology that includes representing peptide-spectrum matches as feature vectors and using optimization techniques to rank them. We propose a richer and more flexible feature set representation that is based on the parametrization of the SEQUEST XCorr score and that can be used by all of these algorithms. This extended feature set allows a more effective ranking of the peptide-spectrum matches based on the target-decoy strategy, in comparison to a baseline feature set devoid of these XCorr-based features. Ranking using the extended feature set yields a 10-40% improvement in the number of distinct peptide identifications across a range of q-value thresholds. While this work is inspired by the model of the theoretical spectrum and the similarity measure between spectra used specifically by SEQUEST, the method itself can be applied to the output of any database search. Further, our approach can be trivially extended beyond XCorr to any linear operator that can serve as a similarity score between experimental spectra and peptide sequences.
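The target-decoy evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each peptide-spectrum match is reduced to a hypothetical `(score, is_decoy)` pair, estimates the false discovery rate at each rank as decoys divided by targets, and takes the q-value as the minimum such estimate at that rank or any lower-scoring one. The function names `target_decoy_qvalues` and `count_targets_at` are illustrative, and the sketch counts matches rather than distinct peptides.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy); a hypothetical minimal representation


def target_decoy_qvalues(psms: List[PSM]) -> List[Tuple[float, bool, float]]:
    """Return PSMs sorted by descending score, each with an estimated q-value."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Estimated FDR at each rank: decoys seen so far / targets seen so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: smallest FDR at this rank or at any lower-scoring rank.
    qvals = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    qvals.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qvals)]


def count_targets_at(psms: List[PSM], threshold: float) -> int:
    """Count target PSMs accepted at the given q-value threshold."""
    return sum(
        1
        for _, is_decoy, q in target_decoy_qvalues(psms)
        if not is_decoy and q <= threshold
    )


# Toy data, invented for illustration only.
example = [(5.0, False), (4.0, False), (3.0, True), (2.0, False), (1.0, True)]
print(count_targets_at(example, 0.01))
print(count_targets_at(example, 0.40))
```

Comparing two feature sets under this scheme amounts to ranking the same matches with each scoring function and comparing the accepted-target counts across a range of thresholds.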

Figures

**Figure 3. Comparison of base and extended feature sets on six replicate *C. elegans* data sets**
Panel A shows the number of unique target peptides identified in two or more replicate data sets as a function of q-value threshold for the ranking algorithm using base and extended feature sets. Panel B shows the average of absolute values of retention time differences (in minutes) of peptides identified in two or more replicate data sets as a function of number of peptides at the top of the rank list.

**Figure 4. Percent of peptide-spectrum matches that were considered “high quality” by the Bullseye algorithm**
The figure shows the percent of “Bullseye hits” among the peptide-spectrum matches identified using the extended feature set or base feature set as a function of the number of peptide-spectrum matches at the top of the ranked list in the six replicate runs.

**Figure 5. Comparison of PeptideProphet, Percolator and Q-ranker with base and extended feature sets**
Panels A–C show the number of unique target peptides identified as a function of q-value threshold. Panels D–F show the number of *known* target peptides identified as a function of q-value threshold.

**Figure 6. Comparison of RE-CID and fHCD data**
The figure shows analysis of two *C. elegans* data sets that were generated using either RE-CID or fHCD collision-induced dissociation. The blue and red lines correspond to the database search conducted using a theoretical spectrum model with all peaks included, which we call the “original” search. This search was subsequently analyzed using either the base or the extended feature set. The cyan line corresponds to the database search that used theoretical spectra without flanking peaks, with subsequent analysis using the base feature set. The magenta line corresponds to the database search that used theoretical spectra without flanking peaks or b-ions, with subsequent analysis using the base feature set.
