PLoS One. 2010 Nov 16;5(11):e15438.

doi: 10.1371/journal.pone.0015438.

RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics

Gelio Alves¹, Aleksey Y Ogurtsov, Yi-Kuo Yu

PMID: 21103371
PMCID: PMC2982831
DOI: 10.1371/journal.pone.0015438

Gelio Alves et al. PLoS One. 2010.

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America.

Abstract

Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given an MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given an MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
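The dynamic-programming idea in the abstract can be illustrated with a minimal sketch. This is not the RAId_aPS implementation: the three-letter alphabet with integer masses, the `peak_score` lookup, and the way the E-value is formed (tail fraction of the histogram times a candidate count) are all assumptions made for illustration. It covers a toy version of mode (i), counting peptides per mass index, and of mode (ii), building a score histogram for an additive score.

```python
from collections import defaultdict

# Hypothetical toy alphabet with integer masses (not real residue masses).
TOY_RESIDUE_MASSES = {"A": 2, "B": 3, "C": 5}

def count_peptides(max_mass, residue_masses=TOY_RESIDUE_MASSES):
    """counts[m] = number of ordered residue sequences whose masses sum to m."""
    counts = [0] * (max_mass + 1)
    counts[0] = 1  # the empty sequence
    for m in range(1, max_mass + 1):
        for r in residue_masses.values():
            if r <= m:
                counts[m] += counts[m - r]
    return counts

def score_histograms(max_mass, peak_score, residue_masses=TOY_RESIDUE_MASSES):
    """hist[m][s] = number of sequences of total mass m with additive score s.

    peak_score (an assumed input) maps a mass index to the integer score
    gained whenever a sequence has a prefix ending at that mass index.
    """
    hist = [defaultdict(int) for _ in range(max_mass + 1)]
    hist[0][0] = 1
    for m in range(1, max_mass + 1):
        gain = peak_score.get(m, 0)
        for r in residue_masses.values():
            if r <= m:
                # Extend every sequence ending at m - r by one residue of mass r.
                for s, n in hist[m - r].items():
                    hist[m][s + gain] += n
    return hist

def e_value(hist_at_mass, score, n_candidates):
    """Assumed form: tail fraction of the histogram times the candidate count."""
    total = sum(hist_at_mass.values())
    if total == 0:
        raise ValueError("no peptides at this mass index")
    tail = sum(n for s, n in hist_at_mass.items() if s >= score)
    return n_candidates * tail / total

if __name__ == "__main__":
    hist = score_histograms(5, {2: 1, 3: 2})
    print(count_peptides(10)[5:], dict(hist[5]), e_value(hist[5], 2, 30))
```

Because the score is additive over mass indices, the histogram at mass `m` depends only on histograms at smaller masses, which is what lets the full peptide space be covered without enumerating sequences one by one.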

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Illustration of APP mass grid with internal structure.**
In addition to showing the basic mass grid, this figure illustrates, using the peptide lengths as an example, the possibility of including additional structures in the (raw) score histogram associated with each mass index. The basic idea of obtaining the score histogram via dynamic programming is explained in the Method section. The key step to incorporate additional structure is to let the (weighted) count associated with each (raw) score be further categorized by the lengths of partial peptides reaching each mass index. In the end, one applies the length correction factor to the raw score to obtain the real score histogram. One may also keep track of the number of () peaks accumulated within the raw score histogram. Again, the factorial contribution can be added at the end, prior to the construction of the final score histogram.
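The length bookkeeping described in this caption can be sketched as follows. This is an illustrative toy, not the authors' code: the integer residue masses, the `peak_score` lookup, and the `length_correction` function are hypothetical, and the weights mentioned in the caption are omitted so that plain counts are stored.

```python
from collections import defaultdict

def stratified_histograms(max_mass, peak_score, residue_masses):
    """hist[m][(length, raw_score)] = number of partial peptides reaching mass index m."""
    hist = [defaultdict(int) for _ in range(max_mass + 1)]
    hist[0][(0, 0)] = 1  # empty peptide: length 0, raw score 0
    for m in range(1, max_mass + 1):
        gain = peak_score.get(m, 0)  # raw score gained at this mass index
        for r in residue_masses:
            if r <= m:
                for (length, raw), n in hist[m - r].items():
                    hist[m][(length + 1, raw + gain)] += n
    return hist

def final_histogram(stratified, length_correction):
    """Apply the length correction only at the end, collapsing the length strata."""
    final = defaultdict(int)
    for (length, raw), n in stratified.items():
        final[length_correction(length, raw)] += n
    return dict(final)

if __name__ == "__main__":
    hist = stratified_histograms(5, {2: 1, 3: 2}, [2, 3, 5])
    # Hypothetical correction: divide the raw score by the peptide length.
    print(final_histogram(hist[5], lambda length, raw: raw / length))
```

Keeping the length as part of the histogram key preserves the additive recursion on the raw score, while the non-additive correction is deferred to a single pass over the final mass index.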

**Figure 2. Example processed spectra from different scoring functions versus the original spectrum.**
The centroid spectrum used has a parent ion mass of Da. In panel (A), the original spectrum is displayed; (B) shows the processed spectrum generated by the filtering protocol of RAId_DbS scoring function; (C) exhibits the processed spectrum generated by the filtering protocol of K-score; while (D) and (E) correspond respectively to the processed spectra produced by XCorr and Hyperscore.

**Figure 3. Histograms of correlations between filtering strategies.**
Used in this plot are raw centroid spectra from the ISB data set. Each raw spectrum has four different processed spectra, one from each of the four filtering strategies. The mass fragments of every filtered spectrum are then read onto a mass grid. The spectrum is then viewed as a vector with non-vanishing components only at the populated component/mass indices. One then normalizes each *filtered* spectrum vector to unit length. An inner product of any two filtered spectral vectors represents the correlation between them. When the spectral quality does not pass a method-dependent threshold, the corresponding filtering protocol may turn the raw spectrum into a null spectrum without further searching the database. For a given pair of filtering methods and a raw spectrum, if each of the two filtering methods produces a nonempty filtered spectrum, one may turn those filtered spectra into spectral vectors and compute their inner product, i.e., their correlation. For each pair of filtering methods, these inner products are accumulated and plotted as a correlation histogram. All six pairwise combinations are shown.
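The correlation measure described in this caption can be sketched as below. The grid spacing (`bin_width`), the choice to sum intensities that fall on the same mass index, and the function names are assumptions for illustration, not details taken from the paper.

```python
import math

def spectrum_vector(peaks, bin_width=1.0):
    """Map (m/z, intensity) peaks onto a mass grid and normalize to unit length.

    Returns a sparse vector as {mass_index: component}; an empty dict stands
    for a null (fully filtered-out) spectrum.
    """
    vec = {}
    for mz, intensity in peaks:
        idx = int(round(mz / bin_width))
        vec[idx] = vec.get(idx, 0.0) + intensity  # assumed: intensities add per index
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return {}
    return {i: v / norm for i, v in vec.items()}

def correlation(u, v):
    """Inner product of two unit vectors; None if either spectrum is null."""
    if not u or not v:
        return None
    return sum(x * v.get(i, 0.0) for i, x in u.items())

if __name__ == "__main__":
    a = spectrum_vector([(100.2, 3.0), (200.4, 4.0)])
    b = spectrum_vector([(100.1, 1.0)])
    print(correlation(a, b))
```

Returning `None` for null spectra mirrors the caption's rule that a pair contributes to the histogram only when both filtering methods produce a nonempty spectrum.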

**Figure 4. Score correlations.**
A subset of the ISB centroid data set was used to perform this evaluation. For each scoring function, when the best hit per spectrum (analyzed using the analysis program that the scoring function was originally used for) is a true positive, that candidate peptide is scored again using the corresponding scoring function implemented in RAId_aPS. Each true positive best hit thus gives rise to two scores, which are plotted using the following rule: the first score is used as the ordinate while the second score (from RAId_aPS) is used as the abscissa. Including spectra, panel A is for the RAId score. Panel B is for Hyperscore and contains spectra. The result of K-score is shown in panel C with spectra. Shown with spectra, panel D documents the results for XCorr.

**Figure 5. E-value accuracy assessment.**
The agreement between the reported E-value and the textbook definition is examined using centroid data (A1–A4 subsets of ISB data set). The random database size used is 500 MB. The molecular weight range considered while searching the database is . In each panel, the dashed lines, corresponding to and , are used to provide a visual guide regarding how close/off the experimental curves are from the theoretical curve.

**Figure 6. ROC curves for the centroid data (A1–A4 of the ISB data set [28]).**
For each of the four scoring functions considered, a set of ROC curves is shown. These ROC curves include the results from running the designated program associated with that scoring function, the results from running RAId_aPS in the database search mode, and the results from combining with each of the three other scoring functions. Panel (A) shows the results from RAId score, whose designated program is RAId_DbS. Panel (B) displays the results from K-score, whose designated program is X!Tandem. Panel (C) exhibits the results from XCorr, which is mostly employed by SEQUEST. Panel (D) presents the results from Hyperscore, whose designated program is also X!Tandem. Instead of using only XCorr (like RAId_aPS), SEQUEST first selects the top candidates using SP score. As shown in panel (C), for centroid data there is an advantage to filtering candidates with the SP score. However, it is also seen that by combining XCorr with either RAId score or Hyperscore, equally good results can be attained without introducing the SP score heuristics.

**Figure 7. ROC curves for the centroid data (A1–A4 of the ISB data set [28]) when considering only the best hit per spectrum.**
For each of the four scoring functions considered, a set of ROC curves is shown. These ROC curves consider only the best hit per spectrum from running the designated program associated with that scoring function, the best hit per spectrum from running RAId_aPS in the database search mode, and the best hit per spectrum from combining with each of the three other scoring functions. Panel (A) shows the results from RAId score, whose designated program is RAId_DbS. Panel (B) displays the results from K-score, whose designated program is X!Tandem. Panel (C) exhibits the results from XCorr, which is mostly employed by SEQUEST. Panel (D) presents the results from Hyperscore, whose designated program is also X!Tandem. Instead of using only XCorr (like RAId_aPS), SEQUEST first selects the top candidates using SP score. As shown in panel (C), for centroid data there is an advantage to filtering candidates with the SP score. However, it is also seen that by combining XCorr with either RAId score or Hyperscore, equally good results can be attained without introducing the SP score heuristics.

**Figure 8. Illustration of RAId_aPS performance when combining three different scoring functions.**
Panel (A) shows the results from the profile data (NHLBI data set [4]), while panel (B) exhibits the results from the centroid data (A1–A4 of the ISB data set [28]). Panel (C) shows the results from the profile data but keeping only the best hit per spectrum, while panel (D) exhibits the results from the centroid data but keeping only the best hit per spectrum.

**Figure 9. Example score PDF (normalized histogram) output by RAId_aPS.**
An MS spectrum of parent ion mass Da is queried with default parameters, and the resulting score PDFs for RAId, K-score, XCorr, and Hyperscore are shown respectively in panels A, B, C, and D. The number of APP within 3 Da of parent ion mass is about .

**Figure 10. Example of reanalyzing output files from another search engine by combining with statistical significance assignment from RAId_aPS.**
In this example, we use the Mascot output files resulting from querying profile spectra (panel (A), the NHLBI data set) and centroid spectra (panel (B), A1–A4 of the ISB data set [28]) against NCBI's nr database, with proteins highly homologous to those present in the mixture removed. Since each data set is from a known mixture of proteins, it is possible to remove the proteins homologous to the true positives from the nr database. We then combine the calibrated E-value of Mascot with the E-value obtained from RAId_aPS when either RAId score, Hyperscore, K-score or XCorr is used.

Cited by

Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution.
Alves G, Yu YK. Bioinformatics. 2016 Sep 1;32(17):2642-9. doi: 10.1093/bioinformatics/btw225. Epub 2016 Apr 29. PMID: 27153659.
A graphical user interface for RAId, a knowledge integrated proteomics analysis suite with accurate statistics.
Joyce B, Lee D, Rubio A, Ogurtsov A, Alves G, Yu YK. BMC Res Notes. 2018 Mar 15;11(1):182. doi: 10.1186/s13104-018-3289-6. PMID: 29544540.
On the importance of well-calibrated scores for identifying shotgun proteomics spectra.
Keich U, Noble WS. J Proteome Res. 2015 Feb 6;14(2):1147-60. doi: 10.1021/pr5010983. Epub 2014 Dec 17. PMID: 25482958.
Peptide identification by tandem mass spectrometry with alternate fragmentation modes.
Guthals A, Bandeira N. Mol Cell Proteomics. 2012 Sep;11(9):550-7. doi: 10.1074/mcp.R112.018556. Epub 2012 May 17. PMID: 22595789. Review.
Improving peptide identification sensitivity in shotgun proteomics by stratification of search space.
Alves G, Yu YK. J Proteome Res. 2013 Jun 7;12(6):2571-81. doi: 10.1021/pr301139y. Epub 2013 May 29. PMID: 23668635.

References

1. Prakash A, Piening B, Whiteaker J, Zhang H, Shaffer SA, et al. Assessing bias in experiment design for large scale mass spectrometry-based quantitative proteomics. Mol Cell Proteomics. 2007;6:1741–1748.
2. Taylor CF, Paton NW, Lilley KS, Binz PA, Julian RK, et al. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol. 2007;25:887–893.
3. Oberg AL, Vitek O. Statistical design of quantitative mass spectrometry-based proteomics experiments. J Proteome Res. 2009;8:2144–2156.
4. Alves G, Ogurtsov AY, Wu WW, Wang G, Shen RF, et al. Calibrating E-values for MS2 library search methods. Biology Direct. 2007;2:26.
5. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002;74:5383–5392.

Grants and funding

Intramural NIH HHS/United States
