PLoS One. 2010 Nov 16;5(11):e15438.
doi: 10.1371/journal.pone.0015438.

RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics

Gelio Alves et al. PLoS One. 2010.

Abstract

Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or the heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when the experimental setup changes. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given an MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given an MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
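Mode (i), counting all possible peptides in a mass range, lends itself to a small dynamic-programming illustration. The following is a minimal sketch and not the RAId_aPS implementation: it assumes integer (nominal) residue masses, a three-letter alphabet chosen only to keep the example short, and hypothetical function names `count_peptides` and `count_in_range`.

```python
# Rounded nominal residue masses in Da for a deliberately tiny alphabet
# (an assumption for illustration; a real tool would use all residues
# and a finer mass grid).
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87}


def count_peptides(max_mass, residue_mass=RESIDUE_MASS):
    """Return counts, where counts[m] is the number of ordered residue
    sequences whose residue masses sum exactly to m."""
    counts = [0] * (max_mass + 1)
    counts[0] = 1  # the empty sequence reaches mass index 0
    for m in range(1, max_mass + 1):
        for mass in residue_mass.values():
            if mass <= m:
                # Every sequence reaching m - mass extends by one residue.
                counts[m] += counts[m - mass]
    return counts


def count_in_range(lo, hi, residue_mass=RESIDUE_MASS):
    """Total number of sequences with summed residue mass in [lo, hi]."""
    counts = count_peptides(hi, residue_mass)
    return sum(counts[lo:hi + 1])
```

With this toy alphabet, mass index 128 is reached by two sequences (GA and AG), so the count depends on residue order as well as composition.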

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Illustration of APP mass grid with internal structure.
In addition to showing the basic mass grid, this figure illustrates, using peptide lengths as an example, the possibility of including additional structure in the (raw) score histogram associated with each mass index. The basic idea of obtaining the score histogram via dynamic programming is explained in the Method section. The key step in incorporating additional structure is to let the (weighted) count associated with each (raw) score be further categorized by the lengths of the partial peptides reaching each mass index. In the end, one applies the length correction factor to the raw score to obtain the real score histogram. Evidently, one may also keep track of the number of formula image (formula image) peaks accumulated within the raw score histogram. Again, the factorial contribution can be added at the end, prior to the construction of the final score histogram.
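The idea of carrying a raw score histogram along the mass grid can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: residue masses are integers, the additive score is a hypothetical integer gain `peak_score[m]` collected whenever a partial peptide lands on mass index `m`, and the length categorization and correction factors described in the caption are omitted.

```python
from collections import defaultdict

# Rounded nominal residue masses in Da (tiny alphabet, for illustration only).
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87}


def score_histograms(max_mass, peak_score, residue_mass=RESIDUE_MASS):
    """Return hist, where hist[m][s] is the number of partial peptides
    that reach mass index m with accumulated raw score s.

    peak_score maps a mass index to the (hypothetical) integer score
    gained when a partial peptide lands on that index."""
    hist = [defaultdict(int) for _ in range(max_mass + 1)]
    hist[0][0] = 1  # empty peptide: mass 0, score 0
    for m in range(1, max_mass + 1):
        gain = peak_score.get(m, 0)
        for r in residue_mass.values():
            if r <= m:
                # Extend every partial peptide ending at m - r by one residue.
                for s, n in hist[m - r].items():
                    hist[m][s + gain] += n
    return hist
```

For example, with `peak_score = {57: 2, 128: 3}` the two peptides reaching index 128 score differently: GA passes through 57 and collects 2 + 3 = 5, while AG passes through 71 and collects only 3. Tracking peptide length would amount to adding a second key (the number of residues) to each histogram entry.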
Figure 2. Example processed spectra from different scoring functions versus the original spectrum.
The centroid spectrum used has a parent ion mass of formula image Da. In panel (A), the original spectrum is displayed; (B) shows the processed spectrum generated by the filtering protocol of RAId_DbS scoring function; (C) exhibits the processed spectrum generated by the filtering protocol of K-score; while (D) and (E) correspond respectively to the processed spectra produced by XCorr and Hyperscore.
Figure 3. Histograms of correlations between filtering strategies.
Used in this plot are formula image raw centroid spectra from the ISB data set. Each raw spectrum yields four different processed spectra, one from each of the four filtering strategies. The mass fragments of every filtered spectrum are then read onto a mass grid. The spectrum is then viewed as a vector with non-vanishing components only at the populated component/mass indices. One then normalizes each filtered spectrum vector to unit length. An inner product of any two filtered spectral vectors represents the correlation between them. When the spectral quality does not pass a method-dependent threshold, the corresponding filtering protocol may turn the raw spectrum into a null spectrum without further searching the database. For a given pair of filtering methods and a raw spectrum, if each of the two filtering methods produces a nonempty filtered spectrum, one may turn those filtered spectra into spectral vectors and compute their inner product, i.e., their correlation. For each pair of filtering methods, these inner products are accumulated and plotted as a correlation histogram. All six pairwise combinations are shown.
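The correlation described in this caption is the inner product of two unit-length spectral vectors on a shared mass grid. A minimal sketch follows; the 1 Da bin width, the `(m/z, intensity)` peak representation, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def to_unit_vector(peaks, bin_width=1.0):
    """Map (mz, intensity) peaks onto mass-grid indices and normalize
    the resulting sparse vector to unit length.

    Returns an empty dict for a null (empty or all-zero) spectrum."""
    vec = {}
    for mz, intensity in peaks:
        idx = int(round(mz / bin_width))
        vec[idx] = vec.get(idx, 0.0) + intensity
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return {}
    return {i: v / norm for i, v in vec.items()}


def correlation(peaks_a, peaks_b, bin_width=1.0):
    """Inner product of the two normalized spectral vectors, or None
    when either filtered spectrum is null (such pairs are skipped)."""
    a = to_unit_vector(peaks_a, bin_width)
    b = to_unit_vector(peaks_b, bin_width)
    if not a or not b:
        return None
    return sum(v * b[i] for i, v in a.items() if i in b)
```

Two filtered spectra that populate the same grid indices with proportional intensities give a correlation of 1, while spectra with no shared indices give 0.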
Figure 4. Score correlations.
A subset of the ISB centroid data set was used to perform this evaluation. For each scoring function, when the best hit per spectrum (analyzed using the analysis program that the scoring function was originally used for) is a true positive, that candidate peptide is scored again using the corresponding scoring function implemented in RAId_aPS. Each true positive best hit thus gives rise to two scores, which are plotted using the following rule: the first score is used as the ordinate, while the second score (from RAId_aPS) is used as the abscissa. Including formula image spectra, panel A is for the RAId score. Panel B is for Hyperscore and contains formula image spectra. The result of K-score is shown in panel C with formula image spectra. Shown with formula image spectra, panel D documents the results for XCorr.
Figure 5. E-value accuracy assessment.
The agreement between the reported E-value and the textbook definition is examined using centroid data (A1–A4 subsets of ISB data set). The random database size used is 500 MB. The molecular weight range considered while searching the database is formula image. In each panel, the dashed lines, corresponding to formula image and formula image, provide a visual guide to how close the experimental curves are to the theoretical curve.
Figure 6. ROC curves for the centroid data (A1–A4 of the ISB data set [28]).
For each of the four scoring functions considered, a set of ROC curves is shown. These ROC curves include the results from running the designated program associated with that scoring function, the results from running RAId_aPS in the database search mode, and the results from combining with each of the three other scoring functions. Panel (A) shows the results from RAId score, whose designated program is RAId_DbS. Panel (B) displays the results from K-score, whose designated program is X!Tandem. Panel (C) exhibits the results from XCorr, which is mostly employed by SEQUEST. Panel (D) presents the results from Hyperscore, whose designated program is also X!Tandem. Instead of using only XCorr (like RAId_aPS), SEQUEST first selects the top formula image candidates using SP score. As shown in panel (C), for centroid data there is an advantage to filtering candidates with the SP score. However, it is also seen that by combining XCorr with either RAId score or Hyperscore, equally good results can be attained without introducing the SP score heuristics.
Figure 7. ROC curves for the centroid data (A1–A4 of the ISB data set [28]) when considering only the best hit per spectrum.
For each of the four scoring functions considered, a set of ROC curves is shown. These ROC curves consider only the best hit per spectrum from running the designated program associated with that scoring function, the best hit per spectrum from running RAId_aPS in the database search mode, and the best hit per spectrum from combining with each of the three other scoring functions. Panel (A) shows the results from RAId score, whose designated program is RAId_DbS. Panel (B) displays the results from K-score, whose designated program is X!Tandem. Panel (C) exhibits the results from XCorr, which is mostly employed by SEQUEST. Panel (D) presents the results from Hyperscore, whose designated program is also X!Tandem. Instead of using only XCorr (like RAId_aPS), SEQUEST first selects the top formula image candidates using SP score. As shown in panel (C), for centroid data there is an advantage to filtering candidates with the SP score. However, it is also seen that by combining XCorr with either RAId score or Hyperscore, equally good results can be attained without introducing the SP score heuristics.
Figure 8. Illustration of RAId_aPS performance when combining three different scoring functions.
Panel (A) shows the results from the profile data (NHLBI data set [4]), while panel (B) exhibits the results from the centroid data (A1–A4 of the ISB data set [28]). Panel (C) shows the results from the profile data but keeping only the best hit per spectrum, while panel (D) exhibits the results from the centroid data but keeping only the best hit per spectrum.
Figure 9. Example score PDF (normalized histogram) output by RAId_aPS.
An MS² spectrum of parent ion mass formula image Da is queried with default parameters, and the resulting score PDFs for RAId, K-score, XCorr, and Hyperscore are shown respectively in panels A, B, C, and D. The number of APP within ±3 Da of the parent ion mass is about formula image.
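A score PDF of this kind can be turned into a significance value by summing its tail. The sketch below assumes the usual textbook reading of the E-value as the expected number of random hits scoring at least as well as the candidate, computed as the tail probability times the number of database peptides scored for the spectrum; the function name `e_value` and the parameter `n_candidates` are hypothetical, and the exact procedure used by RAId_aPS may differ.

```python
def e_value(score_pdf, score, n_candidates):
    """Expected number of random hits with score >= `score`.

    score_pdf    : dict mapping score -> probability (normalized histogram)
    score        : score of the candidate peptide being assessed
    n_candidates : number of database peptides scored for this spectrum
                   (an assumed input, not taken from the paper)
    """
    # Tail probability: chance that a random peptide scores at least `score`.
    tail = sum(p for s, p in score_pdf.items() if s >= score)
    return n_candidates * tail
```

Because the PDF is built per spectrum, the same raw score can map to different E-values for different spectra, which is what makes the statistics spectrum-specific.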
Figure 10. Example of reanalyzing output files from other search engines by combining with statistical significance assignment from RAId_aPS.
In this example, we use the Mascot output files resulting from querying profile spectra (panel (A), the NHLBI data set) and centroid spectra (panel (B), A1–A4 of the ISB data set [28]) against NCBI's nr database, from which proteins highly homologous to those present in the mixture were removed. Since each data set is from a known mixture of proteins, it is possible to remove the proteins homologous to the true positives from the nr database. We then combine the calibrated E-value of Mascot with the E-value obtained from RAId_aPS when either RAId score, Hyperscore, K-score or XCorr is used.

References

    1. Prakash A, Piening B, Whiteaker J, Zhang H, Shaffer SA, et al. Assessing bias in experiment design for large scale mass spectrometry-based quantitative proteomics. Mol Cell Proteomics. 2007;6:1741–1748.
    2. Taylor CF, Paton NW, Lilley KS, Binz PA, Julian RK, et al. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol. 2007;25:887–893.
    3. Oberg AL, Vitek O. Statistical Design of Quantitative Mass spectrometry-Based Proteomics Experiments. J Proteome Res. 2009;8:2144–2156.
    4. Alves G, Ogurtsov AY, Wu WW, Wang G, Shen RF, et al. Calibrating E-values for MS2 library search methods. Biology Direct. 2007;2:26.
    5. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002;74:5383–5392.