J Proteome Res. 2008 Aug;7(8):3102-13.
doi: 10.1021/pr700798h. Epub 2008 Jun 18.

Enhancing peptide identification confidence by combining search methods

Gelio Alves et al. J Proteome Res. 2008 Aug.

Abstract

Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request.
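The counts of 21 pairwise and 35 three-way combinations follow directly from choosing 2 or 3 of the seven search methods. A minimal sketch that enumerates them (the list name `METHODS` and the variable names are illustrative and not taken from the paper):

```python
from itertools import combinations

# The seven database search methods named in the abstract.
METHODS = ["SEQUEST", "ProbID", "InsPecT", "Mascot", "X! Tandem", "OMSSA", "RAId_DbS"]

# All unordered pairs and triplets of distinct methods.
pairs = list(combinations(METHODS, 2))
triplets = list(combinations(METHODS, 3))

print(len(pairs))     # 21
print(len(triplets))  # 35
```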

Figures

Figure 1
The statistical calibrations for centroid data. (A) The profile E-value (using calibration formulas (4)) versus the average cumulative number of false positives when tested using the centroid data (subsets A5–A8 of ref (19)). (B) The statistical calibration results after (manually) removing highly homologous peptides from the hit list and after applying the method-specific factor a_method (see text). (C and D) The calibrated formula, along with the method-specific factor found from the first calibration, applied to subsets A1–A4 and A9–A12 of ref (19). The calibration done using subsets A5–A8, when applied to the other centroid data subsets, provides realistic statistics, supporting the universality of statistical calibration. Note that the lowest E-value in these calibration plots can only reach roughly one over the total number of spectra used for calibration. Since about 10 000 spectra were used, the lowest E-value that can be shown is of order 10^-4. In real database searches, a truly significant hit may well have an E-value much smaller than 10^-4, and many users may not wish to consider hits with E-values larger than 10^-1.
Figure 2
ROC curves for the seven database search methods tested when using the centroid data (subsets A1–A4 of the ISB data set). Each search method is abbreviated by its first letter in the figure legend. ROC curves of the first type are displayed in panel A, while ROC curves of the second type are displayed in panel B. Since the total number of spectra in this subset is about 7000, the highest number of false positives displayed in panel B, 600, corresponds approximately to an E-value of 0.1. ROC curves of the second type are not shown at larger FP values because users are unlikely to be interested in the large E-value regime.
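The correspondence between a false-positive count and an E-value in these captions can be checked with simple arithmetic. The sketch below assumes that an E-value cutoff E applied to N spectra yields roughly N × E false positives, so the cutoff is approximately FP / N. The function name is hypothetical, and the spectrum counts are the rounded figures quoted in the captions.

```python
def approx_evalue_cutoff(num_false_positives: float, num_spectra: float) -> float:
    """Approximate E-value cutoff at which about `num_false_positives`
    false hits are expected among `num_spectra` spectra (assumes FP ~ N * E)."""
    if num_spectra <= 0:
        raise ValueError("num_spectra must be positive")
    return num_false_positives / num_spectra

# Centroid data: 600 false positives among about 7000 spectra.
print(round(approx_evalue_cutoff(600, 7000), 3))   # 0.086, i.e. roughly 0.1

# Profile data: 3000 false positives among about 7000 spectra.
print(round(approx_evalue_cutoff(3000, 7000), 3))  # 0.429, i.e. roughly 0.4

# Calibration floor: one false positive among about 10 000 spectra.
print(approx_evalue_cutoff(1, 10_000))             # 0.0001
```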
Figure 3
Final P-value from combining a reported P-value P2 with a fixed P-value P1. The fixed P-value P1 is chosen to be 1, 10^-1, 10^-2, 10^-3 or 10^-4. Although the relationship between P2 and the final P-value remains reasonably linear in the log-log plot, the slope deviates from 1.
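The paper's own combination formula is not reproduced on this page. As a hedged illustration only, the sketch below uses a standard result for independent P-values that are uniform under the null: the probability that the product of n such P-values is at most τ equals τ Σ_{k=0}^{n-1} (−ln τ)^k / k!. The function name `combine_pvalues` is an assumption for this example, and the paper's method may differ in detail (for instance by weighting methods).

```python
import math

def combine_pvalues(pvalues):
    """Combined P-value for the product of independent, null-uniform P-values.

    Uses P(prod <= tau) = tau * sum_{k=0}^{n-1} (-ln tau)^k / k!.
    Illustrative sketch, not necessarily the formula used in the paper.
    """
    tau = math.prod(pvalues)
    if tau <= 0.0:
        return 0.0
    neg_log_tau = -math.log(tau)
    total, term = 0.0, 1.0
    for k in range(len(pvalues)):
        if k > 0:
            term *= neg_log_tau / k  # builds (-ln tau)^k / k! incrementally
        total += term
    return min(1.0, tau * total)

# Fix P1 and vary P2: the local log-log slope stays below 1.
p1 = 1e-2
for p2 in (1e-1, 1e-2, 1e-3):
    lo = combine_pvalues([p1, p2])
    hi = combine_pvalues([p1, p2 * 10])
    slope = (math.log10(hi) - math.log10(lo)) / 1.0  # one decade step in P2
    print(f"P2={p2:g}  combined={lo:.3e}  slope={slope:.3f}")
```

Under this assumption, combining with P1 = 1 does not return P2 unchanged: the result is P2(1 − ln P2), which is one way a slope different from 1 can arise in a log-log plot.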
Figure 4
ROC curves for the seven pairwise combinations giving rise to the seven largest AUC values (shown in panel A) when using the centroid data (A1–A4 subsets of the ISB data). Each search method is abbreviated by its first letter in the figure legend. ROC curves of the first type are displayed in panel A. Panel B shows ROC curves of the second type. Since the total number of spectra in this subset is about 7000, the highest number of false positives displayed in panel B, 600, corresponds approximately to an E-value of 0.1. ROC curves of the second type are not shown at larger FP values because users are unlikely to be interested in the large E-value regime.
Figure 5
ROC curves for the seven triplets giving rise to the seven largest AUC values (shown in panel A) when using the centroid data (A1–A4 subsets of the ISB data). Each search method is abbreviated by its first letter in the figure legend. ROC curves of the first type are displayed in panel A. Panel B shows ROC curves of the second type. Since the total number of spectra in this subset is about 7000, the highest number of false positives displayed in panel B, 600, corresponds approximately to an E-value of 0.1. ROC curves of the second type are not shown at larger FP values because users are unlikely to be interested in the large E-value regime.
Figure 6
Method correlations evaluated using the centroid data (A1–A4 subsets of the ISB data). Each search method is abbreviated by its first letter in the figure legend. The panels in the first row display the RC ratio, CTP(E ≤ E_c)/CFP(E ≤ E_c), described in eq 11, as a function of the cutoff E-value E_c. The panels in the second row display the likelihood of mistaking a common false hit for a significant hit (see eq 12).
Figure 7
Examination of the combined E-value when using the centroid data (A1–A4 subsets of the ISB data). In every panel, the average cumulative number of false hits is plotted against the combined E-value. Within the E-value range investigated, the final combined E-value is mostly within a factor of 5 of the theoretical value, represented by the y = x lines. As before, each method is represented by its first letter in the figure legend.
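The "within a factor of 5" criterion can be stated as a ratio test between the observed average number of false hits and the value predicted by the y = x line. A minimal sketch, in which the function name and the sample numbers are purely illustrative and are not data from the study:

```python
def within_factor(observed: float, expected: float, factor: float = 5.0) -> bool:
    """True if observed/expected lies between 1/factor and factor."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must be positive")
    ratio = observed / expected
    return 1.0 / factor <= ratio <= factor

# Hypothetical points: (combined E-value, average cumulative false hits).
print(within_factor(observed=0.03, expected=0.01))  # True, ratio is about 3
print(within_factor(observed=0.08, expected=0.01))  # False, ratio is about 8
```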
Figure 8
ROC curves for the seven database search methods tested when using the profile data. Each search method is abbreviated by its first letter in the figure legend. ROC curves of the first type are displayed in panel A, while ROC curves of the second type are displayed in panel B. Since the total number of spectra in this subset is about 7000, the highest number of false positives displayed in panel B, 3000, corresponds approximately to an E-value of 0.4. ROC curves of the second type are not shown at larger FP values because users are unlikely to be interested in the large E-value regime.
Figure 9
ROC curves for the seven pairwise combinations giving rise to the seven largest AUC values (shown in panel A) when using the profile data. Each search method is abbreviated by its first letter in the figure legend. ROC curves of the first type are displayed in panel A. Panel B shows ROC curves of the second type for the same data. Since the total number of spectra in this subset is about 7000, the highest number of false positives displayed in panel B, 3000, corresponds approximately to an E-value of 0.4. ROC curves of the second type are not shown at larger FP values because users are unlikely to be interested in the large E-value regime.
Figure 10
ROC curves for the seven triplets giving rise to the seven largest AUC values (shown in panel A) when using the profile data. Each search method is abbreviated by its first letter in the figure legend. ROC curves of the first type are displayed in panel A. Panel B shows ROC curves of the second type. Since the total number of spectra in this subset is about 7000, the highest number of false positives displayed in panel B, 3000, corresponds approximately to an E-value of 0.4. ROC curves of the second type are not shown at larger FP values because users are unlikely to be interested in the large E-value regime.
Figure 11
Method correlations evaluated using the profile data. Each search method is abbreviated by its first letter in the figure legend. The panels in the first row display the RC ratio, CTP(E ≤ E_c)/CFP(E ≤ E_c), described in eq 11, as a function of the cutoff E-value E_c. The panels in the second row display the likelihood of mistaking a common false hit for a significant hit (see eq 12).
Figure 12
Examination of the combined E-value when using the profile data. In every panel, the average cumulative number of false hits is plotted against the combined E-value. Within the E-value range investigated, the final combined E-value is mostly within a factor of 5 of the theoretical value, represented by y = x lines. As before, each method is represented by its first letter in the figure legend.

References

    1. Kapp E. A.; Schütz F.; Connolly L. M.; Chakel J. A.; Meza J. E.; Miller C. A.; Fenyo D.; Eng J. K.; Adkins J. N.; Omenn G. S.; Simpson R. J. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 2005, 5, 3475–3490.
    2. Boutilier K.; Ross M.; Podtelejnikov A. V.; Orsi C.; Taylor R.; Taylor P.; Figeys D. Comparison of different search engines using validated MS/MS test datasets. Anal. Chim. Acta 2005, 534, 11–20.
    3. Keller A.; Nesvizhskii A. I.; Kolker E.; Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392.
    4. Alves G.; Ogurtsov A. Y.; Wu W. W.; Wang G.; Shen R.-F.; Yu Y.-K. Calibrating E-values for MS2 library search methods. Biol. Direct 2007, 2, 26.
    5. Benjamini Y.; Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 1995, B57, 289–300.
