BMC Bioinformatics. 2007 Feb 9:8:51.
doi: 10.1186/1471-2105-8-51.

msmsEval: tandem mass spectral quality assignment for high-throughput proteomics

Jason W H Wong et al. BMC Bioinformatics.

Abstract

Background: In proteomics experiments, database-search programs are the method of choice for protein identification from tandem mass spectra. As amino acid sequence databases grow, however, the computing resources required by these programs have become prohibitive, particularly in searches for modified proteins. Recently, different research groups have proposed methods that limit the number of spectra to be searched based on spectral quality, but rankings of spectral quality have thus far been based on arbitrary cut-off values. In this work, we develop a more readily interpretable spectral quality statistic by providing probability values for the likelihood that spectra will be identifiable.

Results: We describe an application, msmsEval, that builds on previous work by statistically modeling the spectral quality discriminant function using a Gaussian mixture model. This allows a researcher to filter spectra based on the probability that a spectrum will ultimately be identified by database searching. We show that spectra that are predicted by msmsEval to be of high quality, yet remain unidentified in standard database searches, are candidates for more intensive search strategies. Using a well-studied public dataset, we also show that a high proportion (83.9%) of the spectra that msmsEval predicts to be of high quality but that elude standard search strategies are in fact interpretable.
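As a rough illustration of how a two-component Gaussian mixture over a discriminant score can yield a probability of identifiability, the sketch below applies Bayes' rule to one Gaussian for identifiable spectra and one for unidentifiable spectra. The function names, the prior, and the (mean, standard deviation) values are hypothetical placeholders chosen for the example; they are not taken from msmsEval.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_identifiable(score, prior_pos, pos, neg):
    """Return p(+|D) for a discriminant score D under a two-Gaussian mixture.

    prior_pos is the assumed fraction of identifiable spectra; pos and neg
    are (mean, sd) tuples for the identifiable and unidentifiable components.
    All of these values are illustrative assumptions.
    """
    weighted_pos = prior_pos * normal_pdf(score, *pos)
    weighted_neg = (1.0 - prior_pos) * normal_pdf(score, *neg)
    total = weighted_pos + weighted_neg
    if total == 0.0:
        # Both densities underflowed; the score is far outside either component.
        raise ValueError("score is too extreme for the given components")
    return weighted_pos / total


# Hypothetical components: identifiable spectra centred at +1, unidentifiable at -1.
p = posterior_identifiable(3.0, prior_pos=0.5, pos=(1.0, 1.0), neg=(-1.0, 1.0))
keep_for_search = p > 0.9  # filter on probability rather than on a raw score cut-off
```

Filtering on the posterior rather than on the raw score is what makes the threshold interpretable: a cut-off of 0.9 is read directly as an estimated 90% chance that the spectrum is identifiable.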

Conclusion: msmsEval will be useful for high-throughput proteomics projects and is freely available for download from http://proteomics.ucd.ie/msmseval. It supports Windows, Mac OS X and Linux/Unix operating systems.


Figures

Figure 1
Modeling identifiable mass spectra using discriminant scoring of spectral features. The distributions of identifiable and unidentifiable spectra in the UCD test dataset were plotted. Spectra were counted in discriminant-score bins of width 0.25. The solid lines show the actual distributions of spectra, while the dotted lines indicate the estimated Gaussian distributions used to model each distribution.
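The binning and Gaussian estimation described in this caption could be reproduced along the following lines. This is only a sketch: the helper names are hypothetical, and fitting each Gaussian from the sample mean and standard deviation of labelled scores is an assumption made for illustration, not necessarily the estimation procedure used by msmsEval.

```python
import math
from collections import Counter

BIN_WIDTH = 0.25  # bin width stated in the caption


def bin_counts(scores, width=BIN_WIDTH):
    """Count scores per bin; keys are the lower edges of the occupied bins."""
    counts = Counter(math.floor(s / width) for s in scores)
    return {index * width: n for index, n in sorted(counts.items())}


def fit_gaussian(scores):
    """Estimate (mean, sd) of one class from its scores (assumed moment fit)."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance)


# Made-up discriminant scores for one class of spectra.
example_scores = [0.1, 0.2, 0.3, -0.1]
histogram = bin_counts(example_scores)
mean, sd = fit_gaussian(example_scores)
```

Running the two helpers separately on the identifiable and the unidentifiable spectra gives the pair of histograms (solid lines) and the pair of fitted curves (dotted lines) that the figure compares.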
Figure 2
Removal of unidentifiable spectra by msmsEval. The predicted fractions of spectra removed for identifiable (◆) and unidentifiable (x) spectra were plotted against the observed fractions for 10 runs of the UCD test dataset (A) and 22 runs of the ISB test dataset (B). The estimated fraction of spectra removed is calculated by taking the respective percentiles from the identifiable spectra Gaussian distributions. The thin dashed diagonal line shows the trend expected for the removal of identifiable spectra if the estimated values matched the observed values perfectly. Error bars are one standard deviation from the average of the respective test datasets. Receiver operating characteristic (ROC) curves showing the fraction of identifiable spectra removed versus the fraction of unidentifiable spectra removed for the UCD test dataset (solid line) and the ISB dataset (dashed line) are also shown (C).
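One way to read the "percentiles" in this caption is as values of the Gaussian cumulative distribution function at a score threshold: everything scoring below the threshold is removed. The sketch below computes that predicted fraction, the corresponding observed fraction, and the points of a curve like the one in panel C. The function names and the (mean, sd) parameters are hypothetical, and the assumption that spectra below the threshold are the ones removed is ours.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def predicted_fraction_removed(threshold, mean, sd):
    """Fraction of a Gaussian-modelled class expected to score below threshold."""
    return normal_cdf(threshold, mean, sd)


def observed_fraction_removed(scores, threshold):
    """Fraction of actual scores that fall below threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


def roc_points(thresholds, pos, neg):
    """(identifiable removed, unidentifiable removed) for each threshold.

    pos and neg are (mean, sd) tuples for the two assumed Gaussian components.
    """
    return [(normal_cdf(t, *pos), normal_cdf(t, *neg)) for t in thresholds]


# Hypothetical components and thresholds, for illustration only.
curve = roc_points([-2.0, -1.0, 0.0, 1.0, 2.0], pos=(1.0, 1.0), neg=(-1.0, 1.0))
```

Plotting predicted against observed fractions for each run gives panels A and B; points close to the diagonal indicate that the fitted Gaussians describe the removal behaviour well.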
Figure 3
msmsEval highlights strong candidates for modified peptide spectra. The observed p(+|D) versus predicted p(+|D) values for 22 runs of the ISB dataset (A) were plotted using binned sets of 100 spectra (i.e. the fraction of the 100 spectra that were observed to be identifiable versus the mean p(+|D)). Observed p(+|D) values calculated using SEQUEST identifications (x), SEQUEST and MSAlignment/InsPecT identifications (), and SEQUEST and MSAlignment/InsPecT identifications as well as the additional assignments described in the text (o), are indicated. A pie chart (B) shows the absolute numbers and percentages of spectra from the ISB dataset with predicted p(+|D) > 0.9 that were identified by SEQUEST/MSAlignment/InsPecT, those that were additionally identified by msmsEval, and those that remain unidentified. In total, 83.9% of spectra with p(+|D) > 0.9 were identified.
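The comparison of observed and predicted p(+|D) in panel A amounts to a calibration check. A minimal sketch follows, assuming that spectra are sorted by predicted probability before being grouped into sets of 100 and that an incomplete final set is dropped; neither detail is stated in the caption, and the function name is hypothetical.

```python
def calibration_bins(predicted, identified, bin_size=100):
    """Return (mean predicted p(+|D), observed identified fraction) per bin.

    predicted  -- predicted probabilities, one per spectrum
    identified -- booleans, True if the spectrum was actually identified
    Spectra are sorted by predicted probability (an assumption) and split
    into consecutive bins of bin_size; a trailing partial bin is discarded.
    """
    pairs = sorted(zip(predicted, identified), key=lambda pair: pair[0])
    points = []
    for start in range(0, len(pairs) - bin_size + 1, bin_size):
        chunk = pairs[start:start + bin_size]
        mean_predicted = sum(p for p, _ in chunk) / bin_size
        observed = sum(1 for _, hit in chunk if hit) / bin_size
        points.append((mean_predicted, observed))
    return points


# Tiny made-up example with bins of 2 instead of 100.
points = calibration_bins(
    [0.1, 0.9, 0.2, 0.8], [False, True, False, True], bin_size=2
)
```

Bins whose observed fraction falls well below the mean predicted value are the interesting ones here: they contain spectra scored as high quality that the search did not identify, which the caption describes as strong candidates for modified peptide spectra.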

