Nat Methods. 2016 Aug;13(8):651-656.
doi: 10.1038/nmeth.3902. Epub 2016 Jun 27.

Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets

Johannes Griss et al. Nat Methods. 2016 Aug.

Abstract

Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at large scale to shed light on these unidentified spectra. The PRoteomics IDEntifications database (PRIDE) Archive is one of the largest public MS proteomics data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, coming from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: 1) incorrectly identified spectra, 2) spectra correctly identified but below the set scoring threshold, and 3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1
Accuracy of the spectra-cluster algorithm compared to the MSCluster and MaRaCluster algorithms. The three algorithms were tested with a test dataset built from 209 human datasets from PRIDE Archive (Online Methods, Supplementary Table 1). Clustering sensitivity (y-axis) was assessed based on the number of clustered spectra (shown as relative to the total number of spectra in the test dataset). Clustering specificity (x-axis) was assessed based on the proportion of spectra that were not identified as the most common peptide in a cluster. Only clusters with at least five spectra were taken into consideration (Supplementary Note 1).
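The two axes described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `clustering_metrics`, the input layout (one list of peptide labels per cluster, with `None` for an unidentified spectrum), and the choice to divide the off-consensus count by the number of identified spectra in the retained clusters are all assumptions made for illustration. Only the minimum cluster size of five comes from the caption.

```python
from collections import Counter

MIN_CLUSTER_SIZE = 5  # from the caption: only clusters with at least five spectra


def clustering_metrics(clusters, total_spectra, min_size=MIN_CLUSTER_SIZE):
    """Return (clustered_fraction, incorrect_fraction).

    clusters: list of clusters, each a list of peptide labels
              (None marks an unidentified spectrum). Hypothetical layout.
    total_spectra: number of spectra in the whole test dataset.
    """
    clustered = 0      # spectra that ended up in a retained cluster
    identified = 0     # identified spectra in retained clusters
    off_consensus = 0  # identified spectra that disagree with the top peptide

    for labels in clusters:
        if len(labels) < min_size:
            continue
        clustered += len(labels)
        ids = [p for p in labels if p is not None]
        if not ids:
            continue
        top_count = Counter(ids).most_common(1)[0][1]
        identified += len(ids)
        off_consensus += len(ids) - top_count

    clustered_fraction = clustered / total_spectra
    # Assumption: the denominator is the identified spectra, not all spectra.
    incorrect_fraction = off_consensus / identified if identified else 0.0
    return clustered_fraction, incorrect_fraction
```

Under these assumptions, a higher `clustered_fraction` corresponds to the y-axis (more spectra clustered) and a lower `incorrect_fraction` to a better position on the x-axis.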
Figure 2
Overview of the results of the analysis to highlight commonly found incorrect peptide identifications. (a) Overall, 424 (11%) human clusters were identified using X!Tandem, SpectraST and PepNovo. (b) The vast majority of identified peptides originated from keratin, albumin, trypsin, and haemoglobin. (c) Albumin peptides were modified more often than peptides from any other protein. In the box plot, the center line marks the median, the edges mark the first and third quartiles, the whiskers extend to +/-1.58 times the inter-quartile range divided by the square root of the number of observations, and single points denote measurements outside this range.
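The whisker extent quoted in the caption can be worked through in a few lines of Python. This is a sketch only: the caption does not state the reference point, so centering the interval on the median is an assumption, as is the quartile interpolation method. The helper name `whisker_limits` is hypothetical.

```python
import math
import statistics


def whisker_limits(values):
    """Return (lower, upper) as median -/+ 1.58 * IQR / sqrt(n).

    Assumptions: the interval is centered on the median, and quartiles
    use the 'inclusive' interpolation of statistics.quantiles.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(n)
    return median - half_width, median + half_width
```

For the nine values 1 to 9, the quartiles are 3 and 7 and the median is 5, so the half-width is 1.58 × 4 / 3 ≈ 2.11.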
Figure 3
Identified spectra from a diverse range of datasets, including spectra from experiments in other species, led to newly identified phosphorylated peptides in the Chromosome-Centric HPP datasets (PXD000529, PXD000533 and PXD000535). Connections between datasets are based on the shared spectra within a cluster, only taking clusters of phosphorylated peptides into consideration.
Figure 4
Overview of the results of the analysis of clusters containing only unidentified spectra. (a) The mass defect analysis showed that ~80% of the unidentified human spectra follow a distribution similar to the background distribution created from all in silico digested tryptic peptides in UniProtKB/SwissProt. The remaining 20% of spectra not included within this distribution may be explained by the fact that only unmodified, fully tryptic peptides were considered for this distribution. (b) 160 (12%) of the large unidentified human clusters were identified using SpectraST, X!Tandem and PepNovo. (c) More than 50% of these identifications were peptides coming from albumin, trypsin, keratin and haemoglobin. (d) Only trypsin peptides were commonly modified (e.g. dimethylated). In the box plot, the center line marks the median, the edges mark the first and third quartiles, the whiskers extend to +/-1.58 times the inter-quartile range divided by the square root of the number of observations, and single points denote measurements outside this range.
Figure 5
Summary of results for the analysis of human clusters containing only unidentified spectra. The vast majority of delta masses observed in the open modification search were between -2 and +4 Da (top left panel). After adjusting the y-axis, it becomes apparent that several other delta masses were observed at high frequency (top right panel). When limiting these delta masses to those observed in at least ten different clusters, the vast majority could be mapped to known PTMs as well as to one potential amino acid substitution (lower panel). For the complete list of delta masses found, see Supplementary Table 4.
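The filtering step in the Figure 5 caption (keep delta masses seen in at least ten different clusters, then map them to known modifications) might look roughly like the Python sketch below. It is illustrative and not the authors' pipeline: the function names, the `(cluster_id, delta_mass)` input layout, the binning by rounding to two decimals, and the 0.01 Da matching tolerance are all assumptions. The short table lists standard monoisotopic mass shifts for a few common modifications and is not the list from Supplementary Table 4.

```python
from collections import defaultdict

# Standard monoisotopic mass shifts (Da) for a few common modifications.
# Illustrative subset only, not the paper's list.
KNOWN_DELTAS = {
    "Deamidation": 0.9840,
    "Methylation": 14.0157,
    "Oxidation": 15.9949,
    "Dimethylation": 28.0313,
    "Acetylation": 42.0106,
    "Phosphorylation": 79.9663,
}


def frequent_delta_masses(observations, min_clusters=10, decimals=2):
    """Count distinct clusters per binned delta mass and keep frequent bins.

    observations: iterable of (cluster_id, delta_mass) pairs (hypothetical layout).
    """
    clusters_per_delta = defaultdict(set)
    for cluster_id, delta in observations:
        clusters_per_delta[round(delta, decimals)].add(cluster_id)
    return {
        delta: len(ids)
        for delta, ids in clusters_per_delta.items()
        if len(ids) >= min_clusters
    }


def annotate(delta, tolerance=0.01):
    """Return names of known modifications within `tolerance` Da of `delta`."""
    return [
        name
        for name, mass in KNOWN_DELTAS.items()
        if abs(mass - delta) <= tolerance
    ]
```

Counting distinct cluster identifiers, rather than raw observations, keeps one large cluster from pushing a delta mass over the threshold on its own.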
