Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets

Affiliations

¹ Division of Immunology, Allergy and Infectious Diseases, Department of Dermatology, Medical University of Vienna, Austria; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
² European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
³ Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville.
⁴ Dept. of Computer Science, University of Tübingen, Germany; Center for Bioinformatics, University of Tübingen, Germany.
⁵ Dept. of Computer Science, University of Tübingen, Germany; Center for Bioinformatics, University of Tübingen, Germany; Quantitative Biology Center, University of Tübingen, Germany; Max Planck Institute for Developmental Biology, Germany.
⁶ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom; National Center for Protein Sciences, Beijing, China.

PMID: 27493588
PMCID: PMC4968634
DOI: 10.1038/nmeth.3902

Johannes Griss et al. Nat Methods. 2016 Aug;13(8):651-656. doi: 10.1038/nmeth.3902. Epub 2016 Jun 27.

Abstract

Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. The PRoteomics IDEntifications database (PRIDE) Archive is one of the largest public MS proteomics data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, coming from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: 1) incorrectly identified spectra, 2) spectra correctly identified but below the set scoring threshold, and 3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

**Figure 1**
Accuracy of the *spectra-cluster* algorithm compared to the MSCluster and MaRaCluster algorithms. The three algorithms were tested with a test dataset built from 209 human datasets from PRIDE Archive (Online Methods, Supplementary Table 1). Clustering sensitivity (y-axis) was assessed based on the number of clustered spectra (shown as relative to the total number of spectra in the test dataset). Clustering specificity (x-axis) was assessed based on the proportion of spectra that were not identified as the most common peptide in a cluster. Only clusters with at least five spectra were taken into consideration (Supplementary Note 1).
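The two axes described in this caption can be illustrated with a minimal sketch. It assumes each cluster is given as a list of peptide labels (one per spectrum) and that the total number of spectra in the test dataset is known; the function name `clustering_metrics` and the input layout are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def clustering_metrics(clusters, total_spectra, min_size=5):
    """Return (relative number of clustered spectra, proportion of incorrectly clustered spectra).

    clusters: list of lists of peptide labels, one label per spectrum (assumed layout).
    Only clusters with at least `min_size` spectra are considered, as in the caption.
    """
    clustered = 0
    incorrect = 0
    for labels in clusters:
        if len(labels) < min_size:
            continue  # small clusters are ignored
        clustered += len(labels)
        # Spectra not identified as the cluster's most common peptide count as incorrect.
        most_common_count = Counter(labels).most_common(1)[0][1]
        incorrect += len(labels) - most_common_count
    relative_clustered = clustered / total_spectra
    incorrect_fraction = incorrect / clustered if clustered else 0.0
    return relative_clustered, incorrect_fraction
```

A lower proportion of incorrectly clustered spectra at the same relative number of clustered spectra would correspond to a more accurate clustering under these assumptions.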

**Figure 2**
Overview of the results of the analysis to highlight commonly found incorrect peptide identifications. **(a)** Overall, 424 (11%) human clusters were identified using X!Tandem, SpectraST and PepNovo. **(b)** The vast majority of identified peptides originated from keratin, albumin, trypsin, and haemoglobin. **(c)** Albumin peptides were modified more often than peptides from any other protein (center line marks the median, edges the first and third quartiles, whiskers extend to ±1.58 times the interquartile range divided by the square root of the number of observations, single points denote measurements outside this range).

**Figure 3**
Identified spectra from a diverse range of datasets, including spectra from experiments in other species, led to newly identified phosphorylated peptides in the Chromosome-Centric HPP datasets (PXD000529, PXD000533 and PXD000535). Connections between datasets are based on the shared spectra within a cluster, only taking clusters of phosphorylated peptides into consideration.

**Figure 4**
Overview of the results of the analysis of clusters containing only unidentified spectra. **(a)** The mass defect analysis showed that ~80% of the unidentified human spectra follow a distribution similar to the background distribution created from all *in silico* digested tryptic peptides in UniProtKB/SwissProt. The remaining 20% of spectra not included within this distribution may be explained by the fact that only unmodified, fully tryptic peptides were considered for this distribution. **(b)** 160 (12%) of the large unidentified human clusters were identified using SpectraST, X!Tandem and PepNovo. **(c)** More than 50% of these identifications were peptides coming from albumin, trypsin, keratin and haemoglobin. **(d)** Only trypsin peptides were commonly modified (e.g. dimethylated; center line marks the median, edges the first and third quartiles, whiskers extend to ±1.58 times the interquartile range divided by the square root of the number of observations, single points denote measurements outside this range).
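As a rough illustration of how a background distribution like the one in panel (a) might be built, the sketch below digests a protein sequence *in silico* with a simple trypsin rule (cleave after K or R unless followed by P, no missed cleavages) and computes the mass defect of each unmodified peptide. The mass defect is taken here as the fractional part of the monoisotopic mass, which is an assumption; the exact definition and digestion settings used in the paper are not stated in this caption. All function names are hypothetical.

```python
import math

# Monoisotopic residue masses in Da (standard values, rounded to five decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_digest(sequence):
    """Fully tryptic digest: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mass_defect(mass):
    """Fractional part of the mass (assumed definition of mass defect)."""
    return mass - math.floor(mass)
```

Mass defects of observed precursor masses could then be compared against the values collected from all digested peptides of a protein database.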

**Figure 5**
Summary of results for the analysis of human clusters containing only unidentified spectra. The vast majority of delta masses observed in the open modification search were between -2 and +4 Da (top left panel). After adjusting the y-axis, it becomes apparent that several other delta masses were observed at high frequency (top right panel). When limiting the analysis to delta masses that were observed in at least ten different clusters, the vast majority of delta masses could be mapped to known PTMs as well as to one potential amino acid substitution (lower panel). For the complete list of delta masses found, see Supplementary Table 4.
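The filtering and mapping step described for the lower panel can be sketched as follows. The sketch assumes one delta mass per cluster, bins delta masses by rounding to two decimals, and matches them against a short illustrative table of monoisotopic modification masses with a 0.01 Da tolerance. The binning, tolerance, table contents, and function names are all assumptions made for illustration, not the settings used in the paper.

```python
from collections import Counter

# Illustrative subset of well-known monoisotopic modification masses in Da.
KNOWN_MODS = {
    "Oxidation": 15.99491,
    "Phospho": 79.96633,
    "Acetyl": 42.01057,
    "Methyl": 14.01565,
    "Carbamidomethyl": 57.02146,
}


def frequent_delta_masses(cluster_deltas, min_clusters=10, decimals=2):
    """Count binned delta masses (one per cluster) and keep those seen in >= min_clusters."""
    counts = Counter(round(delta, decimals) for delta in cluster_deltas)
    return {mass: n for mass, n in counts.items() if n >= min_clusters}


def annotate(delta, tolerance=0.01):
    """Return the name of the first known modification within the tolerance, else None."""
    for name, mass in KNOWN_MODS.items():
        if abs(delta - mass) <= tolerance:
            return name
    return None
```

Delta masses that pass the frequency filter but return `None` from `annotate` would be the candidates for unexplained modifications or amino acid substitutions in this simplified view.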
