Nat Methods. 2022 Jun;19(6):675-678.
doi: 10.1038/s41592-022-01496-1. Epub 2022 May 30.

A learned embedding for efficient joint analysis of millions of mass spectra

Wout Bittremieux et al. Nat Methods. 2022 Jun.

Abstract

Computational methods that aim to exploit publicly available mass spectrometry repositories rely primarily on unsupervised clustering of spectra. Here we trained a deep neural network in a supervised fashion on the basis of previous assignments of peptides to spectra. The network, called 'GLEAMS', learns to embed spectra in a low-dimensional space in which spectra generated by the same peptide are close to one another. We applied GLEAMS for large-scale spectrum clustering, detecting groups of unidentified, proximal spectra representing the same peptide. We used these clusters to explore the dark proteome of repeatedly observed yet consistently unidentified mass spectra.

Figures

Extended Data Fig. 1. GLEAMS embedder network.
Each instance of the embedder network in the Siamese neural network separately receives each of three feature types as input. Precursor features are processed through a fully-connected network with two layers of sizes 32 and 5. Binned fragment intensities are processed through five blocks of one-dimensional convolutional layers and max pooling layers. Reference spectra features are processed through a fully-connected network with two layers of sizes 750 and 250. The output of the three subnetworks is concatenated and passed to a final fully-connected layer of size 32.
Extended Data Fig. 2. UMAP visualization of embeddings, colored by precursor charge.
UMAP projection of 685,337 embeddings from frequently occurring peptides in 10 million randomly selected identified spectra. Note that the visualization may group peptides with similarities on some dimensions of the 32-dimensional embedding space, but which are nevertheless distinguishable based on their full embeddings.
Extended Data Fig. 3. False negative rate between positive and negative embedding pairs.
At a distance threshold of 0.5455 (grey line), which corresponds to a 1% false discovery rate, the false negative rate over 10 million randomly selected positive and negative embedding pairs from the test dataset is 1%.
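The two rates in this caption can be illustrated with a minimal sketch. It assumes that a pair is "accepted" as same-peptide when its embedding distance is at or below the threshold; the function names and the toy distances are hypothetical and are not taken from the GLEAMS code.

```python
def false_negative_rate(same_peptide_distances, threshold):
    """Fraction of same-peptide pairs that fall beyond the threshold."""
    missed = sum(1 for d in same_peptide_distances if d > threshold)
    return missed / len(same_peptide_distances)


def false_discovery_rate(same_peptide_distances, diff_peptide_distances, threshold):
    """Fraction of accepted pairs (distance <= threshold) that are different-peptide."""
    true_pos = sum(1 for d in same_peptide_distances if d <= threshold)
    false_pos = sum(1 for d in diff_peptide_distances if d <= threshold)
    accepted = true_pos + false_pos
    return false_pos / accepted if accepted else 0.0


# Toy distances, for illustration only.
same = [0.1, 0.2, 0.6, 0.3]
diff = [0.4, 0.9, 1.2]
fnr = false_negative_rate(same, 0.5455)
fdr = false_discovery_rate(same, diff, 0.5455)
```

Lowering the threshold trades a lower false discovery rate for a higher false negative rate, which is why the caption reports both at one operating point.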
Extended Data Fig. 4. ROC curve for GLEAMS performance on unseen phosphorylated spectra.
Receiver operating characteristic (ROC) curve for GLEAMS embeddings corresponding to 7.5 million randomly selected spectrum pairs from an independent phosphoproteomics study. The ROC curve and area under the curve (AUC) show how often a same-peptide spectrum pair had a smaller distance than a different-peptide spectrum pair.
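The AUC interpretation given here (how often a same-peptide pair has a smaller distance than a different-peptide pair) can be computed directly by comparing every same-peptide distance with every different-peptide distance. The sketch below is a brute-force illustration on toy data; counting ties as half a win is a common convention and an assumption here, and the function name is hypothetical.

```python
def pairwise_auc(same_peptide_distances, diff_peptide_distances):
    """Probability that a same-peptide pair is closer than a different-peptide pair."""
    wins = 0.0
    for s in same_peptide_distances:
        for d in diff_peptide_distances:
            if s < d:
                wins += 1.0
            elif s == d:
                wins += 0.5  # assumption: ties count as half
    return wins / (len(same_peptide_distances) * len(diff_peptide_distances))


# Toy example: three of the four comparisons favor the same-peptide pair.
auc = pairwise_auc([0.1, 0.2], [0.5, 0.15])
```

This quadratic comparison is only practical for small samples; for millions of pairs a rank-based computation would be the usual choice.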
Extended Data Fig. 5. Clustering result characteristics produced by different tools.
Clustering result characteristics at approximately 1% incorrectly clustered spectra over three random folds of the test dataset. (A) Complementary empirical cumulative distribution of the cluster sizes. (B) The number of datasets per cluster from which the clustered spectra originate (24 datasets in total).
Extended Data Fig. 6. GLEAMS performance with different clustering algorithms.
Average clustering performance over three random folds of the test dataset containing 28 million MS/MS spectra each. The GLEAMS embeddings were clustered using hierarchical clustering with complete linkage, single linkage, or average linkage; or using DBSCAN. The performance of alternative spectrum clustering tools (Figure 1d-e) is shown in gray for reference. (A) The number of clustered spectra versus the number of incorrectly clustered spectra per clustering algorithm. (B) Cluster completeness versus the number of incorrectly clustered spectra per clustering algorithm.
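To make the linkage terminology concrete, the following is a naive sketch of agglomerative clustering with complete linkage and a distance cutoff. It is a small-scale illustration and not the GLEAMS implementation; the function name, the threshold value, and the toy points are assumptions.

```python
import math
from itertools import combinations


def complete_linkage(points, threshold):
    """Merge clusters until the closest pair exceeds the threshold.

    Complete linkage measures the distance between two clusters as the
    largest distance between any of their members.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        (x, y), dist = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Toy 2-D "embeddings": two well-separated groups.
points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0.2, 0)]
clusters = complete_linkage(points, threshold=0.5)
```

Replacing `max` with `min` in `linkage` gives single linkage, and replacing it with the mean gives average linkage. Complete linkage bounds the distance between any two members of a cluster by the threshold, which limits the chaining of unrelated points that single linkage permits.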
Extended Data Fig. 7. Runtime scalability of spectrum clustering tools.
Scalability of spectrum clustering tools when processing increasingly large data volumes. Three random subsets of the test dataset were combined to form input datasets consisting of 28 million, 56 million, and 84 million spectra. Evaluations of falcon and MS-Cluster on larger datasets were excluded due to excessive runtimes.
Extended Data Fig. 8. UMAP visualization of the selected reference spectra.
The two-dimensional UMAP visualization was computed from the pairwise dot-product similarity matrix between all 200,000 randomly selected spectra from the training data.
Extended Data Fig. 9. Input features ablation test.
Ablation testing during training of the GLEAMS Siamese network shows the benefit of the different input feature types. The performance is measured using the validation loss while training for 20 iterations consisting of 40,000 steps with batch size 256. The line indicates the smoothed average validation loss over five consecutive iterations, with the markers showing the individual validation losses at the end of each iteration.
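The smoothed line described in this caption is an average over five consecutive iterations. A minimal sketch of such smoothing follows, assuming a trailing window that shrinks at the start of training; the caption does not say whether the window is trailing or centered, so that choice and the function name are assumptions.

```python
def moving_average(values, window=5):
    """Trailing moving average; early entries use the values available so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy validation losses, one per iteration (illustrative values only).
losses = [1, 2, 3, 4, 5, 6]
smoothed = moving_average(losses, window=5)
```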
Figure 1
GLEAMS deep neural network architecture and embedding performance. a. Two spectra, S1 and S2, are encoded to vectors and passed as input to two instances of the embedder network with tied weights. The Euclidean distance between the two resulting embeddings, GW(S1) and GW(S2), is passed to a contrastive loss function that penalizes dissimilar embeddings that correspond to the same peptide and similar embeddings that correspond to different peptides, up to a margin of 1. b. UMAP projection of 685,337 embeddings from frequently occurring peptides in 10 million randomly selected identified spectra from the test dataset. c. Proportion of neighbors that have the same peptide label as a function of the distance threshold for 186,865,330 pairwise distances between 10 million randomly selected embeddings from the test dataset. Embeddings at small distances represent the same peptide (“Original”), while the majority of close neighbors with different peptide labels correspond to peptides with ambiguously localized modifications (“Unmodified”). d-e. Average clustering performance over three random folds of the test dataset containing 28 million MS/MS spectra each. d. The number of clustered spectra versus the number of incorrectly clustered spectra per clustering algorithm. e. Cluster completeness versus the number of incorrectly clustered spectra per clustering algorithm.
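The contrastive loss in panel a can be sketched for a single pair of embeddings. The form below follows the commonly used formulation (squared distance for same-peptide pairs, squared hinge on the margin for different-peptide pairs); the exact scaling used in GLEAMS is not given in this caption, so the squared terms and the function names are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb1, emb2, same_peptide, margin=1.0):
    """Loss for one pair of embeddings, e.g. GW(S1) and GW(S2).

    Same-peptide pairs are penalized for being far apart.
    Different-peptide pairs are penalized only while closer than the margin.
    """
    d = euclidean(emb1, emb2)
    if same_peptide:
        return d ** 2
    return max(0.0, margin - d) ** 2


# A different-peptide pair at distance 0.6 is still inside the margin of 1.
loss_close_negative = contrastive_loss([0.0, 0.0], [0.6, 0.0], same_peptide=False)
# A different-peptide pair at distance 5 contributes nothing.
loss_far_negative = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_peptide=False)
```

Because different-peptide pairs stop contributing once they are farther apart than the margin, training effort concentrates on the pairs that are still confusable.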
Figure 2
Exploration of the dark proteome using GLEAMS to process previously unidentified spectra. a. GLEAMS identified 71% additional PSMs (blue) compared to the original MassIVE-KB results (dark pink) by performing targeted open modification searching of cluster medoid spectra and propagating peptide labels within clusters. Several high-quality clustered, yet unidentified spectra (yellow) remain to further explore the dark proteome. b. Precursor delta masses observed from open modification searching. Some of the most frequent delta masses are annotated with their likely modifications, sourced from Unimod. See Supplementary Table 2 for details of the top ~500 observed precursor mass differences.
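The medoid search and label propagation described in panel a can be illustrated as follows. This sketch assumes the medoid is the cluster member with the smallest total embedding distance to the other members, and that a peptide identified for the medoid is copied to every spectrum in its cluster. All function names, identifiers, and the example peptide string are hypothetical.

```python
import math


def medoid_index(embeddings):
    """Index of the member with the smallest summed distance to all members."""
    return min(
        range(len(embeddings)),
        key=lambda i: sum(math.dist(embeddings[i], e) for e in embeddings),
    )


def propagate_labels(clusters, medoid_labels):
    """Copy each identified medoid's peptide label to all spectra in its cluster.

    clusters: dict mapping cluster id -> list of spectrum ids
    medoid_labels: dict mapping cluster id -> peptide label (absent if unidentified)
    """
    labels = {}
    for cluster_id, members in clusters.items():
        peptide = medoid_labels.get(cluster_id)
        if peptide is not None:
            for spectrum_id in members:
                labels[spectrum_id] = peptide
    return labels


# Toy data: cluster "c1" has an identified medoid, cluster "c2" does not.
clusters = {"c1": ["s1", "s2"], "c2": ["s3"]}
labels = propagate_labels(clusters, {"c1": "PEPTIDEK"})
```

Searching only the medoid keeps the number of database searches proportional to the number of clusters instead of the number of spectra.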
