Bioinformatics. 2024 Jun 28;40(Suppl 1):i257-i265.
doi: 10.1093/bioinformatics/btae220.

SpecEncoder: deep metric learning for accurate peptide identification in proteomics

Kaiyuan Liu et al. Bioinformatics. 2024.

Abstract

Motivation: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used to identify peptides from MS/MS spectra, but both may face challenges due to experimental variation between replicate spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification.

Results: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%-2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%-15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%-12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides than deep-learning-enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification for proteomic data analyses.

Availability and implementation: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.
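
The hybrid search described in the abstract can be pictured as a nearest-neighbour lookup over one merged set of embedding vectors. The sketch below is only an illustration under stated assumptions: the function names (`cosine`, `hybrid_search`), the three-dimensional toy vectors, and the peptide labels are all hypothetical, and the vectors are hand-written placeholders rather than outputs of the SpecEncoder model. It is not taken from the SpecEncoder repository.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def hybrid_search(query_vecs, library, database):
    """Match each query vector to its most similar entry in the merged target set.

    `library` and `database` are lists of (peptide, vector) pairs standing in for
    embeddings of experimental and predicted spectra, respectively (assumption).
    """
    mixed = list(library) + list(database)  # the "mixed vector library"
    hits = []
    for q in query_vecs:
        peptide, vec = max(mixed, key=lambda entry: cosine(q, entry[1]))
        hits.append((peptide, cosine(q, vec)))
    return hits

# Toy placeholder embeddings (hypothetical values, not model output).
library = [("PEPTIDEA", [1.0, 0.0, 0.0]), ("PEPTIDEB", [0.0, 1.0, 0.0])]
database = [("PEPTIDEC", [0.0, 0.0, 1.0])]
queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 2.0]]

hits = hybrid_search(queries, library, database)
for peptide, score in hits:
    print(peptide, round(score, 4))
```

Using only the `library` entries or only the `database` entries as the target set corresponds to the plain spectral library search or protein database search, respectively.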

Conflict of interest statement

None declared.

Figures

Figure 1.
The architecture of the SpecEncoder model. The final loss is the sum of all matches.
Figure 2.
The workflow for hybrid searching of MS/MS spectra against the combined peptide spectral library and protein database using SpecEncoder. The spectra in the spectral library are first embedded into the set of library vectors. For the remaining peptides in the target protein database that have no experimental spectra in the spectral library, PredFull is used to predict their spectra, which are then embedded into the set of database vectors. Finally, the library and database vectors are merged into the mixed vector library, against which the query spectra of an input proteomics dataset are searched. The workflow can also be simplified to a spectral library search or a protein database search, where only the library vectors or only the database vectors are used as the target, respectively.
Figure 3.
The cosine similarities between pairs of MS/MS spectra (x-axis) versus the cosine similarities of their derived latent vectors (y-axis), for spectra of the same peptides (left) and spectra of different peptides (right). For spectra of the same peptides, the derived latent vectors are almost identical (cosine similarities close to 1), even when the spectra themselves share low similarity. In contrast, the similarities between latent vectors derived from spectra of different peptides are lower, although they too are raised relative to the similarities of the original spectra.
Figure 4.
Numbers of spectra (PSMs) and unique peptides identified by SpecEncoder and SpectraST on the charge 2+ spectra in three human proteomics datasets at a peptide-level FDR of 0.01.
Figure 5.
Numbers of spectra (PSMs) and unique peptides identified by SpecEncoder and SpectraST on the charge 3+ spectra in three human proteomics datasets at a peptide-level FDR of 0.01.
Figure 6.
Numbers of spectra (PSMs) and unique peptides identified by MSGF+ and SpecEncoder on the charge 2+ spectra in the three human proteomics datasets. A 1% FDR cutoff at the peptide level (calculated on each raw file) is applied.
Figure 7.
Numbers of spectra (PSMs) and unique peptides identified by SpecEncoder and MSGF+ on the charge 3+ spectra in the three human proteomics datasets. A 1% FDR cutoff at the peptide level (calculated on each raw file) is applied.
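
The figure captions above report identifications at a peptide-level FDR of 0.01. The source does not say how that FDR is estimated, so the sketch below assumes a conventional target-decoy estimate (decoys divided by targets among the accepted peptides); that choice, the function name `peptides_at_fdr`, and the toy scores are all assumptions for illustration and may differ from what the authors actually did.

```python
def peptides_at_fdr(psms, fdr_cutoff=0.01):
    """Return target peptides accepted at a peptide-level FDR cutoff.

    `psms` is a list of (peptide, score, is_decoy) tuples; higher scores are better.
    FDR is estimated as decoys / targets (assumed target-decoy strategy).
    """
    # Collapse PSMs to peptides by keeping the best-scoring PSM of each peptide.
    best = {}
    for peptide, score, is_decoy in psms:
        if peptide not in best or score > best[peptide][0]:
            best[peptide] = (score, is_decoy)

    # Rank peptides from best to worst score.
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)

    # Find the longest ranked prefix whose estimated FDR is within the cutoff.
    targets = decoys = accepted = 0
    for rank, (_, (_, is_decoy)) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= fdr_cutoff:
            accepted = rank

    return {pep for pep, (_, is_decoy) in ranked[:accepted] if not is_decoy}

# Toy PSM list with made-up scores (hypothetical data).
psms = [
    ("PEPA", 10.0, False),
    ("PEPA", 4.0, False),   # second, weaker PSM of the same peptide
    ("PEPB", 9.0, False),
    ("PEPC", 8.0, False),
    ("DECOY1", 7.0, True),
    ("PEPD", 6.0, False),
]

strict = peptides_at_fdr(psms, fdr_cutoff=0.01)
loose = peptides_at_fdr(psms, fdr_cutoff=0.5)
print(sorted(strict))
print(sorted(loose))
```

The captions for Figures 6 and 7 note that the cutoff is calculated on each raw file; under this sketch that would mean calling the function once per file rather than on the pooled PSMs.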
