Bioinformatics. 2024 Jun 28;40(Suppl 1):i257-i265.

doi: 10.1093/bioinformatics/btae220.

SpecEncoder: deep metric learning for accurate peptide identification in proteomics

Kaiyuan Liu¹, Chenghua Tao¹, Yuzhen Ye¹, Haixu Tang¹

PMID: 38940141
PMCID: PMC11211836
DOI: 10.1093/bioinformatics/btae220

Kaiyuan Liu et al. Bioinformatics. 2024.

Affiliation

¹ Department of Computer Science, Luddy School of Informatics, Computing and Engineering, Indiana University, IN 47408, United States.

Abstract

Motivation: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra, but both may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. We present SpecEncoder, a deep metric learning approach that addresses these challenges by transforming MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification.

Results: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%-2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%-15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identifies 6%-12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder also identifies more peptides than deep-learning-enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification for proteomic data analyses.

Availability and implementation: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
The architecture of the SpecEncoder model. The final loss is the sum of all matches.

**Figure 2.**
The workflow for the hybrid searching of MS/MS spectra against the combined peptide spectral library and the protein database by using SpecEncoder. The spectra in the spectral library are first embedded into the set of *library vectors*; for the remaining peptides in the target protein database that have no known experimental spectra in the spectral library, PredFull is used to predict their spectra, which are subsequently embedded into the set of *database vectors*. Finally, the library and database vectors are merged into the *mixed vector library*, against which the query spectra in an input proteomics dataset are searched. The workflow can also be simplified for spectral library search or protein database search, where only the library vectors or only the database vectors are used as the target, respectively.
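
The merging and searching steps in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_mixed_library`, `search`), the toy three-dimensional vectors, and the use of a brute-force cosine-similarity scan as the matching score are all assumptions made for illustration. In practice the vectors would come from the trained SpecEncoder model and from embedding PredFull-predicted spectra.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_mixed_library(library_vectors, database_vectors):
    """Merge library and database vectors, keyed by peptide (hypothetical helper).

    A peptide with an experimental (library) embedding keeps it; a predicted
    (database) embedding is used only for peptides absent from the library.
    """
    mixed = {pep: ("predicted", vec) for pep, vec in database_vectors.items()}
    mixed.update({pep: ("library", vec) for pep, vec in library_vectors.items()})
    return mixed


def search(query_vector, mixed_library):
    """Return (peptide, source, score) for the best-scoring entry."""
    best = None
    for peptide, (source, vec) in mixed_library.items():
        score = cosine(query_vector, vec)
        if best is None or score > best[2]:
            best = (peptide, source, score)
    return best


# Toy embeddings standing in for model output (assumed values).
library_vectors = {"PEPTIDEK": [0.9, 0.1, 0.0]}
database_vectors = {"PEPTIDEK": [0.5, 0.5, 0.0], "SAMPLER": [0.0, 0.2, 0.9]}

mixed = build_mixed_library(library_vectors, database_vectors)
print(search([0.0, 0.1, 1.0], mixed))
```

Passing only `library_vectors` or only `database_vectors` to `search` corresponds to the simplified spectral library search and protein database search described in the caption.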

**Figure 3.**
The cosine similarities between pairs of MS/MS spectra (x-axis) versus the cosine similarities of their derived latent vectors (y-axis), for the spectra of the same peptides (left) and those of different peptides (right), respectively. For the spectra of the same peptides, the derived latent vectors are almost identical (with cosine similarities close to 1), even when the spectra themselves share low similarity. In contrast, the similarities between the latent vectors derived from the spectra of different peptides are relatively lower, although they are also shifted upward relative to the similarities of the original spectra.
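
For readers unfamiliar with the x-axis quantity, the following sketch shows one common way to compute a cosine similarity between two MS/MS spectra. The binning scheme (fixed 1.0 m/z bins with summed intensities) and the peak values are assumptions for illustration; the caption does not specify how the spectra were preprocessed.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins: {bin_index: intensity}."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def spectrum_cosine(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two peak lists after binning."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two hypothetical replicate spectra as (m/z, intensity) pairs.
spectrum = [(175.1, 100.0), (304.2, 50.0)]
replicate = [(175.2, 80.0), (304.1, 60.0), (500.3, 10.0)]

print(round(spectrum_cosine(spectrum, replicate), 3))
```

The same cosine formula applied to the embedding vectors instead of the binned intensities gives the y-axis quantity.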

**Figure 4.**
Numbers of spectra (PSMs) and unique peptides identified by SpecEncoder and SpectraST on the charge 2+ spectra in three human proteomics datasets at a peptide-level FDR of 0.01.

**Figure 5.**
Numbers of spectra (PSMs) and unique peptides identified by SpecEncoder and SpectraST on the charge 3+ spectra in three human proteomics datasets at a peptide-level FDR of 0.01.

**Figure 6.**
For charge 2+ spectra, the numbers of spectra (PSMs) and unique peptides identified by MSGF+ and SpecEncoder in the three human proteomics datasets. A 1% FDR cutoff at the peptide level (calculated on each raw file) is applied.

**Figure 7.**
For charge 3+ spectra, the numbers of spectra (PSMs) and unique peptides identified by SpecEncoder and MSGF+ in the three human proteomics datasets. A 1% FDR cutoff at the peptide level (calculated on each raw file) is applied.
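
The peptide-level FDR cutoff used in Figures 4 to 7 can be illustrated with a short sketch. This page does not state how the FDR was estimated, so the target-decoy estimate below (decoys divided by targets above a score threshold) is an assumption, as are the function name `peptides_at_fdr` and the toy scores. Per the captions for Figures 6 and 7, such a filter would be applied separately to each raw file.

```python
def peptides_at_fdr(psms, fdr_threshold=0.01):
    """Return target peptides accepted at a peptide-level FDR threshold.

    psms: iterable of (peptide, score, is_decoy) tuples; higher score is better.
    Assumed estimate: FDR = (#decoy peptides) / (#target peptides) at a cutoff.
    """
    # Peptide level: keep only the best-scoring PSM for each peptide.
    best = {}
    for peptide, score, is_decoy in psms:
        if peptide not in best or score > best[peptide][0]:
            best[peptide] = (score, is_decoy)

    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)

    # Walk down the ranking and remember the deepest position that still
    # satisfies the FDR threshold.
    targets = decoys = 0
    cutoff = 0
    for position, (_, (_, is_decoy)) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= fdr_threshold:
            cutoff = position

    return [pep for pep, (_, is_decoy) in ranked[:cutoff] if not is_decoy]


# Hypothetical PSMs from one raw file.
psms = [
    ("AAA", 0.99, False),
    ("AAA", 0.80, False),
    ("BBB", 0.95, False),
    ("rev_CCC", 0.60, True),
    ("DDD", 0.50, False),
]

print(peptides_at_fdr(psms, fdr_threshold=0.01))
```

Counting the accepted peptides gives the "unique peptides" quantity; counting all PSMs whose peptide was accepted gives one possible definition of the PSM count.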
