J Proteome Res. 2019 Oct 4;18(10):3792-3799.
doi: 10.1021/acs.jproteome.9b00291. Epub 2019 Aug 30.

Extremely Fast and Accurate Open Modification Spectral Library Searching of High-Resolution Mass Spectra Using Feature Hashing and Graphics Processing Units

Wout Bittremieux et al. J Proteome Res.

Abstract

Open modification searching (OMS) is a powerful search strategy to identify peptides with any type of modification. OMS works by using a very wide precursor mass window to allow modified spectra to match against their unmodified variants, after which the modification types can be inferred from the corresponding precursor mass differences. A disadvantage of this strategy, however, is the large computational cost, because each query spectrum has to be compared against a multitude of candidate peptides. We have previously introduced the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. Here we demonstrate how this candidate selection procedure can be further optimized using graphics processing units. Additionally, we introduce a feature hashing scheme to convert high-resolution spectra to low-dimensional vectors. On the basis of these algorithmic advances, along with low-level code optimizations, the new version of ANN-SoLo is up to an order of magnitude faster than its initial version. This makes it possible to efficiently perform open searches on a large scale to gain a deeper understanding of the protein modification landscape. We demonstrate the computational efficiency and identification performance of ANN-SoLo based on a large data set of the draft human proteome. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo.

Keywords: approximate nearest neighbor indexing; feature hashing; graphics processing unit; mass spectrometry; open modification searching; post-translational modification; proteomics; spectral library.
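
The open search strategy described in the abstract can be illustrated with a small sketch. This is not ANN-SoLo's code: the function names, the 0.02 Da and 500 Da windows, and the library masses are assumptions made up for the example. It only shows why a wide precursor window lets a modified query reach its unmodified library entry, and how the mass difference then hints at the modification.

```python
PROTON_MASS = 1.007276  # Da


def neutral_mass(precursor_mz, charge):
    """Neutral peptide mass computed from the precursor m/z and charge."""
    return (precursor_mz - PROTON_MASS) * charge


def select_candidates(query_mass, library_masses, tolerance_da):
    """Indices of library entries whose mass lies inside the precursor window."""
    return [i for i, mass in enumerate(library_masses)
            if abs(query_mass - mass) <= tolerance_da]


# Hypothetical library of unmodified peptide masses (Da).
library = [1000.50, 1100.55, 1250.70, 1800.90]
# Hypothetical query: the first peptide carrying a +15.9949 Da shift.
query = 1000.50 + 15.9949

# A narrow window (standard search) misses the modified peptide entirely.
standard = select_candidates(query, library, tolerance_da=0.02)
# A wide window (open search) admits many more candidates, which is the
# source of the computational cost mentioned in the abstract.
open_search = select_candidates(query, library, tolerance_da=500.0)

# Precursor mass differences from which a modification could be inferred.
deltas = {i: round(query - library[i], 4) for i in open_search}
```

In this toy case the standard search returns no candidates, while the open search returns three, one of which differs from the query by 15.9949 Da.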

Figures

Figure 1:
High-resolution MS/MS spectra are first converted to sparse vectors using small mass bins to accurately capture the fragment masses. Next, these high-dimensional, sparse vectors are converted to lower-dimensional vectors through feature hashing.
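
The two-step conversion in Figure 1 might look roughly like the following. It is a minimal sketch under stated assumptions: the 0.04 Da bin width and vector length of 800 are the values quoted in the Figure 2 caption, but the CRC32 hash, the unsigned accumulation, the unit-length normalization, and the function names are choices made for this illustration and are not taken from the ANN-SoLo source.

```python
import zlib

import numpy as np


def hash_bin(bin_index, vector_length):
    """Map a mass-bin index to a slot in the low-dimensional vector.

    CRC32 is only a stand-in hash function for this sketch.
    """
    return zlib.crc32(int(bin_index).to_bytes(8, "little")) % vector_length


def spectrum_to_vector(mz, intensity, bin_width=0.04, vector_length=800):
    """Bin fragment masses finely, then hash the bins into a short vector."""
    vector = np.zeros(vector_length, dtype=np.float32)
    for mass, peak_intensity in zip(mz, intensity):
        # Step 1: small mass bins give a very high-dimensional sparse index.
        bin_index = int(mass // bin_width)
        # Step 2: feature hashing folds that index into a fixed-length vector.
        vector[hash_bin(bin_index, vector_length)] += peak_intensity
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector


# Hypothetical fragment peaks (m/z, relative intensity).
mz = [175.119, 304.161, 433.204, 562.246]
intensity = [0.3, 1.0, 0.6, 0.8]
vec = spectrum_to_vector(mz, intensity)
```

Because the vectors are normalized to unit length, the dot product of two such vectors can be read as a cosine similarity. Distinct bins can collide in the same slot, which is the price paid for the reduced dimensionality.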
Figure 2:
Comparison between the spectral similarity based on the spectrum shifted dot product and the vector dot product for spectrum-spectrum matches (SSMs) from the iPRG2012 data set (1% FDR). (A) The vector dot product is obtained by binning spectra using 1 Da mass bins. (B) The vector dot product is obtained by binning spectra using 0.04 Da mass bins hashed to vectors of length 800. When using 1 Da mass bins, the vector dot product often overestimates the actual spectral similarity (A; SSMs above the diagonal), whereas small mass bins avoid spurious peak matches (B).
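
The effect described in this caption can be reproduced with a toy example. The sketch below uses a plain cosine similarity on binned peaks rather than the shifted dot product, and the two spectra are invented so that their fragments differ by about 0.5 Da. All names and values are assumptions for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width):
    """Sum peak intensities per mass bin; returns {bin_index: intensity}."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


# Two hypothetical spectra whose fragment masses differ by about 0.5 Da,
# so they do not share any fragment at high resolution.
spectrum_a = [(300.10, 1.0), (450.10, 0.5)]
spectrum_b = [(300.62, 1.0), (450.62, 0.5)]

# 1 Da bins place the mismatched fragments in the same bins.
coarse = cosine(bin_spectrum(spectrum_a, 1.0), bin_spectrum(spectrum_b, 1.0))
# 0.04 Da bins keep them apart.
fine = cosine(bin_spectrum(spectrum_a, 0.04), bin_spectrum(spectrum_b, 0.04))
```

With 1 Da bins the two unrelated spectra appear identical, whereas with 0.04 Da bins their similarity drops to zero, which mirrors the overestimation visible in panel A.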
Figure 3:
ANN-SoLo performance improvements. Whereas an open search of the iPRG2012 data set using the previous version of ANN-SoLo took 50 min, the current version performs a similar search in under 6 min. Timing results were obtained on an Intel Xeon E5-2643 v3 processor for ANN-SoLo version 0.1.3, combined with an NVIDIA GeForce RTX 2080 GPU for ANN-SoLo version 0.2.
Figure 4:
Trade-off between search speed and the number of identified spectra for the iPRG2012 data set (up and to the right is better). The number of identifications is represented as the SSM recall compared to the results of a brute-force open search without using ANN indexing. Timing results were obtained on an Intel Xeon E5-2643 v3 processor with four threads for the ANN CPU and brute-force searches, combined with an NVIDIA GeForce RTX 2080 GPU for the ANN GPU searches. Parallel execution (on the CPU or GPU) was limited to the candidate selection step. The multiple ANN results correspond to different hyperparameter configurations, with the settings that lie on the Pareto frontier shown. ANN indexing provides speed-ups of up to two orders of magnitude compared to the brute-force open search, approaching the speed of a standard search. The ANN hyperparameters can be set to achieve a higher SSM recall at the expense of a slight decrease in search speed, maximizing the number of identified spectra while still achieving a speed-up of an order of magnitude over a brute-force open search. Specific values of the ANN hyperparameters and the corresponding speed and identification performance are available in Supplementary Table S1.
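
The two quantities plotted in this figure, SSM recall and the Pareto frontier over hyperparameter configurations, can be computed as sketched below. The helper names and all numbers are hypothetical. They are not the values reported in the supplementary table.

```python
def ssm_recall(ann_ssms, brute_force_ssms):
    """Fraction of brute-force SSMs that the ANN search also reports."""
    return len(ann_ssms & brute_force_ssms) / len(brute_force_ssms)


def pareto_frontier(configs):
    """Keep configurations that no other configuration beats on both axes.

    Each configuration is (name, speed, recall); higher is better for both.
    """
    frontier = []
    for name, speed, recall in configs:
        dominated = any(
            s >= speed and r >= recall and (s > speed or r > recall)
            for _, s, r in configs
        )
        if not dominated:
            frontier.append((name, speed, recall))
    return sorted(frontier, key=lambda config: config[1])


# Hypothetical (name, spectra per second, SSM recall) measurements.
configs = [
    ("a", 100.0, 0.99),
    ("b", 400.0, 0.95),
    ("c", 300.0, 0.90),  # slower and less complete than "b"
    ("d", 900.0, 0.80),
]
frontier = pareto_frontier(configs)

# Recall of a toy ANN result set against a toy brute-force result set.
recall = ssm_recall({1, 2, 3}, {1, 2, 3, 4})
```

Here configuration "c" is dropped because "b" is both faster and has higher recall, leaving "a", "b", and "d" as the trade-off curve.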
Figure 5:
Precursor mass differences for the Kim data set (Table 1). Only nonzero precursor mass differences are shown; the majority of SSMs correspond to unmodified peptides with a zero precursor mass difference. The five most frequent precursor mass differences are annotated with their likely modifications.
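
A histogram of this kind can be derived from a list of per-SSM precursor mass differences. The following is a minimal sketch with invented values. The 0.02 Da bin width, the zero tolerance, and the function name are assumptions, and the input is not taken from the Kim data set.

```python
from collections import Counter


def frequent_mass_differences(deltas, bin_width=0.02, zero_tolerance=0.01, top=5):
    """Most frequent nonzero precursor mass differences as (bin center, count)."""
    counts = Counter(
        round(delta / bin_width)
        for delta in deltas
        if abs(delta) > zero_tolerance  # skip unmodified matches
    )
    return [(round(bin_index * bin_width, 2), count)
            for bin_index, count in counts.most_common(top)]


# Hypothetical mass differences (Da); values near zero are unmodified SSMs.
deltas = [0.0, 0.001, 15.9949, 15.9951, 15.9947, 0.9840, 0.9842, 79.9663, 0.0]
top_shifts = frequent_mass_differences(deltas)
```

Each frequent bin would then be compared against known modification masses (for instance, a shift close to 15.995 Da is commonly attributed to oxidation) to produce annotations like those in the figure.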
