J Proteome Res. 2023 Jul 7;22(7):2246-2255.
doi: 10.1021/acs.jproteome.2c00807. Epub 2023 May 26.

AIomics: Exploring More of the Proteome Using Mass Spectral Libraries Extended by Artificial Intelligence

Lewis Y Geer et al. J Proteome Res. 2023.

Abstract

The unbounded permutations of biological molecules, including proteins and their constituent peptides, present a dilemma in identifying the components of complex biosamples. Sequence search algorithms used to identify peptide spectra can be expanded to cover larger classes of molecules, including more modifications, isoforms, and atypical cleavage, but at the cost of false positives or false negatives due to the simplified spectra they compute from sequence records. Spectral library searching can help solve this issue by precisely matching experimental spectra to library spectra with excellent sensitivity and specificity. However, compiling spectral libraries that span entire proteomes is pragmatically difficult. Neural networks that predict complete spectra containing a full range of annotated and unannotated ions can be used to replace these simplified spectra with libraries of fully predicted spectra, including modified peptides. Using such a network, we created predicted spectral libraries that were used to rescore matches from a sequence search done over a large search space, including a large number of modifications. Rescoring improved the separation of true and false hits by 82%, yielding an 8% increase in peptide identifications, including a 21% increase in nonspecifically cleaved peptides and a 17% increase in phosphopeptides.
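The rescoring idea described above can be illustrated with a minimal sketch. It assumes spectra are lists of (m/z, intensity) pairs and uses a plain cosine similarity over binned peaks as a stand-in; it is not the corrected score S used in the paper, whose definition is not given in the abstract. The function names, the bin width, and the `predict` callable (a placeholder for a spectrum-prediction network) are all hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.02):
    """Sum the intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def cosine_similarity(experimental, predicted, bin_width=0.02):
    """Cosine similarity of two binned spectra (a stand-in, not the paper's S)."""
    a = bin_spectrum(experimental, bin_width)
    b = bin_spectrum(predicted, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rescore(experimental, candidates, predict):
    """Rank candidate peptides by similarity of their predicted spectra.

    `candidates` are the matches returned by a sequence search;
    `predict` is a hypothetical callable mapping a candidate to predicted peaks.
    """
    scored = [(cosine_similarity(experimental, predict(c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch the sequence search supplies the candidates and the predicted spectra only reorder them, which mirrors the rescoring arrangement the abstract describes without reproducing its scoring function.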

Keywords: algorithms; machine learning; peptides; proteome analysis; search engine methods; tandem mass spectrometry.

Conflict of interest statement

The authors declare no competing financial interests. All commercial instruments, software, and materials used in the study are for experimental purposes only. Such identification does not intend recommendation or endorsement by the National Institute of Standards and Technology, nor does it intend that the materials, software, or instruments used are necessarily the best available for the purpose.

Figures

Figure 1.
Mirror plots of (A) a non-tryptic spectrum with score near the median prediction score and (B) a phosphopeptide spectrum with score near the bottom 10th percentile score. The blue spectrum at the top of each mirror plot is the experimental spectrum from the test set, annotated by ion series, including immonium ions (IQA, etc.), parent ions (p), and ions containing carbon-13 (+i). The red spectrum below is the matching predicted spectrum. The predicted phosphopeptide spectrum contains neutral loss ions that are useful for identifying and localizing phosphosites, as well as internal ions (Int/), immonium ions, and unannotated ions.
Figure 2.
(A) Histogram of the similarity score S calculated between the experimental spectra and predicted spectra in the test set. (B) Histogram of S for the subset of the test set that contains the spectra of phosphopeptides. (C) Histogram of S for TMT derivatized peptides. (D) Histogram of S for non-tryptic peptides.
Figure 3.
(A) Histogram of the Mascot ions score for both true and false matches to the test set spectra, searching against the human, mouse, and Chinese hamster proteomes. (B) Histogram of the corrected S score as applied to predicted spectra for the same search results as (A). The separation between true and false matches is improved by 82%.
Figure 4.
FDR analysis done before and after rescoring using test spectra as the queries. True matches are from the human, mouse, and Chinese hamster proteome and defined as those that match the peptide sequence, charge, and modifications of the query. Spectra were predicted for each Mascot sequence search match and a corrected Stein-Scott dot product S calculated to rescore the matches.
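The FDR comparison in Figure 4 relies on matches whose truth is known from the query annotation. A minimal sketch of that kind of analysis follows, assuming each match is a (score, is_true) pair and that FDR at a threshold is the fraction of accepted matches that are false. The function name and the handling of tied scores (none) are illustrative assumptions, not the authors' procedure.

```python
def identifications_at_fdr(matches, max_fdr=0.01):
    """Return (score_threshold, n_true) for the largest accepted set with FDR <= max_fdr.

    `matches` is an iterable of (score, is_true) pairs, where is_true marks a
    match that agrees with the known identity of the query spectrum.
    """
    ordered = sorted(matches, key=lambda m: m[0], reverse=True)
    best = (None, 0)
    n_true = 0
    n_false = 0
    for score, is_true in ordered:
        if is_true:
            n_true += 1
        else:
            n_false += 1
        # FDR of everything accepted so far (scores >= current score).
        if n_false / (n_true + n_false) <= max_fdr:
            best = (score, n_true)
    return best
```

Running this once on the original search scores and once on the rescored matches, at the same `max_fdr`, gives a before-and-after count of identifications comparable in spirit to the figure.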

