J Proteome Res. 2013 Apr 5;12(4):1560-8.

doi: 10.1021/pr300453t. Epub 2013 Mar 8.

A new approach to evaluating statistical significance of spectral identifications

Hosein Mohimani¹, Sangtae Kim, Pavel A Pevzner

Affiliation

¹ Department of Electrical and Computer Engineering and Department of Computer Science and Engineering, University of California-San Diego, San Diego, California 92093.

PMID: 23343606
PMCID: PMC5590104
DOI: 10.1021/pr300453t

Hosein Mohimani et al. J Proteome Res. 2013.

Abstract

While nonlinear peptide natural products such as Vancomycin and Daptomycin are among the most effective antibiotics, the computational techniques for sequencing such peptides are still in their infancy. Previous methods for sequencing peptide natural products are based on Nuclear Magnetic Resonance spectroscopy and require large amounts (milligrams) of purified materials. Recently, development of mass spectrometry-based methods has enabled accurate sequencing of nonlinear peptide natural products using picograms of material, but the question of evaluating statistical significance of Peptide-Spectrum Matches (PSMs) for these peptides remains open. Moreover, it is unclear how to decide whether a given spectrum is produced by a linear, cyclic, or branch-cyclic peptide. Surprisingly, all previous mass spectrometry studies overlooked the fact that a very similar problem was successfully addressed in particle physics in 1951. In this work, we develop a method for estimating statistical significance of PSMs defined by any peptide (including linear and nonlinear). This method enables us to determine whether a peptide is linear, cyclic, or branch-cyclic, an important step toward identification of peptide natural products.

Figures

**Figure 1**
Deciding whether a peptide that produced a spectrum is linear, cyclic, or branch-cyclic. Given a spectrum of unknown structure, we compute its score under different structure assumptions (e.g., linear, cyclic, or branch-cyclic) and derive a p-value for each assumption. If one of the structures results in a very small p-value (e.g., a linear structure with a p-value of 0.0001), that structure is accepted as the most likely structure.
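
The decision rule in this caption can be sketched in a few lines. This is only an illustration: the function name `most_likely_structure`, the threshold of 1e-3, and the cyclic and branch-cyclic p-values are assumptions made for the example; only the linear p-value of 0.0001 comes from the caption.

```python
def most_likely_structure(p_values, threshold=1e-3):
    """Return the structure with the smallest p-value, or None if no
    p-value falls at or below the (assumed) significance threshold."""
    best = min(p_values, key=p_values.get)
    return best if p_values[best] <= threshold else None


# Hypothetical p-values for one spectrum under each structure assumption.
p_values = {"linear": 0.0001, "cyclic": 0.2, "branch-cyclic": 0.35}
print(most_likely_structure(p_values))
```

Returning `None` when nothing is significant keeps the sketch from forcing a structure call on a spectrum that matches no assumption well.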

**Figure 2**
A) Markov chain before performing DPR, with equilibrium probabilities (0.999, 0.001). B) Markov chain after performing DPR, with equilibrium probabilities (0.5, 0.5). C) An example of a Markov chain with nine peptides in three score states. D) Probability distribution after performing DPR with oversampling factors (μ₁, μ₂, μ₃) = (1, 2, 3). The states whose probability decreases are shown in blue, and the states whose probability increases are shown in red.
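
As a small numeric check of panels A and B, the relation $p_{k}^{'} = c μ_{k} p_{k}$ stated in the Figure 4 caption can be applied directly. The oversampling factors (1, 999) below are an assumption chosen so that the two states end up equally likely; the caption does not state which factors were used.

```python
def reweight(p, mu):
    """Apply p'_k = c * mu_k * p_k, with c chosen so the result sums to 1."""
    weighted = [m * q for m, q in zip(mu, p)]
    total = sum(weighted)
    return [w / total for w in weighted]


before = [0.999, 0.001]  # equilibrium probabilities from panel A
mu = [1, 999]            # assumed oversampling factors
print(reweight(before, mu))  # both entries are approximately 0.5
```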

**Figure 3**
A) Illustration of all sister peptides (1,3,3), (1,1,5), and (2,2,3) of the cyclic peptide (1,2,4). B) Illustration of the Markov chain for cyclic peptides of length 3 and mass 7. There are four different cyclic peptides in total: (1,1,5), (1,2,4), (1,3,3), and (2,2,3). Each random mutation is determined by selecting i (three cases) and δ (four cases), giving rise to a total of twelve equiprobable mutations. Transition probabilities between the states of the Markov chain, derived from the uniform mutation probabilities (1/12), are shown on each edge of the Markov chain.
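
The state space in panel B can be reproduced by brute force. The sketch below assumes integer masses and treats two cyclic peptides as identical when one is a rotation or a reflection of the other; that equivalence is an assumption, chosen because it yields the four peptides listed in the caption. The helper names `canonical` and `cyclic_peptides` are hypothetical.

```python
from itertools import product


def canonical(peptide):
    """Smallest tuple among all rotations of the peptide and of its reverse."""
    n = len(peptide)
    reverse = peptide[::-1]
    candidates = [seq[i:] + seq[:i] for seq in (peptide, reverse) for i in range(n)]
    return min(candidates)


def cyclic_peptides(length, mass):
    """All distinct cyclic peptides with the given length and total mass,
    using positive integer masses for each position."""
    found = set()
    for parts in product(range(1, mass + 1), repeat=length):
        if sum(parts) == mass:
            found.add(canonical(parts))
    return sorted(found)


print(cyclic_peptides(3, 7))
```

The enumeration is exponential in the peptide length, so it only suits toy cases like this one.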

**Figure 4**
(A) MS-DPR-Iteration($\mu_{1}, \dots, \mu_{n}$) algorithm adapted for estimating statistical significance of PSMs. The algorithm produces the peptide process $\mathit{Peptide}_{0}, \mathit{Peptide}_{1}, \dots, \mathit{Peptide}_{N}$ and their scores $\mathit{Score}(\mathit{Peptide}_{0}), \mathit{Score}(\mathit{Peptide}_{1}), \dots, \mathit{Score}(\mathit{Peptide}_{N})$, with equilibrium probability distribution $p_{1}^{'}, \dots, p_{n}^{'}$ satisfying $p_{k}^{'} = c \mu_{k} p_{k}$ for a constant $c$. (*) Most of the time, the ratio $r = \mu_{\mathit{Score}(\mathit{Peptide}')} / \mu_{\mathit{Score}(\mathit{Peptide})}$ is not an integer. In that case, $Y$ is a random variable that takes the value $\lceil r \rceil$ with probability $p = r - \lfloor r \rfloor$ and the value $\lfloor r \rfloor$ with probability $1 - p$. Note that when $\mu_{1} = \dots = \mu_{n} = 1$, this reduces to simple Monte Carlo estimation of the probability distribution from N peptides. (B) MS-DPR(K) algorithm for estimating the probability distribution of scores. (**) While MS-DPR uses the same global variables as MS-DPR-Iteration, these variables are omitted for brevity.
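
The randomized rounding in note (*) is easy to sketch in isolation. The function name `random_round` and the use of Python's `random` module are assumptions made for this illustration; the rest of the MS-DPR-Iteration algorithm is not reproduced here.

```python
import math
import random


def random_round(ratio, rng=random):
    """Return ceil(ratio) with probability ratio - floor(ratio), and
    floor(ratio) otherwise, so that the expected value equals ratio."""
    low = math.floor(ratio)
    p = ratio - low  # zero when ratio is already an integer
    return low + 1 if rng.random() < p else low


rng = random.Random(0)  # seeded generator, for reproducibility only
samples = [random_round(2.3, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # close to 2.3
```

Rounding this way keeps the number of copies an integer while leaving its expectation equal to the ratio of oversampling factors.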

**Figure 5**
(A) Illustration of *CyclicSpectrum*(*Tyrocidine*). (B) Illustration of *BranchCyclicSpectrum*(*Daptomycin*).

**Figure 6**
Evolution of (A) *μ_k* and (B) *p_k* over three iterations of MS-DPR. The analysis is performed for N = 1,000,000 simulated peptides of length 7 and a spectrum of the peptide KYIPGTK from the standard ISB database with parent mass 787. The blue, red, and green plots correspond to the first, second, and third iterations, respectively. In part (B), *p_GF* is plotted in black. Note that the blue plot in part (B) corresponds to the first iteration of MS-DPR, which simply gives the empirical p-value, *p_E*. From the second iteration on, *p_DPR* is very similar to *p_GF*.
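
One way to read the link between *μ_k* and *p_k* is to invert the relation $p_{k}^{'} = c μ_{k} p_{k}$ from the Figure 4 caption: counts observed under oversampling are divided by their factors and renormalized. The sketch below is an assumption about how such an estimate could be formed, not the authors' implementation, and the counts are invented for the example.

```python
def recover_distribution(counts, mu):
    """Estimate p_k from visit counts n_k collected under oversampling
    factors mu_k, using p_k proportional to n_k / mu_k."""
    weights = [n / m for n, m in zip(counts, mu)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical counts: two score states visited equally often under mu = (1, 999).
print(recover_distribution([50, 50], [1, 999]))  # approximately [0.999, 0.001]

# With all factors equal to 1 this is the plain empirical distribution.
print(recover_distribution([3, 1], [1, 1]))
```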

**Figure 7**
(A) Comparison of −*log*₁₀ of the generating function p-value with −*log*₁₀ of the MS-DPR p-value for 1388 peptides from the ISB database. The red line shows the x = y line. The correlation between the two p-values is 0.9998. A non-standard amino acid model is used, assuming that each peptide has a fixed, known length and a peak-count score. The MS-GF approach is modified accordingly to satisfy these assumptions. (B) Comparison of −*log*₁₀ of the original, publicly available MS-GF p-value with −*log*₁₀ of the MS-DPR p-value. The correlation between the two p-values is 0.9990. The standard amino acid model is used, with the variable peptide length assumption and the MS-GF score. (C) Comparison of −*log*₁₀ of *p_lin* with −*log*₁₀ of *p_cyc* for SFTI-1, SFT-L2, SKF, SDP, and spectra from the ISB dataset. The cyclic peptides SFTI-1, SFT-L2, and SKF are shown as green stars, and the linear peptide SDP is shown as a black star. Blue dots show spectra from the ISB dataset, and the red line shows the x = y line.

**Figure 8**
(A) Estimating the score distribution for PSMs formed by the cyclic peptide Tyrocidine A (single-stage MS). The solid line shows the distribution of scores of 10⁹ randomly generated peptides. The dots show the MS-DPR p-values. (B) Similar results for the *MultiStage* score defined in the multistage de novo sequencing paper, for 10⁷ peptides. Red dashed lines represent the scores of the correct peptide. The figure shows that MS-DPR p-values and empirical p-values are well correlated. Moreover, the p-value of the correct peptide is lower for the multi-stage score (5e-13) than for the single-stage score (5e-07), illustrating the advantage of multi-stage mass spectrometry. MS-DPR enables comparisons between arbitrary scoring functions. (C) Similar results for the score distribution for PSMs formed by the branch-cyclic peptide A21978C2 (single-stage MS).
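
For comparison, an empirical p-value of the kind plotted as the solid line can be sketched as a tail fraction. The convention that ties count toward the tail (score greater than or equal to the observed score) is an assumption, as is the function name `empirical_p_value`.

```python
def empirical_p_value(random_scores, observed_score):
    """Fraction of random peptide scores that are at least as high as
    the observed score (ties counted toward the tail)."""
    hits = sum(1 for s in random_scores if s >= observed_score)
    return hits / len(random_scores)


print(empirical_p_value([1, 2, 3, 4], 3))  # 0.5
```

With N random peptides, the smallest nonzero value this estimate can return is 1/N. Even 10⁹ random peptides therefore cannot resolve a p-value as small as 5e-13, which is where a rare-event method becomes necessary.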

Cited by

Mohimani H, Kersten RD, Liu WT, Wang M, Purvine SO, Wu S, Brewer HM, Pasa-Tolic L, Bandeira N, Moore BS, Pevzner PA, Dorrestein PC. Automated genome mining of ribosomal peptide natural products. ACS Chem Biol. 2014 Jul 18;9(7):1545-51. doi: 10.1021/cb500199h. Epub 2014 May 23. PMID: 24802639.
Mohimani H, Liu WT, Kersten RD, Moore BS, Dorrestein PC, Pevzner PA. NRPquest: Coupling Mass Spectrometry and Genome Mining for Nonribosomal Peptide Discovery. J Nat Prod. 2014 Aug 22;77(8):1902-9. doi: 10.1021/np500370c. Epub 2014 Aug 12. PMID: 25116163.
Mohimani H, Gurevich A, Shlemov A, Mikheenko A, Korobeynikov A, Cao L, Shcherbin E, Nothias LF, Dorrestein PC, Pevzner PA. Dereplication of microbial metabolites through database search of mass spectra. Nat Commun. 2018 Oct 2;9(1):4035. doi: 10.1038/s41467-018-06082-8. PMID: 30279420.
Caesar LK, Montaser R, Keller NP, Kelleher NL. Metabolomics and genomics in natural products research: complementary tools for targeting new chemical entities. Nat Prod Rep. 2021 Nov 17;38(11):2041-2065. doi: 10.1039/d1np00036e. PMID: 34787623. Review.
Mohimani H, Pevzner PA. Dereplication, sequencing and identification of peptidic natural products: from genome mining to peptidogenomics to spectral networks. Nat Prod Rep. 2016 Jan;33(1):73-86. doi: 10.1039/c5np00050e. PMID: 26497201. Review.
