J Proteome Res. 2013 Apr 5;12(4):1560-8.
doi: 10.1021/pr300453t. Epub 2013 Mar 8.

A new approach to evaluating statistical significance of spectral identifications

Hosein Mohimani et al. J Proteome Res. 2013.

Abstract

While nonlinear peptide natural products such as Vancomycin and Daptomycin are among the most effective antibiotics, the computational techniques for sequencing such peptides are still in their infancy. Previous methods for sequencing peptide natural products are based on Nuclear Magnetic Resonance spectroscopy and require large amounts (milligrams) of purified materials. Recently, the development of mass spectrometry-based methods has enabled accurate sequencing of nonlinear peptide natural products using picograms of material, but the question of evaluating the statistical significance of Peptide Spectrum Matches (PSMs) for these peptides remains open. Moreover, it is unclear how to decide whether a given spectrum is produced by a linear, cyclic, or branch-cyclic peptide. Surprisingly, all previous mass spectrometry studies overlooked the fact that a very similar problem was successfully addressed in particle physics in 1951. In this work, we develop a method for estimating the statistical significance of PSMs defined by any peptide (linear or nonlinear). This method enables us to identify whether a peptide is linear, cyclic, or branch-cyclic, an important step toward identification of peptide natural products.

Figures

Figure 1
Deciding whether a peptide that produced a spectrum is linear, cyclic, or branch-cyclic. Given a spectrum with unknown structure, we compute its score under different structure assumptions (e.g. linear/cyclic/branch-cyclic) and derive a p-value for each assumption. If one of the structures results in a very small p-value (e.g. linear structure with a p-value of 0.0001), that structure is accepted as the most likely structure.
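The decision rule in this caption can be sketched in a few lines. This is only an illustration: the function name `choose_structure`, the threshold `alpha`, and the cyclic and branch-cyclic p-values are hypothetical; only the linear p-value of 0.0001 comes from the caption.

```python
def choose_structure(p_values, alpha=1e-3):
    """Return the structure with the smallest p-value, or None if no
    p-value falls at or below the (assumed) threshold alpha."""
    best = min(p_values, key=p_values.get)
    return best if p_values[best] <= alpha else None


# Hypothetical p-values for one spectrum under each structure assumption.
example = {"linear": 1e-4, "cyclic": 0.2, "branch-cyclic": 0.35}
decision = choose_structure(example)
print(decision)
```

Returning `None` when nothing is significant is a design choice of this sketch; the caption only describes the case where one structure has a very small p-value.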
Figure 2
A) Markov chain before performing DPR, with equilibrium probabilities (0.999, 0.001). B) Markov chain after performing DPR, with equilibrium probabilities (0.5, 0.5). C) An example of a Markov chain with nine peptides in three score states. D) Probability distribution after performing DPR with oversampling factors (μ1, μ2, μ3) = (1, 2, 3). The states whose probability decreases are shown in blue, and the states whose probability increases are shown in red.
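The effect of oversampling on the equilibrium probabilities in panels A and B can be reproduced numerically. The sketch below assumes that the oversampled probability of state k is proportional to μk times its original probability, and that the factors (1, 999) are the ones that balance the two states; the caption gives neither the factors for panel B nor this formula explicitly, and the function names are hypothetical.

```python
def oversample(p, mu):
    """Reweight probabilities p by oversampling factors mu and renormalize."""
    weighted = [m * q for m, q in zip(mu, p)]
    total = sum(weighted)
    return [w / total for w in weighted]


def undo_oversampling(p_tilde, mu):
    """Recover the original probabilities from the oversampled ones."""
    weighted = [q / m for m, q in zip(mu, p_tilde)]
    total = sum(weighted)
    return [w / total for w in weighted]


p = [0.999, 0.001]   # equilibrium probabilities from panel A
mu = [1, 999]        # assumed oversampling factors
p_tilde = oversample(p, mu)
recovered = undo_oversampling(p_tilde, mu)
print(p_tilde, recovered)
```

The point of the round trip is that the rare state is visited about half of the time in the reweighted chain, while its original probability can still be recovered by dividing out the known factors.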
Figure 3
A) Illustration of all sister peptides (1,3,3), (1,1,5), and (2,2,3) for the cyclic peptide (1,2,4). B) Illustration of the Markov chain for cyclic peptides of length 3 and mass 7. There are four different cyclic peptides in total: (1,1,5), (1,2,4), (1,3,3), and (2,2,3). Each random mutation is determined by selecting i (three cases) and δ (four cases), giving rise to a total of twelve equiprobable mutations. Transition probabilities between the states of the Markov chain, derived from the uniform mutation probability (1/12), are shown on each edge.
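The toy Markov chain in this caption can be enumerated directly. The caption does not define the mutation, so the sketch below makes three assumptions: a mutation (i, δ) moves δ mass units from position i+1 to position i (cyclically), δ ranges over {-2, -1, 1, 2}, and a mutation that would produce a non-positive mass leaves the peptide unchanged. Under these assumptions the sister peptides of (1,2,4) come out as listed in panel A.

```python
from collections import Counter
from fractions import Fraction

DELTAS = (-2, -1, 1, 2)  # assumed set of four mass shifts


def canonical(peptide):
    """Smallest rotation/reflection, so equivalent cyclic peptides coincide."""
    n = len(peptide)
    forms = []
    for seq in (tuple(peptide), tuple(reversed(peptide))):
        for shift in range(n):
            forms.append(seq[shift:] + seq[:shift])
    return min(forms)


def mutate(peptide, i, delta):
    """Apply mutation (i, delta); invalid mutations keep the peptide."""
    n = len(peptide)
    masses = list(peptide)
    masses[i] += delta
    masses[(i + 1) % n] -= delta
    if min(masses) < 1:
        return canonical(peptide)
    return canonical(masses)


def transition_row(peptide):
    """Transition probabilities out of one state under uniform mutations."""
    n = len(peptide)
    outcomes = Counter(mutate(peptide, i, d) for i in range(n) for d in DELTAS)
    total = n * len(DELTAS)
    return {state: Fraction(count, total) for state, count in outcomes.items()}


def sisters(peptide):
    """Peptides reachable in one mutation, excluding the peptide itself."""
    return set(transition_row(peptide)) - {canonical(peptide)}


row = transition_row((1, 2, 4))
print(sorted(sisters((1, 2, 4))))
```

With three positions and four shifts, each row is built from twelve equiprobable mutations, which matches the count in the caption; the individual edge weights depend on the assumptions above and may differ from those drawn in the figure.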
Figure 4
(A) MS-DPR-Iteration(μ1, ⋯, μn) algorithm adapted for estimating statistical significance of PSMs. The algorithm produces the peptide process Peptide0, Peptide1, ⋯, PeptideN and the scores Score(Peptide0), Score(Peptide1), ⋯, Score(PeptideN), with equilibrium probability distribution p̃1, ⋯, p̃n satisfying p̃k = c·μk·pk for a constant c. (*) Most of the time, the ratio r = μ_Score(Peptide′) / μ_Score(Peptide) is not an integer. In that case Y is a random variable that takes the value ⌈r⌉ with probability p = r − ⌊r⌋ and the value ⌊r⌋ with probability 1 − p. Note that when μ1 = ⋯ = μn = 1, this reduces to simple Monte Carlo estimation of the probability distribution from N peptides. (B) MS-DPR(K) algorithm for estimating the probability distribution of scores. (**) While MS-DPR uses the same global variables as MS-DPR-Iteration, these variables are omitted for brevity.
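The randomized rounding described in note (*) can be written as a small helper. This is a sketch of that one step only, not of the MS-DPR-Iteration algorithm; the function name `sample_y` and its arguments are hypothetical, with `mu_new` and `mu_old` standing for the oversampling factors of the scores of Peptide′ and Peptide.

```python
import math
import random


def sample_y(mu_new, mu_old, rng=random):
    """Draw Y: ceil(ratio) with probability equal to the fractional part
    of the ratio, floor(ratio) otherwise, so that E[Y] equals the ratio."""
    ratio = mu_new / mu_old
    low = math.floor(ratio)
    frac = ratio - low
    # When the ratio is an integer, frac is 0 and Y is always that integer.
    return low + 1 if rng.random() < frac else low


rng = random.Random(0)
draws = [sample_y(5, 2, rng) for _ in range(20000)]
print(sum(draws) / len(draws))
```

Rounding this way keeps the expected value of Y equal to the ratio of the two factors, which is what allows non-integer ratios to be handled with integer counts.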
Figure 5
(A) Illustration of CyclicSpectrum(Tyrocidine). (B) Illustration of BranchCyclicSpectrum(Daptomycin).
Figure 6
Evolution of (A) μk and (B) pk over three iterations of MS-DPR. The analysis is performed for N = 1,000,000 simulated peptides of length 7 and a spectrum of the peptide KYIPGTK from the standard ISB database with parent mass 787. The blue, red, and green plots correspond to the first, second, and third iterations, respectively. In part (B), pGF is plotted in black. Note that the blue plot in part (B) corresponds to the first iteration of MS-DPR, which simply gives the empirical p-value, pE. From the second iteration on, pDPR is very similar to pGF.
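For reference, the empirical p-value pE mentioned in this caption can be estimated by plain Monte Carlo counting. The sketch below assumes that the p-value of a score is the fraction of random peptides scoring at least as high; the function name and the toy scores are hypothetical.

```python
def empirical_p_value(random_scores, observed_score):
    """Fraction of random peptide scores at least as high as the observed one."""
    hits = sum(1 for s in random_scores if s >= observed_score)
    return hits / len(random_scores)


# Toy scores standing in for the scores of N randomly generated peptides.
toy_scores = [1, 2, 3, 4, 5, 5, 6, 2, 3, 1]
p_e = empirical_p_value(toy_scores, 5)
print(p_e)
```

An estimate of this kind cannot resolve p-values much below 1/N, because high scores are rarely or never sampled; this is the limitation that the oversampling in later iterations is meant to address.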
Figure 7
(A) Comparison of −log10 of the generating function p-value with −log10 of the MS-DPR p-value for 1388 peptides from the ISB database. The red line shows the x = y line. The correlation between the two p-values is 0.9998. A non-standard amino acid model is used, assuming that each peptide has a fixed known length, together with a peak count score. The MS-GF approach is modified accordingly to satisfy these assumptions. (B) Comparison of −log10 of the original, publicly available MS-GF p-value with −log10 of the MS-DPR p-value. The correlation between the two p-values is 0.9990. The standard amino acid model is used, with the variable peptide length assumption and the MS-GF score. (C) Comparison of −log10 of plin versus −log10 of pcyc for SFTI-1, SFT-L2, SKF, SDP, and spectra from the ISB dataset. The cyclic peptides SFTI-1, SFT-L2, and SKF are shown as green stars, and the linear peptide SDP is shown as a black star. Blue dots show spectra from the ISB dataset, and the red line shows the x = y line.
Figure 8
(A) Estimating the score distribution for PSMs formed by the cyclic peptide Tyrocidine A (single-stage MS). The solid line shows the distribution of scores of 10^9 randomly generated peptides. The dots show the MS-DPR p-values. (B) Similar results for the MultiStage score defined in the multistage de novo sequencing paper, for 10^7 peptides. Red dashed lines represent the scores of the correct peptide. The figure shows that MS-DPR p-values and empirical p-values are well correlated. Moreover, the p-value of the correct peptide is lower for the multi-stage score (5e-13) than for the single-stage score (5e-07), illustrating the advantage of multi-stage mass spectrometry. MS-DPR enables comparisons between arbitrary scoring functions. (C) Similar results for the score distribution for PSMs formed by the branch-cyclic peptide A21978C2 (single-stage MS).
