Gapped spectral dictionaries and their applications for database searches of tandem mass spectra

Mol Cell Proteomics. 2011 Jun;10(6):M110.002220. doi: 10.1074/mcp.M110.002220. Epub 2011 Mar 28.

Authors

Kyowon Jeong¹, Sangtae Kim, Nuno Bandeira, Pavel A Pevzner

Affiliation

¹ Department of Electrical and Computer Engineering, University of California, San Diego, CA, USA.

PMID: 21444829
PMCID: PMC3108828
DOI: 10.1074/mcp.M110.002220

Abstract

Generating all plausible de novo interpretations of a peptide tandem mass (MS/MS) spectrum (the Spectral Dictionary) and quickly matching them against a database is a recently emerged alternative approach to peptide identification. However, the size of a Spectral Dictionary grows quickly with peptide length, making its generation impractical for long peptides. We introduce Gapped Spectral Dictionaries (all plausible de novo interpretations with gaps), which can be easily generated for any peptide length and thus address this limitation of the Spectral Dictionary approach. We show that Gapped Spectral Dictionaries are small, which opens the possibility of using them to speed up MS/MS searches. Our MS-Gapped-Dictionary algorithm (based on Gapped Spectral Dictionaries) enables proteogenomics applications (such as searches in the six-frame translation of the human genome) that are prohibitively time-consuming with existing approaches. MS-Gapped-Dictionary generates gapped peptides that occupy a niche between accurate but short peptide sequence tags and long but inaccurate full-length peptide reconstructions. We show that, contrary to conventional wisdom, some high-quality spectra do not have good peptide sequence tags, and we introduce gapped tags that have advantages over conventional peptide sequence tags in MS/MS database searches.
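To make the idea of a gapped peptide concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it uses nominal integer residue masses, treats a "hub" simply as a prefix mass that is retained, and the function names `prefix_masses`, `gapped_peptide`, and `matches` as well as the hub set in the example are hypothetical. A gapped peptide is represented as a list of masses, each being the mass of one amino acid or the sum of several.

```python
# Nominal (integer) residue masses in Da; a simplification for illustration only.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def prefix_masses(peptide):
    """Return the cumulative residue masses of every prefix of the peptide."""
    masses, total = [], 0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def gapped_peptide(peptide, hubs):
    """Collapse a peptide into the mass gaps between consecutive retained prefix masses.

    `hubs` is an assumed set of prefix masses to keep; the full peptide mass
    is always kept so that the gaps sum to the total mass.
    """
    prefixes = prefix_masses(peptide)
    total = prefixes[-1]
    kept = [m for m in prefixes if m in hubs or m == total]
    gaps, previous = [], 0
    for m in kept:
        gaps.append(m - previous)
        previous = m
    return gaps


def matches(gapped, peptide):
    """Check whether every cumulative mass of the gapped peptide is a prefix mass of the peptide."""
    prefixes = prefix_masses(peptide)
    prefix_set = set(prefixes)
    running = 0
    for gap in gapped:
        running += gap
        if running not in prefix_set:
            return False
    return running == prefixes[-1]


if __name__ == "__main__":
    hubs = {227, 482, 569}  # hypothetical hub masses
    gapped = gapped_peptide("LNRVSQGK", hubs)
    print(gapped)                       # [227, 255, 87, 313]
    print(matches(gapped, "LNRVSQGK"))  # True
    print(matches(gapped, "NLRVSQGK"))  # True: the first gap hides the residue order
```

In this sketch a single gapped peptide is consistent with several full-length sequences, which gives one intuition for why a dictionary of gapped interpretations can be much smaller than a dictionary of full-length ones.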


Figures

**Fig. 1.**
**Spectra for the peptides LNRVSQGK (A) and AIIDAIVSGELK (B) identified by InsPecT (release 20090910) database search.**

**Fig. 2.**
**Different modules of MS-GappedDictionary.**

**Fig. 3.**
*Left panel:* Illustration of the dynamic programming algorithm for computing the generating function of the graph G shown in (A). The nodes of the *dynamic programming (DP) graph* (B) are pairs (v,x), where v is a vertex of G and x is a score. Two nodes (v,x) and (v′,x′) are connected by an edge if and only if G contains an edge between vertices v and v′ with score x′-x. The probability of an edge between (v,x) and (v′,x′) in the DP graph equals the probability of the edge (v,v′) in G. The source s of G corresponds to a single node (s,0) in the DP graph. A node (v,x) is present in the DP graph if and only if there exists a path from (s,0) to (v,x). In this example, red (blue) edges of the DP graph in (B) come from the red (blue) edges of G in (A). All edge probabilities in (B) are 0.5 because all edge probabilities in G are 0.5. The *node probability* of a node (v,x) (shown inside the nodes in (B) and (C)) is the total probability of the paths from the source s to v with score x. The node probability of the source of the DP graph is initialized to 1, and the node probability of a node (v,x) is the *weighted* sum of the node probabilities of its *predecessors* (see (21)). The generating function is given by the probabilities of the sink nodes in the DP graph. To find all paths of score x from the source to the sink in G, one backtracks all paths from the node (t,x) in the DP graph. For example, for x = 2, two such paths are found: {s, v₂, v₄, v₇, t} and {s, v₃, v₆, t}, as shown in (C).

*Right panel:* Path Dictionary and Gapped Path Dictionary. (A) PD(G,1) and the generating function of G. (B) The construction of *G_H*, using edges between hubs v₂ and t (shown as solid blue and red edges) as examples. Solid blue and red edges in *G_H* are induced by dashed blue and red paths in G. All paths that use only nonhub vertices in G are collapsed into edges in *G_H*. (C) The hub graph *G_H*, *GPD*(G,H,1), and the generating function of *G_H*.
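The dynamic programming described in the left panel can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the graph below is a small hypothetical example (not the graph from the figure), the function names `generating_function` and `paths_with_score` are made up for illustration, and the graph is assumed to be acyclic with integer edge scores. Node probabilities are stored per (vertex, score) pair, exactly as the DP-graph nodes (v,x) are defined in the caption.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def generating_function(edges, source, sink):
    """Compute node probabilities of the DP graph for a directed acyclic graph.

    `edges` is a list of (u, v, score, probability) tuples.
    Returns (generating function at the sink, all node probabilities, incoming edges).
    """
    outgoing, incoming, preds = defaultdict(list), defaultdict(list), defaultdict(set)
    preds[source]  # make sure the source appears in the ordering
    for u, v, score, prob in edges:
        outgoing[u].append((v, score, prob))
        incoming[v].append((u, score, prob))
        preds[v].add(u)

    # node_prob[v][x] = total probability of paths from the source to v with score x
    node_prob = defaultdict(lambda: defaultdict(float))
    node_prob[source][0] = 1.0  # the DP-graph source (s, 0) is initialized to 1

    for u in TopologicalSorter(preds).static_order():
        for v, score, prob in outgoing[u]:
            for x, p in list(node_prob[u].items()):
                node_prob[v][x + score] += p * prob  # weighted sum over predecessors

    return dict(node_prob[sink]), node_prob, incoming


def paths_with_score(node_prob, incoming, source, node, x):
    """Backtrack from DP-graph node (node, x) to list all source-to-node paths of score x."""
    if node == source:
        return [[source]] if x == 0 else []
    paths = []
    for u, score, _prob in incoming[node]:
        if (x - score) in node_prob[u]:  # (u, x - score) exists in the DP graph
            for prefix in paths_with_score(node_prob, incoming, source, u, x - score):
                paths.append(prefix + [node])
    return paths


if __name__ == "__main__":
    # Hypothetical graph; every edge has probability 0.5, as in the caption's example.
    edges = [
        ("s", "a", 1, 0.5), ("s", "b", 0, 0.5), ("s", "c", 1, 0.5),
        ("a", "t", 1, 0.5), ("b", "t", 2, 0.5), ("c", "t", 0, 0.5),
    ]
    gf, node_prob, incoming = generating_function(edges, "s", "t")
    print(gf)  # {2: 0.5, 1: 0.25}
    print(paths_with_score(node_prob, incoming, "s", "t", 2))
    # [['s', 'a', 't'], ['s', 'b', 't']]
```

The probabilities at the sink give the generating function (here, total probability 0.5 for score 2 and 0.25 for score 1), and backtracking from (t, 2) recovers the two paths with that score, mirroring panel (C).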

**Fig. 4.**
**Gapped Spectral Dictionary size *versus* Spectral Dictionary size (for varying peptide length and number of hubs) for the Shewanella data set.**

**Fig. 5.**
**Distribution of the lengths of the gapped peptides induced by correct peptides (for 20 hubs) for the Shewanella data set.** See Supplemental Fig. S2 for different parameters.

**Fig. 6.**
**Identifiability of the δ-reduced Gapped Spectral Dictionaries from the Shewanella data set for δ = 5 (A), δ = 7 (B), and δ = 9 (C).**

**Fig. 7.**
**Average rank of (the best ranked) correct gapped peptides.** The average ranking does not exceed 80 regardless of the peptide length (for δ = 5,7,9). The number of hubs is 20. The dotted lines with open circles at the ends represent the range that the rankings fall into 90% of the time.

**Fig. 8.**
**The probability that a correct gapped peptide is found within k top-ranked peptides in the δ-reduced Gapped Spectral Dictionary.** The number of hubs is 20, and δ = 5 (see Supplement Fig. S3 for different parameters).

**Fig. 9.**
**Identifiability of the Pocket Dictionaries from the Shewanella data set for δ = 5 (A), δ = 7 (B), and δ = 9 (C).** The number of hubs is 20. Even for long peptides, Pocket Dictionaries with 50 gapped peptides are sufficient to ensure identifiability higher than 97% when δ is 5. When δ is large, larger Pocket Dictionaries are needed (see Supplemental Fig. S4).

**Fig. 10.**
**Comparison of gapped tags generated from the Pocket Dictionaries and the peptide sequence tags generated by InsPecT (on spectra from the Standard data set).**

**Fig. 11.**
**The FDR curves for MS-GappedDictionary (using either gapped tags or gapped peptides), OMSSA, InsPecT, and MS-Dictionary.** Peptide-level FDR is reported (32). For each spectrum, only the single best matching peptide is reported.

**Fig. 12.**
**The length distribution of peptides with spectral probability less than 10⁻¹³ (corresponding to FDR ≈ 1%) in the HEK data set identified by MS-GappedDictionary and MS-Dictionary in the six-frame translation of the human genome.** MS-Dictionary identifies fewer peptides than MS-GappedDictionary when the peptide length is longer than 13.


Cited by

MS-GF+ makes progress towards a universal database search tool for proteomics.
Kim S, Pevzner PA. Nat Commun. 2014 Oct 31;5:5277. doi: 10.1038/ncomms6277. PMID: 25358478
Speeding up tandem mass spectral identification using indexes.
Liu X, Mammana A, Bafna V. Bioinformatics. 2012 Jul 1;28(13):1692-7. doi: 10.1093/bioinformatics/bts244. Epub 2012 Apr 27. PMID: 22543365
De novo sequencing and homology searching.
Ma B, Johnson R. Mol Cell Proteomics. 2012 Feb;11(2):O111.014902. doi: 10.1074/mcp.O111.014902. Epub 2011 Nov 16. PMID: 22090170
Spectrum Identification using a Dynamic Bayesian Network Model of Tandem Mass Spectra.
Singh AP, Halloran J, Bilmes JA, Kirchoff K, Noble WS. Uncertain Artif Intell. 2012 Aug;28:775-785. PMID: 25383048
The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search.
Kim S, Mischerikow N, Bandeira N, Navarro JD, Wich L, Mohammed S, Heck AJ, Pevzner PA. Mol Cell Proteomics. 2010 Dec;9(12):2840-52. doi: 10.1074/mcp.M110.003731. Epub 2010 Sep 9. PMID: 20829449

References

1. Ma B., Zhang K., Hendrie C., Liang C., Li M., Doherty-Kirby A., Lajoie G. (2003) PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 17, 2337–2342
2. Frank A., Pevzner P. (2005) PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 77, 964–973
3. Frank A. (2009) A ranking-based Scoring Function for peptide-spectrum matches. J. Proteome Res. 8, 2241–2252
4. Kim S., Gupta N., Bandeira N., Pevzner P. A. (2009) Spectral dictionaries: Integrating de novo peptide sequencing with database search of tandem mass spectra. Mol. Cell. Proteomics 8, 53–69
5. Kim S., Bandeira N., Pevzner P. A. (2009) Spectral Profiles, a Novel Representation of Tandem Mass Spectra and Their Applications for de Novo Peptide Sequencing and Identification. Mol. Cell. Proteomics 8, 1391–1400
