Mol Cell Proteomics. 2011 Jun;10(6):M110.002220.
doi: 10.1074/mcp.M110.002220. Epub 2011 Mar 28.

Gapped spectral dictionaries and their applications for database searches of tandem mass spectra

Kyowon Jeong et al. Mol Cell Proteomics. 2011 Jun.

Abstract

Generating all plausible de novo interpretations of a peptide tandem mass (MS/MS) spectrum (a Spectral Dictionary) and quickly matching them against a database is a recently emerged alternative approach to peptide identification. However, the size of a Spectral Dictionary grows quickly with peptide length, making generation impractical for long peptides. We introduce Gapped Spectral Dictionaries (all plausible de novo interpretations with gaps), which can be easily generated for any peptide length, thus addressing this limitation of the Spectral Dictionary approach. We show that Gapped Spectral Dictionaries are small, which opens the possibility of using them to speed up MS/MS searches. Our MS-GappedDictionary algorithm (based on Gapped Spectral Dictionaries) enables proteogenomics applications (such as searches in the six-frame translation of the human genome) that are prohibitively time-consuming with existing approaches. MS-GappedDictionary generates gapped peptides that occupy a niche between accurate but short peptide sequence tags and long but inaccurate full-length peptide reconstructions. We show that, contrary to conventional wisdom, some high-quality spectra do not have good peptide sequence tags, and we introduce gapped tags that have advantages over conventional peptide sequence tags in MS/MS database searches.
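
To make the notion of a gapped peptide concrete, the sketch below treats one as a list of masses in which each entry is the summed mass of one or more consecutive residues, and scans a protein string for positions where that mass sequence fits. It is an illustration only and is not the paper's search procedure: the function name match_gapped_peptide, the naive linear scan, the use of nominal integer residue masses, and the toy protein string are all assumptions introduced here. The peptide LNRVSQGK is taken from the Fig. 1 caption.

```python
# Nominal (integer) residue masses in daltons for the 20 standard amino acids.
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def match_gapped_peptide(gapped, protein):
    """Return (start, substring) pairs where `protein` can be cut into
    consecutive segments whose masses equal the entries of `gapped`.

    Hypothetical helper: a naive scan over all start positions.
    """
    hits = []
    for start in range(len(protein)):
        pos = start
        matched = True
        for target in gapped:
            total = 0
            # Masses are positive, so the segment ending is unique:
            # extend until the running total reaches or passes the target.
            while pos < len(protein) and total < target:
                total += MASS[protein[pos]]
                pos += 1
            if total != target:
                matched = False
                break
        if matched:
            hits.append((start, protein[start:pos]))
    return hits


# Gapped version of LNRVSQGK: the pairs N+R and S+Q are merged into gaps.
gapped_peptide = [113, 114 + 156, 99, 87 + 128, 57, 128]
toy_protein = "MALNRVSQGKT"  # made-up sequence for illustration
print(match_gapped_peptide(gapped_peptide, toy_protein))
# prints [(2, 'LNRVSQGK')]
```

Because a gap records only a summed mass, the same gapped peptide also matches any sequence whose residues rearrange within a gap (for example RN in place of NR) or swap residues of equal nominal mass (I/L, K/Q).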

Figures

Fig. 1.
Spectra for the peptide LNRVSQGK (A) and AIIDAIVSGELK (B) identified by InsPecT (release 20090910) database search.
Fig. 2.
Different modules of MS-GappedDictionary.
Fig. 3.
Left panel: Illustration of the dynamic programming algorithm for computing the generating function of graph G shown in (A). The nodes of the dynamic programming (DP) graph (B) are defined as pairs (v,x), where v is a vertex of G and x is a score. Two nodes (v,x) and (v′,x′) are connected by an edge if and only if there exists an edge between vertices v and v′ in G with score x′-x. The probability of an edge between (v,x) and (v′,x′) in the DP graph equals the probability of the edge (v,v′) in G. A source s in graph G corresponds to a single node (s,0) in the DP graph. A node (v,x) is present in the DP graph if and only if there exists a path from (s,0) to (v,x). In this example, red (blue) edges of the DP graph in (B) correspond to the red (blue) edges of the graph G in (A). All edge probabilities in (B) are 0.5 because the probabilities of the edges of G are 0.5. The node probability of node (v,x) (shown inside nodes in (B) and (C)) is the total probability of the paths from the source s to v with score x. The node probability of the source of the DP graph is initialized to 1, and the node probability of a node (v,x) is obtained as the weighted sum of the node probabilities of its predecessors (see (21)). The generating function is represented by the probabilities of the sink nodes in the DP graph. To find all paths of score x from the source to the sink in graph G, one backtracks all paths from the node (t,x) in the DP graph. For example, if x = 2, two such paths are found: {s, v2, v4, v7, t} and {s, v3, v6, t}, as in (C).
Right panel: Path Dictionary and Gapped Path Dictionary. (A) PD(G,1) and the generating function of G. (B) The construction of GH, using edges between hubs v2 and t (shown as solid blue and red edges) as examples. Solid blue and red edges in GH are induced by dashed blue and red paths in G. All paths that use only nonhub vertices in G are collapsed into edges in GH. (C) The hub graph GH, GPD(G,H,1), and the generating function of GH.
Fig. 4.
Gapped Spectral Dictionary size versus Spectral Dictionary size (for varying peptide length and number of hubs) for the Shewanella data set.
Fig. 5.
Distribution of the lengths of the gapped peptides induced by correct peptides (for 20 hubs) for the Shewanella data set (see Supplemental Fig. S2 for different parameters).
Fig. 6.
Identifiability of the δ-reduced Gapped Spectral Dictionaries from the Shewanella data set for δ = 5 (A), δ = 7 (B), and δ = 9 (C).
Fig. 7.
Average rank of (the best ranked) correct gapped peptides. The average rank does not exceed 80 regardless of the peptide length (for δ = 5, 7, 9). The number of hubs is 20. The dotted lines with open circles at the ends represent the range within which the ranks fall 90% of the time.
Fig. 8.
The probability that a correct gapped peptide is found within k top-ranked peptides in the δ-reduced Gapped Spectral Dictionary. The number of hubs is 20, and δ = 5 (see Supplement Fig. S3 for different parameters).
Fig. 9.
Identifiability of the Pocket Dictionaries from the Shewanella data set for δ = 5 (A), δ = 7 (B), and δ = 9 (C). The number of hubs is 20. Even for long peptides, Pocket Dictionaries with 50 gapped peptides are sufficient to ensure identifiability higher than 97% when δ is 5. When δ is large, larger Pocket Dictionaries are needed (see Supplemental Fig. S4).
Fig. 10.
Comparison of gapped tags generated from the Pocket Dictionaries and the peptide sequence tags generated by InsPecT (on spectra from the Standard data set).
Fig. 11.
The FDR curves for MS-GappedDictionary (using either gapped tags or gapped peptides), OMSSA, InsPecT, and MS-Dictionary (peptide-level FDR is reported (32)). For each spectrum, only the single best matching peptide is reported.
Fig. 12.
The length distribution of peptides with spectral probability less than 10^-13 (corresponding FDR ≈ 1%) in the HEK data set identified by MS-GappedDictionary and MS-Dictionary in the six-frame translation of the human genome. MS-Dictionary identifies fewer peptides than MS-GappedDictionary when the peptide length is longer than 13.
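
The dynamic programming scheme described in the Fig. 3 caption (left panel) can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names generating_function and backtrack, the edge-list representation, and the four-edge toy graph are all hypothetical, and the toy graph is not the graph G drawn in the figure. It follows the caption's definitions: DP nodes are pairs (v, x), the source node (s, 0) starts with probability 1, and each node's probability is the weighted sum over its predecessors.

```python
from collections import defaultdict


def generating_function(edges, order, source, sink):
    """Compute the score distribution of source-to-sink paths in a DAG.

    edges: list of (u, v, score, prob) tuples.
    order: vertices of the DAG in topological order.
    Returns (sink distribution {score: probability}, predecessor map of DP nodes).
    """
    out = defaultdict(list)
    for u, v, score, prob in edges:
        out[u].append((v, score, prob))

    # node_prob[v][x]: total probability of paths from source to v with score x.
    node_prob = {v: defaultdict(float) for v in order}
    node_prob[source][0] = 1.0
    preds = defaultdict(list)  # DP node (v, x) -> list of predecessor DP nodes

    for u in order:  # topological order: node_prob[u] is final when u is visited
        for x, p in list(node_prob[u].items()):
            for v, score, prob in out[u]:
                node_prob[v][x + score] += p * prob
                preds[(v, x + score)].append((u, x))
    return dict(node_prob[sink]), preds


def backtrack(preds, source, node):
    """List all paths of G that end at DP node `node` = (v, x)."""
    if node == (source, 0):
        return [[source]]
    paths = []
    for prev in preds[node]:
        for path in backtrack(preds, source, prev):
            paths.append(path + [node[0]])
    return paths


# Toy DAG (made up for illustration): outgoing probabilities sum to 1 per vertex.
EDGES = [
    ("s", "a", 1, 0.5), ("s", "b", 0, 0.5),
    ("a", "t", 1, 0.5), ("a", "c", 0, 0.5),
    ("c", "t", 0, 1.0), ("b", "t", 2, 1.0),
]
ORDER = ["s", "a", "b", "c", "t"]

gf, preds = generating_function(EDGES, ORDER, "s", "t")
print(gf)                               # prints {2: 0.75, 1: 0.25}
print(backtrack(preds, "s", ("t", 2)))  # prints [['s', 'a', 't'], ['s', 'b', 't']]
```

The probabilities at the sink sum to 1 because every source-to-sink path is counted exactly once, and backtracking from (t, x) only visits DP nodes that lie on a path of score x, which is what keeps enumeration of a fixed-score dictionary tractable.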

References

    1. Ma B., Zhang K., Hendrie C., Liang C., Li M., Doherty-Kirby A., Lajoie G. (2003) PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 17, 2337–2342
    2. Frank A., Pevzner P. (2005) PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 77, 964–973
    3. Frank A. (2009) A ranking-based scoring function for peptide-spectrum matches. J. Proteome Res. 8, 2241–2252
    4. Kim S., Gupta N., Bandeira N., Pevzner P. A. (2009) Spectral dictionaries: integrating de novo peptide sequencing with database search of tandem mass spectra. Mol. Cell. Proteomics 8, 53–69
    5. Kim S., Bandeira N., Pevzner P. A. (2009) Spectral profiles, a novel representation of tandem mass spectra and their applications for de novo peptide sequencing and identification. Mol. Cell. Proteomics 8, 1391–1400
