GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model

David L Tabb¹, Anita Saraf, John R Yates 3rd

PMID: 14640709
PMCID: PMC2915448
DOI: 10.1021/ac0347462

David L Tabb et al. Anal Chem. 2003 Dec 1;75(23):6415-21.

Affiliation

¹ SR11 Department of Cell Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, USA.

Abstract

Shotgun proteomics is a powerful tool for identifying the protein content of complex mixtures via liquid chromatography and tandem mass spectrometry. The most widely used class of algorithms for analyzing mass spectra of peptides has been database search software such as SEQUEST. A new sequence tag database search algorithm, called GutenTag, makes it possible to identify peptides with unknown posttranslational modifications or sequence variations. This software automates the process of inferring partial sequence "tags" directly from the spectrum and efficiently examines a sequence database for peptides that match these tags. When multiple candidate sequences result from the database search, the software evaluates which is the best match by a rapid examination of spectral fragment ions. We compare GutenTag's accuracy to that of SEQUEST on a defined protein mixture, showing that both modified and unmodified peptides can be successfully identified by this approach. GutenTag analyzed 33,000 spectra from a human lens sample, identifying peptides that were missed in prior SEQUEST analysis due to sequence polymorphisms and posttranslational modifications. The software is available under license; visit http://fields.scripps.edu for information.

Figures

**Figure 1**
GutenTag procedure. GutenTag infers short sequences directly from each spectrum. The best sequence tags are sought in a sequence database to find peptide sequences that match a tag sequence and at least one flanking sequence mass. These partial and complete peptide sequences are ranked by dot-product score. In the above example, three tag sequences match to the same complete peptide sequence. This complete sequence scores more highly than the other sequences.
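
The matching rule in this caption (a tag sequence plus at least one flanking sequence mass) can be illustrated with a minimal Python sketch. This is not GutenTag's implementation: the function names `flank_length` and `match_tag`, the 0.5 Da tolerance, and the simplification that a flanking mass is just the sum of monoisotopic residue masses (ignoring terminal groups and charge) are all assumptions made for illustration.

```python
# Monoisotopic amino acid residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def flank_length(residues, target_mass, tol=0.5):
    """Return how many leading residues sum to target_mass, or None."""
    if abs(target_mass) <= tol:
        return 0  # an empty flank explains a zero flanking mass
    total = 0.0
    for count, residue in enumerate(residues, start=1):
        total += RESIDUE_MASS[residue]
        if abs(total - target_mass) <= tol:
            return count
        if total > target_mass + tol:
            break  # overshot the target; no match from this position
    return None


def match_tag(protein, tag, n_mass, c_mass, tol=0.5):
    """Find tag occurrences whose N- or C-terminal flank matches a mass.

    Hypothetical helper: a hit is "complete" when both flanks match and
    "partial" when only one does, mirroring the caption's terminology.
    """
    hits = []
    start = protein.find(tag)
    while start != -1:
        end = start + len(tag)
        # Walk outward from the tag: backwards for the N-terminal flank.
        n_len = flank_length(reversed(protein[:start]), n_mass, tol)
        c_len = flank_length(protein[end:], c_mass, tol)
        if n_len is not None or c_len is not None:
            left = protein[start - n_len:start] if n_len is not None else ""
            right = protein[end:end + c_len] if c_len is not None else ""
            both = n_len is not None and c_len is not None
            hits.append(("complete" if both else "partial", left + tag + right))
        start = protein.find(tag, start + 1)
    return hits
```

For a made-up protein string such as `"MKAGSPEPTIDERLLK"`, calling `match_tag` with the tag `"PTI"` and flanking masses equal to the residue sums of `"GSPE"` and `"DER"` would report the complete sequence `GSPEPTIDER`; if only one flanking mass matched, the hit would be reported as partial.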

**Figure 2**
Algorithm comparisons for complete sequences. Both SEQUEST and GutenTag score true identifications (solid line) more highly than false identifications (dotted line), but the degree of score overlap for complete sequences is reduced in GutenTag. Because GutenTag produces a lower number of true identifications than SEQUEST (see Table 1), GutenTag’s improved separation between true and false identification scores is important in achieving parity between the algorithms. Because GutenTag’s intensity model was trained on tryptic spectra, it achieves better separation for tryptic peptides than for proteinase K peptides.

**Figure 3**
Partial sequence comparisons. Partial sequences predict fewer peaks than do complete sequences, and so GutenTag does not separate true identifications from false identifications as effectively in partial sequences. In the results for the proteinase K digest, the separation is almost nonexistent due to GutenTag’s trypsin-specific fragmentation model. GutenTag’s partial sequence identifications should be viewed as a basis upon which to build complete sequence identifications (including whatever posttranslational modifications are necessary) rather than final answers in themselves.

**Figure 4**
Comparison of true and false positives. ROC curves compare the number of true positives to the number of false positives for various cutoffs for a particular scoring function. The above plots analyze the complete sequence identifications from GutenTag (solid lines) and SEQUEST (dashed lines). Because GutenTag can examine partial sequences in addition to complete sequences to explain each spectrum, the number of complete sequences it proposes is significantly smaller than the number SEQUEST proposes (see Table 1). GutenTag’s improved separation between true and false complete identifications is shown by its producing a curve more similar to a right angle, particularly in tryptic spectra. To make use of SEQUEST’s greater numbers of correct sequences, one must lower cutoff scores to a point at which substantial numbers of false positive sequences also pass the cutoff; the dotted diagonals indicate the points at which the numbers of false positives equal the true positives.
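
The counting behind such a curve can be sketched in a few lines of Python. This is a generic illustration rather than the authors' analysis code; the function name `roc_points` and the convention that an identification passes when its score is at or above the cutoff are assumptions.

```python
def roc_points(true_scores, false_scores):
    """Count false and true positives at every distinct score cutoff.

    Returns (cutoff, false_positives, true_positives) tuples, ordered from
    the strictest cutoff to the most lenient one. An identification passes
    a cutoff when its score is greater than or equal to it.
    """
    cutoffs = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = []
    for cutoff in cutoffs:
        true_positives = sum(score >= cutoff for score in true_scores)
        false_positives = sum(score >= cutoff for score in false_scores)
        points.append((cutoff, false_positives, true_positives))
    return points
```

A scoring function with well separated true and false scores accumulates nearly all of its true positives before the first false positive appears, which is what gives the curve its right-angle shape. Points where the false positive count equals the true positive count lie on the diagonal mentioned in the caption.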

Grants and funding

R01 EY13288-03/EY/NEI NIH HHS/United States
