Anal Chem. 2003 Dec 1;75(23):6415-21.
doi: 10.1021/ac0347462.

GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model

David L Tabb et al. Anal Chem.

Abstract

Shotgun proteomics is a powerful tool for identifying the protein content of complex mixtures via liquid chromatography and tandem mass spectrometry. The most widely used class of algorithms for analyzing mass spectra of peptides has been database search software such as SEQUEST. A new sequence tag database search algorithm, called GutenTag, makes it possible to identify peptides with unknown posttranslational modifications or sequence variations. This software automates the process of inferring partial sequence "tags" directly from the spectrum and efficiently examines a sequence database for peptides that match these tags. When multiple candidate sequences result from the database search, the software evaluates which is the best match by a rapid examination of spectral fragment ions. We compare GutenTag's accuracy to that of SEQUEST on a defined protein mixture, showing that both modified and unmodified peptides can be successfully identified by this approach. GutenTag analyzed 33,000 spectra from a human lens sample, identifying peptides that were missed in prior SEQUEST analysis due to sequence polymorphisms and posttranslational modifications. The software is available under license; visit http://fields.scripps.edu for information.

Figures

Figure 1
GutenTag procedure. GutenTag infers short sequences directly from each spectrum. The best sequence tags are sought in a sequence database to find peptide sequences that match a tag sequence and at least one flanking sequence mass. These partial and complete peptide sequences are ranked by dot-product score. In the above example, three tag sequences match to the same complete peptide sequence. This complete sequence scores more highly than the other sequences.
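The database lookup described in this caption, where a tag must match along with at least one flanking sequence mass, can be illustrated with a minimal sketch. This is not GutenTag's implementation: the function names, the 0.5 Da tolerance, and the treatment of flanking masses as plain sums of residue masses (ignoring terminal groups and charge) are all assumptions made for illustration.

```python
# Monoisotopic residue masses (Da) for a subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "R": 156.10111,
}

def extend_flank(residues, target_mass, tol):
    """Count how many residues sum to target_mass within tol; None if no match."""
    total = 0.0
    if abs(total - target_mass) <= tol:
        return 0
    for count, aa in enumerate(residues, start=1):
        total += RESIDUE_MASS[aa]
        if abs(total - target_mass) <= tol:
            return count
        if total > target_mass + tol:
            break  # overshot the flanking mass; this side cannot match
    return None

def find_tag_matches(protein, tag, n_flank_mass, c_flank_mass, tol=0.5):
    """Hypothetical helper: find tag occurrences with at least one matching flank.

    Returns ("complete", peptide) when both flanking masses match and
    ("partial", peptide) when only one side matches.
    """
    matches = []
    start = protein.find(tag)
    while start != -1:
        end = start + len(tag)
        # Walk outward from the tag; the N-terminal side is read right to left.
        n_len = extend_flank(reversed(protein[:start]), n_flank_mass, tol)
        c_len = extend_flank(protein[end:], c_flank_mass, tol)
        if n_len is not None and c_len is not None:
            matches.append(("complete", protein[start - n_len:end + c_len]))
        elif n_len is not None:
            matches.append(("partial", protein[start - n_len:end]))
        elif c_len is not None:
            matches.append(("partial", protein[start:end + c_len]))
        start = protein.find(tag, start + 1)
    return matches

# Made-up protein sequence; flank masses are the sums for "SP" and "NDK".
print(find_tag_matches("GASPVTLNDKER", "VTL", 184.08479, 357.16483))
# A C-terminal mass that matches no residue run leaves a partial sequence.
print(find_tag_matches("GASPVTLNDKER", "VTL", 184.08479, 400.0))
```

In this toy model, a partial match marks the side where the observed mass disagrees with the database sequence, which is where a modification or sequence variation would have to be placed.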
Figure 2
Algorithm comparisons for complete sequences. Both SEQUEST and GutenTag score true identifications (solid line) more highly than false identifications (dotted line), but the degree of score overlap for complete sequences is reduced in GutenTag. Because GutenTag produces a lower number of true identifications than SEQUEST (see Table 1), GutenTag’s improved separation between true and false identification scores is important in achieving parity between the algorithms. Because GutenTag’s intensity model was trained on tryptic spectra, it achieves better separation for tryptic peptides than for proteinase K peptides.
Figure 3
Partial sequence comparisons. Partial sequences predict fewer peaks than do complete sequences, and so GutenTag does not separate true identifications from false identifications as effectively in partial sequences. In the results for the proteinase K digest, the separation is almost nonexistent due to GutenTag’s trypsin-specific fragmentation model. GutenTag’s partial sequence identifications should be viewed as a basis upon which to build complete sequence identifications (including whatever posttranslational modifications are necessary) rather than final answers in themselves.
Figure 4
Comparison of true and false positives. ROC curves compare the number of true positives to the number of false positives at various cutoffs of a particular scoring function. The above plots analyze the complete sequence identifications from GutenTag (solid lines) and SEQUEST (dashed lines). Because GutenTag can examine partial sequences in addition to complete sequences to explain each spectrum, it proposes significantly fewer complete sequences than SEQUEST does (see Table 1). GutenTag’s improved separation between true and false complete identifications is shown by its curve more closely resembling a right angle, particularly in tryptic spectra. To make use of SEQUEST’s greater number of correct sequences, one must lower cutoff scores to a point at which substantial numbers of false positive sequences also pass the cutoff; the dotted diagonals indicate the points at which the number of false positives equals the number of true positives.
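The count-based ROC curve described in this caption can be sketched as follows. The function name and the example scores are hypothetical and not taken from the paper; the sketch assumes each identification carries a score and a known true/false label, as in a defined protein mixture, and it does not treat tied scores specially.

```python
def roc_points(scores, is_true):
    """Return (false_positives, true_positives) as the cutoff is lowered.

    Identifications are ranked from highest to lowest score; each point
    gives the counts that pass a cutoff set just below that score.
    """
    ranked = sorted(zip(scores, is_true), key=lambda pair: pair[0], reverse=True)
    points = []
    true_pos = 0
    false_pos = 0
    for _score, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos))
    return points

# Made-up scores and labels, for illustration only.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3]
example_labels = [True, True, False, True, False]
print(roc_points(example_scores, example_labels))
```

A scoring function that separates true from false identifications well accumulates most of its true positives before the first false positive appears, which gives the right-angle shape mentioned in the caption.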
