Comparative Study

Genome Res. 2002 Sep;12(9):1418-27.

doi: 10.1101/gr.149502.

GAZE: a generic framework for the integration of gene-prediction data by dynamic programming

Kevin L Howe¹, Tom Chothia, Richard Durbin

PMID: 12213779
PMCID: PMC186661
DOI: 10.1101/gr.149502

Kevin L Howe et al. Genome Res. 2002 Sep.

Affiliation

¹ The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

Abstract

We describe a method (implemented in a program, GAZE) for assembling arbitrary evidence for individual gene components (features) into predictions of complete gene structures. Our system is generic in that both the features themselves, and the model of gene structure against which potential assemblies are validated and scored, are external to the system and supplied by the user. GAZE uses a dynamic programming algorithm to obtain the highest scoring gene structure according to the model and posterior probabilities that each input feature is part of a gene. A novel pruning strategy ensures that the algorithm has a run-time effectively linear in sequence length. To demonstrate the flexibility of our system in the incorporation of additional evidence into the gene prediction process, we show how it can be used to both represent nonstandard gene structures (in the form of trans-spliced genes in Caenorhabditis elegans), and make use of similarity information (in the form of Expressed Sequence Tag alignments), while requiring no change to the underlying software. GAZE is available at http://www.sanger.ac.uk/Software/analysis/GAZE.
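The kind of computation the abstract describes can be illustrated with a small sketch: a dynamic programming pass over position-sorted features that finds the highest scoring chain of allowed source → target pairs, plus a forward-backward pass that gives each feature a posterior probability. This is not the GAZE implementation. The feature names, the `RULES` table, and every score below are assumptions made up for illustration, and the inner loop is a plain quadratic scan that does not reproduce the pruning strategy mentioned above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str     # illustrative names, not GAZE's own vocabulary
    pos: int      # position on the sequence
    score: float  # log-scale score of the feature itself


# Hypothetical model: allowed (source, target) pairs mapped to a per-base
# log score for the segment between them. All numbers are invented.
RULES = {
    ("BEGIN", "start"): 0.0, ("BEGIN", "END"): 0.0,
    ("start", "donor"): 0.02, ("start", "stop"): 0.02,
    ("donor", "acceptor"): -0.01,
    ("acceptor", "donor"): 0.02, ("acceptor", "stop"): 0.02,
    ("stop", "start"): 0.0, ("stop", "END"): 0.0,
}


def edge(src, tgt):
    """Log score of the region from src to tgt, or None if the model forbids it."""
    rate = RULES.get((src.kind, tgt.kind))
    if rate is None or tgt.pos <= src.pos:
        return None
    return rate * (tgt.pos - src.pos)


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def decode(features):
    """Return (best path, best score, posterior probability per feature)."""
    feats = sorted(features, key=lambda f: f.pos)  # assumes BEGIN first, END last
    n = len(feats)
    best = [-math.inf] * n   # score of the best chain ending at each feature
    fwd = [-math.inf] * n    # log-sum over all chains ending at each feature
    back = [None] * n
    best[0] = fwd[0] = feats[0].score
    for j in range(1, n):
        terms = []
        for i in range(j):  # quadratic scan; no pruning in this sketch
            e = edge(feats[i], feats[j])
            if e is None or fwd[i] == -math.inf:
                continue
            terms.append(fwd[i] + e)
            if best[i] + e > best[j]:
                best[j], back[j] = best[i] + e, i
        if terms:
            best[j] += feats[j].score
            fwd[j] = logsumexp(terms) + feats[j].score
    bwd = [-math.inf] * n    # log-sum over all chains from each feature to END
    bwd[-1] = 0.0
    for i in range(n - 2, -1, -1):
        terms = []
        for j in range(i + 1, n):
            e = edge(feats[i], feats[j])
            if e is not None and bwd[j] != -math.inf:
                terms.append(e + feats[j].score + bwd[j])
        if terms:
            bwd[i] = logsumexp(terms)
    total = fwd[-1]          # assumes at least one valid chain reaches END
    posterior = {f: math.exp(fwd[k] + bwd[k] - total) for k, f in enumerate(feats)}
    path, k = [], n - 1
    while k is not None:     # trace back from END to BEGIN
        path.append(feats[k])
        k = back[k]
    return path[::-1], best[-1], posterior


FEATURES = [
    Feature("BEGIN", 0, 0.0), Feature("start", 10, 2.0),
    Feature("donor", 40, 1.5), Feature("donor", 50, -1.0),
    Feature("acceptor", 70, 1.0), Feature("stop", 100, 2.0),
    Feature("END", 120, 0.0),
]
path, score, posterior = decode(FEATURES)
print([f"{f.kind}@{f.pos}" for f in path], round(score, 2))
```

Because the posterior is computed for every input feature, a feature that is left out of the single best chain (here the weaker donor at position 50) still receives a nonzero probability.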

Figures

**Figure 1**
A pictorial representation of a GAZE-XML model for multiple genes on both strands. The features are represented by filled boxes, and ‘source → target’ rules by different types of arrows, each corresponding to a phase constraint as explained in the text. The labeled circles give the name of the length penalty function used for each pair of features, which are themselves defined elsewhere in the configuration file (not shown); the labeled humps indicate the segments that contribute to the score for each pair of features, where “coding” humps are the likely_coding segments referred to in the text. The rules for reverse-strand target features are not shown in their entirety, for clarity, but are formed by a simple reverse complementation of the forward-strand rules. Also omitted are the BEGIN and END features (which mark the two ends of the sequence being searched for genes, and act respectively as source and target to every other feature), as well as the distance, interruption, and DNA constraints explained in the text. The XML configuration file contains a directive to create three separate features for each predicted splice site seen in the GFF file. The effect of this, together with phase constraints between pairs of features giving rise to exons, is to carry forward whether each intron interrupts a codon at position 0, 1, or 2 to the rest of the gene structure, allowing us to ensure that the length of the coding part of each predicted gene is divisible by three.
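The phase bookkeeping described in this caption can be sketched in a few lines. The functions below are hypothetical helpers, not part of GAZE or its XML configuration: one expands a splice site into three phase-tagged copies, one checks whether an exon of a given coding length may connect a source phase to a target phase, and one confirms that carrying the phase through every exon ends at phase 0, which is equivalent to the total coding length being divisible by three. Exon lengths are assumed to count coding bases only, including the start and stop codons.

```python
def expand_splice_site(kind, pos):
    """Make three phase-tagged copies of one predicted splice site."""
    return [(kind, pos, phase) for phase in (0, 1, 2)]


def exon_allowed(source_phase, coding_length, target_phase):
    """An exon may link two features only if it advances the phase correctly."""
    return (source_phase + coding_length) % 3 == target_phase


def phases_at_donors(exon_lengths):
    """Phase carried across each intron, given coding exon lengths in order."""
    phase, phases = 0, []
    for length in exon_lengths[:-1]:  # the last exon ends at the stop codon
        phase = (phase + length) % 3
        phases.append(phase)
    return phases


def is_in_frame(exon_lengths):
    """True when the chain of phase constraints can end at a stop codon."""
    phase = 0
    for length in exon_lengths:
        phase = (phase + length) % 3
    return phase == 0


print(expand_splice_site("donor", 1234))
print(phases_at_donors([100, 50, 30]), is_in_frame([100, 50, 30]))
print(phases_at_donors([100, 50, 31]), is_in_frame([100, 50, 31]))
```

Tagging each splice site with its phase means the frame is encoded in the identity of the feature itself, so a pairwise rule between adjacent features is enough to enforce a constraint on the whole gene.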

**Figure 2**
(a) Changes necessary to the standard model in Fig. 1 to allow for *Caenorhabditis elegans trans*-splicing, and (b) the fragment of the XML configuration file affected by the changes.

**Figure 3**
A GAZE model to allow for *trans*-spliced genes and untranslated regions. It is a simple extension of the standard model in Fig. 1, which is shown in pale-shade for reference. The transcript_start and transcript_stop features were not predicted a priori for the practical use of this model, but were derived from the starts and ends of EST alignments (see Methods). The “match”, “intron”, and “span” segments shown are the EST_match, EST_intron, and *EST_span* segments referred to in the text.

**Figure 4**
Posterior feature probabilities and their accuracies. Shown for (a) features part of gene-structures predicted by GAZE and (b) all features given as input to GAZE are the number of features with a posterior probability, p_f, in each interval (bars), the number of those features that were correct (shaded portions of the bars), and the proportion of those features that were correct (line). These data were calculated for the GAZE_EST model. Plots for other models are similar (data not shown).
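The three quantities plotted in this figure (count per interval, number correct, proportion correct) amount to a simple binning of posterior probabilities. A minimal sketch follows, assuming ten equal-width intervals and made-up input values; the interval width used in the paper is not stated in this caption, and the function name is hypothetical.

```python
def bin_posteriors(posteriors, correct, n_bins=10):
    """Return one (low, high, count, n_correct, proportion) tuple per interval."""
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(posteriors, correct):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top interval
        counts[k] += 1
        hits[k] += int(ok)
    rows = []
    for k in range(n_bins):
        proportion = hits[k] / counts[k] if counts[k] else None
        rows.append((k / n_bins, (k + 1) / n_bins, counts[k], hits[k], proportion))
    return rows


# Invented example data: posterior probability and whether the feature was correct.
example_p = [0.05, 0.12, 0.55, 0.93, 0.97, 1.0]
example_ok = [False, False, True, False, True, True]
for row in bin_posteriors(example_p, example_ok):
    print(row)
```

If the posteriors are well calibrated, the proportion correct in each interval should rise roughly in step with the interval itself.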

**Figure 5**
An AceDB Fmap showing two alternatively spliced isoforms of a gene with WormBase WS52 identifier F54A4.7. The correct gene structures are in blue, with the Fgenesh and GAZE predictions shown in green and red, respectively. Genefinder splice-site predictions are the colored horizontal hooked bars running vertically down the right-hand side of the panel. (a) Both Fgenesh and GAZE fail to correctly identify the initial exon of either isoform. (b) An enlargement of the 5′ ends of the second exons of the two correct gene structures, showing alternative acceptor splice sites. These alternatives are supported by alignments of ESTs to the genome by EST_GENOME (shown in yellow). Although only one of the two alternative acceptors belongs to the predicted gene structure, the posterior feature probabilities reported by GAZE provide evidence for both, as explained in the text.

Cited by

Exploration of plant genomes in the FLAGdb++ environment.
Dèrozier S, Samson F, Tamby JP, Guichard C, Brunaud V, Grevet P, Gagnot S, Label P, Leplé JC, Lecharny A, Aubourg S. Plant Methods. 2011 Mar 29;7:8. doi: 10.1186/1746-4811-7-8. PMID: 21447150.
The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates.
Berthelot C, Brunet F, Chalopin D, Juanchich A, Bernard M, Noël B, Bento P, Da Silva C, Labadie K, Alberti A, Aury JM, Louis A, Dehais P, Bardou P, Montfort J, Klopp C, Cabau C, Gaspin C, Thorgaard GH, Boussaha M, Quillet E, Guyomard R, Galiana D, Bobe J, Volff JN, Genêt C, Wincker P, Jaillon O, Roest Crollius H, Guiguen Y. Nat Commun. 2014 Apr 22;5:3657. doi: 10.1038/ncomms4657. PMID: 24755649.
Complete DNA sequence of Kuraishia capsulata illustrates novel genomic features among budding yeasts (Saccharomycotina).
Morales L, Noel B, Porcel B, Marcet-Houben M, Hullo MF, Sacerdot C, Tekaia F, Leh-Louis V, Despons L, Khanna V, Aury JM, Barbe V, Couloux A, Labadie K, Pelletier E, Souciet JL, Boekhout T, Gabaldon T, Wincker P, Dujon B. Genome Biol Evol. 2013;5(12):2524-39. doi: 10.1093/gbe/evt201. PMID: 24317973.
Sequencing of the smallest Apicomplexan genome from the human pathogen Babesia microti.
Cornillot E, Hadj-Kaddour K, Dassouli A, Noel B, Ranwez V, Vacherie B, Augagneur Y, Brès V, Duclos A, Randazzo S, Carcy B, Debierre-Grockiego F, Delbecq S, Moubri-Ménage K, Shams-Eldin H, Usmani-Brown S, Bringaud F, Wincker P, Vivarès CP, Schwarz RT, Schetters TP, Krause PJ, Gorenflot A, Berry V, Barbe V, Ben Mamoun C. Nucleic Acids Res. 2012 Oct;40(18):9102-14. doi: 10.1093/nar/gks700. Epub 2012 Jul 24. PMID: 22833609.
xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features.
Schlueter SD, Wilkerson MD, Dong Q, Brendel V. Genome Biol. 2006;7(11):R111. doi: 10.1186/gb-2006-7-11-r111. PMID: 17116260.

Grants and funding

WT_/Wellcome Trust/United Kingdom
