Comparative Study
Genome Res. 2002 Sep;12(9):1418-27.
doi: 10.1101/gr.149502.

GAZE: a generic framework for the integration of gene-prediction data by dynamic programming

Kevin L Howe et al. Genome Res. 2002 Sep.

Abstract

We describe a method (implemented in a program, GAZE) for assembling arbitrary evidence for individual gene components (features) into predictions of complete gene structures. Our system is generic in that both the features themselves, and the model of gene structure against which potential assemblies are validated and scored, are external to the system and supplied by the user. GAZE uses a dynamic programming algorithm to obtain the highest scoring gene structure according to the model and posterior probabilities that each input feature is part of a gene. A novel pruning strategy ensures that the algorithm has a run-time effectively linear in sequence length. To demonstrate the flexibility of our system in the incorporation of additional evidence into the gene prediction process, we show how it can be used to both represent nonstandard gene structures (in the form of trans-spliced genes in Caenorhabditis elegans), and make use of similarity information (in the form of Expressed Sequence Tag alignments), while requiring no change to the underlying software. GAZE is available at http://www.sanger.ac.uk/Software/analysis/GAZE.
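The dynamic programming step described in the abstract can be illustrated with a minimal sketch. This is not GAZE's implementation: the `Feature` type, the `best_structure` function, the feature names, and the scores are all hypothetical, and the sketch leaves out length penalties, segment scores, posterior probabilities, and the pruning strategy (so it runs in quadratic rather than near-linear time). It only shows how user-supplied features and 'source → target' rules can be combined to find the highest scoring chain from BEGIN to END.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    kind: str     # hypothetical feature type, e.g. "start", "donor", "stop"
    pos: int      # position on the sequence
    score: float  # score supplied by an external predictor


def best_structure(features, rules):
    """Return (score, path) for the best BEGIN..END chain, or None.

    `rules` is a set of allowed (source_kind, target_kind) pairs.
    Assumes exactly one BEGIN and one END feature and distinct positions.
    """
    feats = sorted(features, key=lambda f: f.pos)
    neg_inf = float("-inf")
    best = [neg_inf] * len(feats)   # best score of a chain ending at feats[i]
    back = [None] * len(feats)      # index of the preceding feature in that chain
    for i, target in enumerate(feats):
        if target.kind == "BEGIN":
            best[i] = target.score
            continue
        for j in range(i):          # every upstream feature is a candidate source
            source = feats[j]
            if best[j] == neg_inf or (source.kind, target.kind) not in rules:
                continue
            candidate = best[j] + target.score
            if candidate > best[i]:
                best[i], back[i] = candidate, j
    end = next(i for i, f in enumerate(feats) if f.kind == "END")
    if best[end] == neg_inf:
        return None
    path, i = [], end
    while i is not None:            # trace back from END to BEGIN
        path.append(feats[i])
        i = back[i]
    return best[end], path[::-1]


# Toy input: one gene with two candidate donor sites (all values invented).
rules = {
    ("BEGIN", "start"), ("start", "donor"), ("donor", "acceptor"),
    ("acceptor", "donor"), ("acceptor", "stop"), ("start", "stop"),
    ("stop", "start"), ("stop", "END"), ("BEGIN", "END"),
}
features = [
    Feature("BEGIN", 0, 0.0), Feature("start", 10, 2.0),
    Feature("donor", 50, 1.5), Feature("donor", 60, -0.5),
    Feature("acceptor", 100, 1.0), Feature("stop", 150, 2.0),
    Feature("END", 200, 0.0),
]
score, path = best_structure(features, rules)
```

Because both the feature list and the rule set are plain inputs, changing the gene model in this sketch means editing data rather than code, which mirrors the generic design the abstract describes.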


Figures

Figure 1
A pictorial representation of a GAZE-XML model for multiple genes on both strands. The features are represented by filled boxes, and 'source → target' rules by different types of arrows, each corresponding to a phase constraint as explained in the text. The labeled circles give the name of the length penalty function used for each pair of features; these functions are defined elsewhere in the configuration file (not shown). The labeled humps indicate the segments that contribute to the score for each pair of features, where "coding" humps are the likely_coding segments referred to in the text. For clarity, the rules for reverse-strand target features are not shown in their entirety; they are formed by a simple reverse complementation of the forward-strand rules. Also omitted are the BEGIN and END features (which mark the two ends of the sequence being searched for genes, and act respectively as source and target to every other feature), as well as the distance, interruption, and DNA constraints explained in the text. The XML configuration file contains a directive to create three separate features for each predicted splice site seen in the GFF file. The effect of this, together with phase constraints between pairs of features giving rise to exons, is to carry forward to the rest of the gene structure whether each intron interrupts a codon at position 0, 1, or 2, allowing us to ensure that the length of the coding part of each predicted gene is divisible by three.
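The arithmetic behind the phase constraints in this caption can be sketched as follows. GAZE itself encodes phase through the three copies of each splice-site feature and the rules in the XML configuration file; the functions `carry_phase`, `intron_phases`, and `is_in_frame` below are hypothetical names used only to illustrate, under that assumption, why carrying the codon position across each intron guarantees a coding length divisible by three.

```python
def carry_phase(phase_in, coding_length):
    """Codon position (0, 1, or 2) reached after adding one coding exon."""
    return (phase_in + coding_length) % 3


def intron_phases(exon_lengths):
    """Phase at which each intron interrupts a codon, in gene order."""
    phase = 0
    phases = []
    for length in exon_lengths[:-1]:  # an intron follows every exon but the last
        phase = carry_phase(phase, length)
        phases.append(phase)
    return phases


def is_in_frame(exon_lengths):
    """True if the total coding length is divisible by three."""
    phase = 0
    for length in exon_lengths:
        phase = carry_phase(phase, length)
    return phase == 0


# Invented exon lengths: 100 + 52 + 61 = 213 coding bases.
example_phases = intron_phases([100, 52, 61])
example_ok = is_in_frame([100, 52, 61])
example_bad = is_in_frame([100, 52, 60])
```

In the rule-based model, each phase value corresponds to one of the three splice-site feature copies, so a donor in phase 1 may only be followed by an acceptor that continues from phase 1.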
Figure 2
(a) Changes necessary to the standard model in Fig. 1 to allow for Caenorhabditis elegans trans-splicing, and (b) the fragment of the XML configuration file affected by the changes.
Figure 3
A GAZE model to allow for trans-spliced genes and untranslated regions. It is a simple extension of the standard model in Fig. 1, which is shown in a pale shade for reference. The transcript_start and transcript_stop features were not predicted a priori for the practical use of this model, but were derived from the starts and ends of EST alignments (see Methods). The "match", "intron", and "span" segments shown are the EST_match, EST_intron, and EST_span segments referred to in the text.
Figure 4
Posterior feature probabilities and their accuracies. Shown for (a) features that are part of gene structures predicted by GAZE and (b) all features given as input to GAZE are the number of features with a posterior probability, p_f, in each interval (bars), the number of those features that were correct (shaded portions of the bars), and the proportion of those features that were correct (line). These data were calculated for the GAZE_EST model. Plots for other models are similar (data not shown).
Figure 5
An AceDB Fmap showing two alternatively spliced isoforms of a gene with WormBase WS52 identifier F54A4.7. The correct gene structures are in blue, with the Fgenesh and GAZE predictions shown in green and red, respectively. Genefinder splice-site predictions are the colored horizontal hooked bars running vertically down the right-hand side of the panel. (a) Both Fgenesh and GAZE fail to correctly identify the initial exon of either isoform. (b) An enlargement of the 5′ ends of the second exons of the two correct gene structures, showing alternative acceptor splice sites. These alternatives are supported by alignments of ESTs to the genome by EST_GENOME (shown in yellow). Although only one of the two alternative acceptors belongs to the predicted gene structure, the posterior feature probabilities reported by GAZE provide evidence for both, as explained in the text.
