Genome Biol. 2007;8(12):R269.
doi: 10.1186/gb-2007-8-12-r269.

CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction

Samuel S Gross et al. Genome Biol. 2007.

Abstract

We describe CONTRAST, a gene predictor which directly incorporates information from multiple alignments rather than employing phylogenetic models. This is accomplished through the use of discriminative machine learning techniques, including a novel training algorithm. We use a two-stage approach, in which a set of binary classifiers designed to recognize coding region boundaries is combined with a global model of gene structure. CONTRAST predicts exact coding region structures for 65% more human genes than the previous state-of-the-art method, misses 46% fewer exons and displays comparable gains in specificity.

Figures

Figure 1
Start and stop codon classifier accuracy increases as informants are added. The graph shows the generalization accuracy of CONTRAST's start and stop codon classifiers as more informants are added. The x-axis labels indicate the most recently added informant. For example, at the point labeled 'chicken', the informant set consists of mouse, opossum, dog and chicken.
Figure 2
Splice site classifier accuracy increases as informants are added. The graph shows the generalization accuracy of CONTRAST's donor and acceptor splice site classifiers as more informants are added. The x-axis labels indicate the most recently added informant. For example, at the point labeled 'chicken', the informant set consists of mouse, opossum, dog and chicken.
Figure 3
Part of a typical set of input data. The input data consists of 13 rows. The first row contains sequence from the target genome, the second to twelfth rows contain aligned sequence from informant genomes and the last row encodes information about the alignments of ESTs to the target genome.
Figure 4
The structure of labelings in CONTRAST. Each node in the graph is a possible label for a single position in the target sequence. A labeling is legal if it corresponds to a path through the graph.
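The legality rule in this caption can be illustrated with a small sketch. The three-label graph below is a toy assumption for illustration only; it is not the label graph used by CONTRAST, and the names TOY_GRAPH and is_legal_labeling are hypothetical. A labeling is accepted when every consecutive pair of labels is joined by an edge, that is, when the labeling traces a path through the graph.

```python
# Toy label graph (an assumption, NOT the CONTRAST graph):
# each key is a label, each value is the set of labels allowed
# at the next position in the target sequence.
TOY_GRAPH = {
    "intergenic": {"intergenic", "exon"},
    "exon": {"exon", "intron", "intergenic"},
    "intron": {"intron", "exon"},
}


def is_legal_labeling(labels, graph):
    """Return True if the per-position labels form a path through the graph."""
    # Every label must be a node of the graph.
    if any(label not in graph for label in labels):
        return False
    # Every consecutive pair of labels must be connected by an edge.
    return all(nxt in graph[cur] for cur, nxt in zip(labels, labels[1:]))


legal = ["intergenic", "exon", "exon", "intron", "exon", "intergenic"]
illegal = ["intergenic", "intron", "exon"]  # no intergenic -> intron edge

print(is_legal_labeling(legal, TOY_GRAPH))
print(is_legal_labeling(illegal, TOY_GRAPH))
```

In this toy graph an intron can only be entered from an exon, so the second labeling is rejected.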
Figure 5
Features that score a label based on local sequence. CONTRAST contains three types of features for scoring a label based on local sequence: features based on hexamers in the target sequence (shown in blue), features based on a trimer in the target sequence and a trimer in an informant alignment (shown in red) and features based on a position in the EST sequence (shown in green).
Figure 6
Features that score coding region boundaries. CONTRAST contains two types of feature for scoring coding region boundaries. The first, shown in red, maps the output of a classifier to a score using a piecewise linear function learned during CRF training. In this example, the score from the GT splice donor classifier falls between the fourth and fifth control points for the function, with interpolation coefficients of 0.312 and 0.688. The second type of feature, shown in green, scores a coding region boundary based on the EST sequence characters that flank it.
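The piecewise linear mapping in this caption can be sketched as follows. The control point positions and weights are made-up values, chosen only so that a classifier output of 0.688 falls between the fourth and fifth control points with coefficients 0.312 and 0.688, as in the caption's example. The actual control points, the learned weights, and which coefficient belongs to which control point are not stated here, so all of these are assumptions, as are the function names.

```python
from bisect import bisect_right


def interpolation_coefficients(x, knots):
    """Locate x between two adjacent control points.

    Returns (i, a, b): x lies between knots[i] and knots[i + 1],
    a is the coefficient on knots[i], b on knots[i + 1], and a + b == 1.
    """
    if not knots[0] <= x <= knots[-1]:
        raise ValueError("x lies outside the control points")
    # Index of the left control point; clamp so x == knots[-1] still works.
    i = min(bisect_right(knots, x) - 1, len(knots) - 2)
    b = (x - knots[i]) / (knots[i + 1] - knots[i])
    return i, 1.0 - b, b


def piecewise_linear_score(x, knots, weights):
    """Map a classifier output x to a score by linear interpolation
    between the weights attached to the two surrounding control points."""
    i, a, b = interpolation_coefficients(x, knots)
    return a * weights[i] + b * weights[i + 1]


# Hypothetical control points and weights (assumptions for illustration).
knots = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
weights = [0.0, 0.0, 0.0, 1.0, 2.0, 0.0]

i, a, b = interpolation_coefficients(0.688, knots)
print(i + 1, i + 2)  # 1-based positions of the surrounding control points
print(round(a, 3), round(b, 3))
print(round(piecewise_linear_score(0.688, knots, weights), 3))
```

Because the score is linear in the control point weights, those weights can be trained as ordinary feature weights, which is consistent with the caption's statement that the function is learned during CRF training.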
