Genome Biol. 2007;8(12):R269.

doi: 10.1186/gb-2007-8-12-r269.

CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction

Samuel S Gross¹, Chuong B Do, Marina Sirota, Serafim Batzoglou

PMID: 18096039
PMCID: PMC2246271
DOI: 10.1186/gb-2007-8-12-r269

Samuel S Gross et al. Genome Biol. 2007.

Affiliation

¹ Computer Science Department, Stanford University, Stanford, CA, USA. ssgross@cs.stanford.edu

Abstract

We describe CONTRAST, a gene predictor which directly incorporates information from multiple alignments rather than employing phylogenetic models. This is accomplished through the use of discriminative machine learning techniques, including a novel training algorithm. We use a two-stage approach, in which a set of binary classifiers designed to recognize coding region boundaries is combined with a global model of gene structure. CONTRAST predicts exact coding region structures for 65% more human genes than the previous state-of-the-art method, misses 46% fewer exons and displays comparable gains in specificity.

Figures

**Figure 1**
**Start and stop codon classifier accuracy increases as informants are added**. The graph shows the generalization accuracy of CONTRAST's start and stop codon classifiers as more informants are added. The x-axis labels indicate the most recently added informant. For example, at the point labeled 'chicken', the informant set consists of mouse, opossum, dog and chicken.

**Figure 2**
**Splice site classifier accuracy increases as informants are added**. The graph shows the generalization accuracy of CONTRAST's donor and acceptor splice site classifiers as more informants are added. The x-axis labels indicate the most recently added informant. For example, at the point labeled 'chicken', the informant set consists of mouse, opossum, dog and chicken.

**Figure 3**
**Part of a typical set of input data**. The input data consists of 13 rows. The first row contains sequence from the target genome, the second to twelfth rows contain aligned sequence from informant genomes and the last row encodes information about the alignments of ESTs to the target genome.

**Figure 4**
**The structure of labelings in CONTRAST**. Each node in the graph is a possible label for a single position in the target sequence. A labeling is legal if it corresponds to a path through the graph.
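The legality rule in this caption can be illustrated with a minimal sketch. The label names and edges below are illustrative assumptions only; the actual nodes and edges of the graph in Figure 4 are not given in this record.

```python
# Toy label graph. The node names and edges are hypothetical and do NOT
# reproduce the graph shown in Figure 4.
ALLOWED = {
    "intergenic": {"intergenic", "exon"},
    "exon": {"exon", "intron", "intergenic"},
    "intron": {"intron", "exon"},
}


def is_legal(labeling):
    """Return True if the labeling traces a path through the label graph.

    Every label must be a node of the graph, and every pair of consecutive
    labels must be joined by an edge.
    """
    if any(label not in ALLOWED for label in labeling):
        return False
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(labeling, labeling[1:]))
```

Under these assumed edges, a labeling that moves from `intergenic` directly to `intron` is rejected because no edge joins those two nodes.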

**Figure 5**
**Features that score a label based on local sequence**. CONTRAST contains three types of features for scoring a label based on local sequence: features based on hexamers in the target sequence (shown in blue), features based on a trimer in the target sequence and a trimer in an informant alignment (shown in red) and features based on a position in the EST sequence (shown in green).

**Figure 6**
**Features that score coding region boundaries**. CONTRAST contains two types of feature for scoring coding region boundaries. The first, shown in red, maps the output of a classifier to a score using a piecewise linear function learned during CRF training. In this example, the score from the GT splice donor classifier falls between the fourth and fifth control points for the function, with interpolation coefficients of 0.312 and 0.688. The second type of feature, shown in green, scores a coding region boundary based on the EST sequence characters that flank it.
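The piecewise linear mapping described in this caption can be sketched as follows. The control point positions, the learned values, the clamping of out-of-range scores, and the function names are all assumptions for illustration; only the coefficients 0.312 and 0.688 come from the caption, and the caption does not say which coefficient belongs to which control point.

```python
from bisect import bisect_right


def interpolation_coefficients(x, control_points):
    """Return (i, a, b) such that x = a*control_points[i] + b*control_points[i+1].

    `control_points` must be sorted and have at least two entries.
    Assumption: scores outside the covered range are clamped to its ends.
    """
    x = min(max(x, control_points[0]), control_points[-1])
    i = min(bisect_right(control_points, x) - 1, len(control_points) - 2)
    lo, hi = control_points[i], control_points[i + 1]
    b = (x - lo) / (hi - lo)
    return i, 1.0 - b, b


def piecewise_linear_score(x, control_points, values):
    """Map a classifier output x to a score by linear interpolation."""
    i, a, b = interpolation_coefficients(x, control_points)
    return a * values[i] + b * values[i + 1]


# Hypothetical control points and learned values (not from the paper).
points = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
values = [0.0, 0.5, 1.0, 2.0, 4.0, 4.5]
index, a, b = interpolation_coefficients(1.688, points)
```

With these made-up numbers, a classifier output of 1.688 falls between the fourth and fifth control points (`index` is 3, counting from zero), with coefficients of roughly 0.312 and 0.688. Because the feature is linear in `values`, the learned quantities enter the score as ordinary weights.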

Cited by

Hamada M, Kiryu H, Iwasaki W, Asai K. Generalized centroid estimators in bioinformatics. PLoS One. 2011 Feb 18;6(2):e16450. doi: 10.1371/journal.pone.0016450. PMID: 21365017.
Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, Krüger N, Sonnenburg S, Rätsch G. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009 Nov;19(11):2133-43. doi: 10.1101/gr.090597.108. Epub 2009 Jun 29. PMID: 19564452.
Alioto T, Picardi E, Guigó R, Pesole G. ASPic-GeneID: a lightweight pipeline for gene prediction and alternative isoforms detection. Biomed Res Int. 2013;2013:502827. doi: 10.1155/2013/502827. Epub 2013 Nov 7. PMID: 24308000.
Armstrong J, Fiddes IT, Diekhans M, Paten B. Whole-Genome Alignment and Comparative Annotation. Annu Rev Anim Biosci. 2019 Feb 15;7:41-64. doi: 10.1146/annurev-animal-020518-115005. Epub 2018 Oct 31. PMID: 30379572. Review.
König S, Romoth LW, Gerischer L, Stanke M. Simultaneous gene finding in multiple genomes. Bioinformatics. 2016 Nov 15;32(22):3388-3395. doi: 10.1093/bioinformatics/btw494. Epub 2016 Jul 27. PMID: 27466621.

Grants and funding

T15 LM007033/LM/NLM NIH HHS/United States
