Comparative Study

PLoS Comput Biol. 2007 Feb 23;3(2):e20.

doi: 10.1371/journal.pcbi.0030020. Epub 2006 Dec 21.

Improving the Caenorhabditis elegans genome annotation using machine learning

Gunnar Rätsch¹, Sören Sonnenburg, Jagan Srinivasan, Hanh Witte, Klaus-R Müller, Ralf-J Sommer, Bernhard Schölkopf

PMID: 17319737
PMCID: PMC1808025
DOI: 10.1371/journal.pcbi.0030020

Gunnar Rätsch et al. PLoS Comput Biol. 2007.

Affiliation

¹ Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany. Gunnar.Raetsch@tuebingen.mpg.de

Abstract

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

Conflict of interest statement

Competing interests. GR, SS, KRM, and BS are authors of a patent application (PCT WO05116246) related to the technical innovations of the proposed method.

Figures

**Figure 1. Simplified Support Vector Machine**
Learn a function f such that the difference of predictions (the *margin*) of positively and negatively labeled examples is maximal. Previously unseen examples will often be close to the training examples. The large margin then ensures that these examples are correctly classified as well, i.e., the decision rule *generalizes*.

**Figure 2. Given Two Sequences, s₁ and s₂, of Equal Length, Our Kernel Consists of a Weighted Sum to Which Each Match in the Sequences Makes a Contribution w_l Depending on Its Length l, Where Longer Matches Contribute More Significantly**
For predictions, we use a window of 140 nt around the potential splice site (cf. Materials and Methods for details, including the procedure of how the length of the window is determined).
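
The kernel described in this caption can be illustrated with a minimal sketch. It assumes the common k-mer formulation of a weighted degree kernel, in which every substring of length d that matches at the same position in both sequences adds a weight. The function name `wd_kernel`, the parameter `max_degree`, and the weighting scheme `beta` are assumptions made for illustration and are not taken from the paper, whose actual weights w_l may differ.

```python
def wd_kernel(s1: str, s2: str, max_degree: int = 3) -> float:
    """Illustrative weighted degree kernel for two equal-length sequences.

    Assumption: weights beta_d = 2 * (D - d + 1) / (D * (D + 1)), a common
    choice for this kernel family, not necessarily the paper's w_l.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    n = len(s1)
    total = 0.0
    for d in range(1, max_degree + 1):
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        # Count positions where the length-d substrings agree exactly.
        matches = sum(1 for i in range(n - d + 1) if s1[i:i + d] == s2[i:i + d])
        total += beta * matches
    return total


# Three matching positions in a row score higher than three scattered ones,
# because a contiguous match is also counted at the higher degrees.
contiguous = wd_kernel("ACGT", "ACGA", max_degree=2)
scattered = wd_kernel("ACGT", "AGGT", max_degree=2)
```

Because a long contiguous match contains many shorter matching substrings, its total contribution grows with its length, which is the behavior the caption describes.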

**Figure 3. Given the Start of the First and the End of the Last Exon, Our System (*mSplicer*) First Scans the Sequence Using SVM Detectors Trained To Recognize Donor (SVM_GY) and Acceptor (SVM_AG) Splice Sites**
The detectors assign a score to each candidate site, shown below the sequence. In combination with additional information including outputs of SVMs recognizing exon/intron content, and scores for exon/intron lengths (unpublished data), these splice site scores contribute to the cumulative score for a putative splicing isoform. The bottom graph (step 2) illustrates the computation of the cumulative scores for two splicing isoforms, where the score at the end of the sequence is the final score of the isoform. The contributions of the individual detector outputs, lengths of segments, as well as properties of the segments to the score are adjusted during training. They are optimized such that the *margin* between the true splicing isoform (shown in blue) and all other (wrong) isoforms (one of them is shown in red) is maximized. Prediction of new sequences works by selecting the splicing isoform with the maximum cumulative score. This can be implemented using dynamic programming related to decoding generalized HMMs [12], which also allows one to enforce certain constraints on the isoform (e.g., an open reading frame).
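
The selection of the maximum-scoring isoform can be sketched with a small dynamic program. This is a heavily simplified illustration, not the mSplicer implementation: it assumes the cumulative score is just the sum of splice site scores, and it ignores the content scores, length scores, and reading-frame constraints mentioned above. The function name `best_isoform` and the input format are hypothetical.

```python
def best_isoform(sites):
    """Pick the best alternating donor/acceptor chain by total score.

    sites: iterable of (position, kind, score), kind is "donor" or "acceptor".
    Returns (score, chosen) where chosen lists the selected (position, kind).
    Assumption: the isoform score is the plain sum of site scores.
    """
    best_exon = (0.0, [])   # best path currently inside an exon
    best_intron = None      # best path currently inside an intron
    for pos, kind, score in sorted(sites):
        if kind == "donor":
            # Leaving an exon: extend the best exon path with this donor.
            cand = (best_exon[0] + score, best_exon[1] + [(pos, kind)])
            if best_intron is None or cand[0] > best_intron[0]:
                best_intron = cand
        elif best_intron is not None:
            # Leaving an intron: extend the best intron path with this acceptor.
            cand = (best_intron[0] + score, best_intron[1] + [(pos, kind)])
            if cand[0] > best_exon[0]:
                best_exon = cand
    # A transcript must end inside an exon.
    return best_exon


candidate_sites = [
    (10, "donor", 2.0), (15, "donor", 0.5), (20, "acceptor", 1.5),
    (30, "donor", -3.0), (40, "acceptor", 1.0),
]
score, chosen = best_isoform(candidate_sites)
```

With no candidate sites the sketch returns the single-exon isoform with score 0. A fuller version would track exon and intron lengths per segment, which is what makes the decoding resemble a generalized HMM rather than a plain two-state one.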

**Figure 4. An Elementary State Model for Unspliced mRNA**
The 5′ end of the transcript is either directly followed by the 3′ end (single exon gene) or by an arbitrary number of donor–acceptor splice site pairs exhibiting the GT/GC and AG dimer. A transition in this state model corresponds to *accepting* a whole segment (as in generalized HMMs [12]), i.e., an exon or intron, with the corresponding dimer at the 3′ boundary of the segment (except in state 4).

**Figure 5. The State Model That Uses Open Reading Frame Information**
The sequences next to the state indicate which consensus has to appear at the transitions between intron (capital) and exon (bold). Here, we use the IUPAC code for ambiguous nucleotides (e.g., B = C/G/T, R = A/G, Y = C/T). The digit on the transition arrows is related to the reading frame and indicates the required frame shift to follow the transition (e.g., between state 1 and 2, one can only accept exons leading to a frame shift of 0). Also, it defines in which frame stop codons are allowed to occur—no stop codon should appear in-frame. Finally, the model is constructed such that in-frame stop codons cannot be assembled on the exon boundaries (this required the three additional state pairs 6/7, 10/11, and 12/13).
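
As a small illustration of the IUPAC ambiguity codes used in this caption, the following sketch checks whether a nucleotide string matches a consensus pattern. The table is the standard IUPAC nucleotide code; the function name `matches_consensus` and the example patterns are hypothetical and are not taken from the paper's state model.

```python
# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_consensus(seq: str, pattern: str) -> bool:
    """Return True if every base of seq is allowed by the IUPAC pattern."""
    if len(seq) != len(pattern):
        return False
    return all(
        base in IUPAC[code]
        for base, code in zip(seq.upper(), pattern.upper())
    )


# "GY" covers both donor dimers named in Figure 4 (GT and GC).
donor_gt = matches_consensus("GT", "GY")
donor_gc = matches_consensus("GC", "GY")
donor_ga = matches_consensus("GA", "GY")
```

A transition in a state model like the one shown could use such a check to decide whether the nucleotides at a candidate boundary are compatible with the consensus required for that transition.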

**Figure 6. POIMs for Donor (Left) and Acceptor (Right) SVM Classifiers**
Shown are the color-coded importance scores of substring lengths for positions around the splice sites. Near the splice site, many important oligomers are identified. Particularly long substrings are important upstream of the donor and downstream of the acceptor site. See the main text for discussion.

References

1. Harris T, Chen N, Cunningham F, et al. Wormbase: A multi-species resource for nematode biology and genomics. Nucleic Acids Res. 2004;32:D411–D417.
2. The Caenorhabditis elegans sequencing consortium. Genome sequence of the Nematode Caenorhabditis elegans. A platform for investigating biology. Science. 1998;282:2012–2018.
3. Schwarz E, Antoshechkin I, Bastiani C, et al. Wormbase: Better software, richer content. Nucleic Acids Res. 2006;34:D475–D478.
4. Vapnik V. The nature of statistical learning theory. New York: Springer Verlag; 1995.
5. Schölkopf B, Smola AJ. Learning with kernels. Cambridge (Massachusetts): MIT Press; 2002.
