Comparative Study
2007 Feb 23;3(2):e20.
doi: 10.1371/journal.pcbi.0030020. Epub 2006 Dec 21.

Improving the Caenorhabditis elegans genome annotation using machine learning

Gunnar Rätsch et al. PLoS Comput Biol.

Abstract

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

Conflict of interest statement

Competing interests. GR, SS, KRM, and BS are authors of a patent application (PCT WO05116246) related to the technical innovations of the proposed method.

Figures

Figure 1. Simplified Support Vector Machine
Learn a function f such that the difference of predictions (the margin) of positively and negatively labeled examples is maximal. Previously unseen examples will often be close to the training examples. The large margin then ensures that these examples are correctly classified as well, i.e., the decision rule generalizes.
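
The margin described in this caption can be made concrete with a small sketch. It assumes a linear function f(x) = w·x + b and only evaluates the margin for a given f; it does not train an SVM. The names `linear_f` and `prediction_margin` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def linear_f(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Assumed linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def prediction_margin(
    w: Sequence[float],
    b: float,
    positives: Sequence[Sequence[float]],
    negatives: Sequence[Sequence[float]],
) -> float:
    """Gap between the lowest positive and the highest negative prediction."""
    lowest_positive = min(linear_f(w, b, x) for x in positives)
    highest_negative = max(linear_f(w, b, x) for x in negatives)
    return lowest_positive - highest_negative


# Toy data, invented for illustration only.
margin = prediction_margin(
    w=(1.0, 0.0),
    b=0.0,
    positives=[(2.0, 0.0), (3.0, 1.0)],
    negatives=[(-1.0, 0.0), (-2.0, 5.0)],
)
```

A positive value means every positive example is scored above every negative one. An actual SVM would search for the f that maximizes this gap under a bound on the norm of w.
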
Figure 2. Given Two Sequences, s_1 and s_2 of Equal Length, Our Kernel Consists of a Weighted Sum to Which Each Match in the Sequences Makes a Contribution w_l Depending on Its Length l, Where Longer Matches Contribute More Significantly
For predictions, we use a window of 140 nt around the potential splice site (cf. Materials and Methods for details, including the procedure of how the length of the window is determined).
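
A minimal sketch of a kernel of this kind follows. It finds each maximal run of matching positions between two equal-length sequences and sums a weight that grows with the run length. The weight function `block_weight` is an assumption chosen for illustration (the number of substrings contained in a run of length l); the paper's actual weights are not given in this caption.

```python
from typing import Callable, List


def match_run_lengths(s1: str, s2: str) -> List[int]:
    """Lengths of maximal runs of consecutive positions where s1 and s2 agree."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    runs: List[int] = []
    current = 0
    for a, b in zip(s1, s2):
        if a == b:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def block_weight(length: int) -> float:
    """Hypothetical weight w_l: grows faster than linearly with match length."""
    return length * (length + 1) / 2


def match_kernel(s1: str, s2: str,
                 weight: Callable[[int], float] = block_weight) -> float:
    """Weighted sum over all maximal matches between the two sequences."""
    return sum(weight(length) for length in match_run_lengths(s1, s2))
```

For example, `match_kernel("ACGTAC", "ACCTAG")` sums the weights of two matches of length 2, while two identical sequences of length 4 form one match of length 4 and score higher than two separate matches of length 2.
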
Figure 3. Given the Start of the First and the End of the Last Exon, Our System (mSplicer) First Scans the Sequence Using SVM Detectors Trained To Recognize Donor (SVM_GY) and Acceptor (SVM_AG) Splice Sites
The detectors assign a score to each candidate site, shown below the sequence. In combination with additional information including outputs of SVMs recognizing exon/intron content, and scores for exon/intron lengths (unpublished data), these splice site scores contribute to the cumulative score for a putative splicing isoform. The bottom graph (step 2) illustrates the computation of the cumulative scores for two splicing isoforms, where the score at the end of the sequence is the final score of the isoform. The contributions of the individual detector outputs, lengths of segments, as well as properties of the segments to the score are adjusted during training. They are optimized such that the margin between the true splicing isoform (shown in blue) and all other (wrong) isoforms (one of them is shown in red) is maximized. Prediction of new sequences works by selecting the splicing isoform with the maximum cumulative score. This can be implemented using dynamic programming related to decoding generalized HMMs [12], which also allows one to enforce certain constraints on the isoform (e.g., an open reading frame).
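
The selection of the maximum-scoring isoform can be sketched with a much simplified dynamic program. The sketch below assumes that only the donor and acceptor detector scores contribute to the cumulative score; it leaves out the content scores, length scores, and reading frame constraints mentioned above. The function name `best_isoform` and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def best_isoform(donors: Dict[int, float],
                 acceptors: Dict[int, float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the best cumulative score and its introns as (donor, acceptor) pairs.

    Splice sites must alternate donor, acceptor, donor, ... along the sequence.
    An empty intron list (single exon) has score 0.
    """
    sites = sorted([(p, "D", s) for p, s in donors.items()] +
                   [(p, "A", s) for p, s in acceptors.items()])
    # best[i] = (score of the best alternating chain ending at site i, previous site index)
    best: List[Tuple[float, int]] = []
    for i, (pos, kind, score) in enumerate(sites):
        # A donor may open a chain; an acceptor needs a preceding donor.
        prev_score, prev_idx = (0.0, -1) if kind == "D" else (float("-inf"), -1)
        for j in range(i):
            prev_pos, prev_kind, _ = sites[j]
            if prev_kind != kind and prev_pos < pos and best[j][0] > prev_score:
                prev_score, prev_idx = best[j][0], j
        best.append((prev_score + score, prev_idx))

    # Complete isoforms end on an acceptor; compare against the single-exon baseline.
    top_score, top_idx = 0.0, -1
    for i, (_, kind, _) in enumerate(sites):
        if kind == "A" and best[i][0] > top_score:
            top_score, top_idx = best[i][0], i

    # Trace back the chain of sites and pair them up into introns.
    chain: List[int] = []
    while top_idx != -1:
        chain.append(sites[top_idx][0])
        top_idx = best[top_idx][1]
    chain.reverse()
    return top_score, list(zip(chain[0::2], chain[1::2]))


score, introns = best_isoform(
    donors={10: 2.0, 30: -1.0, 50: 1.5},
    acceptors={25: 1.0, 40: 0.5, 80: 2.5},
)
```

With these invented scores the low-scoring donor at position 30 is skipped, and the selected isoform has two introns. This quadratic-time version is meant for reading; a segment-based decoder as in generalized HMMs would also score each exon and intron as a whole.
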
Figure 4. An Elementary State Model for Unspliced mRNA
The 5′ end of the transcript is either directly followed by the 3′ end (single exon gene) or by an arbitrary number of donor–acceptor splice site pairs exhibiting the GT/GC and AG dimer. A transition in this state model corresponds to accepting a whole segment (as in generalized HMMs [12]), i.e., an exon or intron, with the corresponding dimer at the 3′ boundary of the segment (except in state 4).
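
The dimer requirement of this state model can be checked with a short sketch. It assumes introns are given as half-open coordinates (start inclusive, end exclusive) on the unspliced sequence, so that an intron begins with the donor dimer and ends with the acceptor dimer. The function name and coordinate convention are assumptions made for illustration.

```python
from typing import Iterable, Tuple

DONOR_DIMERS = ("GT", "GC")
ACCEPTOR_DIMER = "AG"


def introns_have_consensus(seq: str, introns: Iterable[Tuple[int, int]]) -> bool:
    """True if every intron starts with GT or GC and ends with AG.

    An empty intron list corresponds to a single exon gene and is accepted.
    """
    seq = seq.upper()
    return all(
        seq[start:start + 2] in DONOR_DIMERS and seq[end - 2:end] == ACCEPTOR_DIMER
        for start, end in introns
    )
```

For the toy sequence `"ATGGTAAAAGCC"`, the intron `(3, 10)` spans `GTAAAAG` and passes the check, while `(2, 10)` starts with `GG` and is rejected.
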
Figure 5. The State Model That Uses Open Reading Frame Information
The sequences next to the state indicate which consensus has to appear at the transitions between intron (capital) and exon (bold). Here, we use the IUPAC code for ambiguous nucleotides (e.g., B = C/G/T, R = A/G, Y = C/T). The digit on the transition arrows is related to the reading frame and indicates the required frame shift to follow the transition (e.g., between state 1 and 2, one can only accept exons leading to a frame shift of 0). Also, it defines in which frame stop codons are allowed to occur—no stop codon should appear in-frame. Finally, the model is constructed such that in-frame stop codons cannot be assembled on the exon boundaries (this required the three additional state pairs 6/7, 10/11, and 12/13).
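
Two ideas from this caption lend themselves to a short sketch: matching a sequence against an IUPAC consensus, and detecting a stop codon that only appears once two exons are joined. The IUPAC table is the standard nucleotide code; the function names and the toy exons are hypothetical, and the sketch does not reproduce the state model itself.

```python
from typing import List, Sequence

# Standard IUPAC nucleotide codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def matches_consensus(seq: str, pattern: str) -> bool:
    """True if seq matches the IUPAC pattern position by position."""
    return len(seq) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(seq.upper(), pattern.upper())
    )


def in_frame_stops(exons: Sequence[str], frame: int = 0) -> List[int]:
    """Start offsets of in-frame stop codons in the spliced (concatenated) exons."""
    spliced = "".join(exons).upper()
    return [i for i in range(frame, len(spliced) - 2, 3)
            if spliced[i:i + 3] in STOP_CODONS]


# Neither toy exon contains an in-frame stop on its own, but joining them
# assembles TAA across the exon boundary.
stops = in_frame_stops(["ATGT", "AACC"])
```

Here `matches_consensus("GC", "GY")` holds because Y stands for C or T. The example shows why a check restricted to individual exons is not enough: the stop codon exists only in the spliced sequence, which is the case the additional state pairs are described as preventing.
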
Figure 6. POIMs for Donor (Left) and Acceptor (Right) SVM Classifiers
Shown are the color-coded importance scores of substring lengths for positions around the splice sites. Near the splice site, many important oligomers are identified. Particularly long substrings are important upstream of the donor and downstream of the acceptor site. See the main text for discussion.

References

    1. Harris T, Chen N, Cunningham F, et al. Wormbase: A multi-species resource for nematode biology and genomics. Nucleic Acids Res. 2004;32:D411–D417.
    2. The Caenorhabditis elegans sequencing consortium. Genome sequence of the Nematode Caenorhabditis elegans. A platform for investigating biology. Science. 1998;282:2012–2018.
    3. Schwarz E, Antoshechkin I, Bastiani C, et al. Wormbase: Better software, richer content. Nucleic Acids Res. 2006;34:D475–D478.
    4. Vapnik V. The nature of statistical learning theory. New York: Springer Verlag; 1995.
    5. Schölkopf B, Smola AJ. Learning with kernels. Cambridge (Massachusetts): MIT Press; 2002.