Bioinformatics. 2008 Jul 1;24(13):i6-14.
doi: 10.1093/bioinformatics/btn170.

POIMs: positional oligomer importance matrices--understanding support vector machine-based signal detectors

Sören Sonnenburg et al. Bioinformatics. 2008.

Abstract

Motivation: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts.

Results: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the underlying correlation structure of k-mer features induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant for the investigated biological phenomena.

Availability: All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Example of a WD kernel of order K=3. In the case shown, k(x, x′) = 21β_1 + 8β_2 + 3β_3.
Fig. 2.
Substrings, superstrings, left partial overlaps and right partial overlaps: definition and examples for the string AATACGTAC.
Fig. 3.
The theorem which enables efficient POIM computation for zeroth-order Markov chains (Zien et al., 2007).
Fig. 4.
Two (k−1)-mers are covered by a k-mer.
Fig. 5.
Comparison of different visualization techniques for the fixed-position-motifs experiment. Motifs GATTACA and AGTAGTG were inserted at positions 10 and 30, respectively, with a growing level of mutation (i.e. number of nucleotides randomly substituted in the motifs) from left to right. SVM classifiers were trained to distinguish random sequences from sequences with the (mutated) motifs GATTACA and AGTAGTG inserted. (A–D) We computed Differential POIMs [Equation (9)] for up to 8mers, from a WD-kernel SVM of order 20. Here each figure displays the importance of k-mer lengths (y-axis) for k=1 … 8 at each position (x-axis) (i=1 … 50) as a heat map. Red and yellow colors denote relevant motifs; dark blue corresponds to motifs not conveying information about the problem. 1mers are at the bottom of the plot, 8mers at the top. (E–H) The k-mer scoring overview (SVM-w) was computed using the same setup as for differential POIMs. The SVM-w is again displayed as a heat map. (I–L) The weighting was obtained using MKL (averaged over 100 bootstrap runs; Rätsch et al., 2006). Again the result is displayed as a heat map, but for 1- to 7mers only. For a more detailed discussion see text.
Fig. 6.
(A) versus sequence logo (B) for motifs GATTACA and AGTAGTG at positions 10 and 30, respectively, with 4-out-of-7 mutations in the motifs.
Fig. 7.
Comparison of different visualization techniques for the varying-positions-motif experiment. The mutated motif GATTACA was inserted at positions 0 ± 13 in uniformly distributed sequences. (A–C) show the Differential POIM matrices [cf. Equation (9)] as a heat map, the POIM weight mass for different k=1 … 8 and the POIM k-mer diversity for k=3 as a heat map; (D–F) show the SVM-w overview plot as a heat map, the SVM-w weight mass also for k=1 … 8 and the k-mer diversity for k=3 as a heat map; (G) shows the sequence logo.
Fig. 8.
Comparison of different visualization techniques for the C. elegans splice data set based on (A–C) POIM matrices versus (D–F) weight matrices. Position 0 is the splice site.
Fig. 9.
POIM visualization for the TSSs of D. melanogaster. (A–C): differential POIMs, POIM weight mass for k=1 … 8 and POIMs for k=3 are displayed.
Fig. 10.
POIM visualization for the trans-splicing control elements of C. elegans. (A, B) (A) displays differential POIMs and (B) the POIM weight mass for k=1 … 8. (C, D) POIMs for k=2 and POIM k-mer diversity.
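
The WD kernel value quoted in the Fig. 1 caption is a weighted count of matching k-mers: the coefficients 21, 8 and 3 are the numbers of positions at which the two sequences share an identical 1-mer, 2-mer and 3-mer, respectively. The following is a minimal sketch of that computation, not the authors' implementation. The function names (`match_counts`, `wd_kernel`, `default_betas`) and the toy sequences are hypothetical, and the default weighting β_k = 2(K−k+1)/(K(K+1)) is a commonly used choice for WD kernels that is assumed here rather than taken from this page.

```python
from typing import List, Sequence


def match_counts(x: str, y: str, order: int) -> List[int]:
    """For k = 1..order, count positions i where x[i:i+k] == y[i:i+k]."""
    if len(x) != len(y):
        raise ValueError("WD kernel compares sequences of equal length")
    counts = []
    for k in range(1, order + 1):
        # Only start positions that leave room for a full k-mer are considered.
        counts.append(sum(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1)))
    return counts


def default_betas(order: int) -> List[float]:
    """Assumed weighting: beta_k = 2(K - k + 1) / (K(K + 1)); weights sum to 1."""
    return [2.0 * (order - k + 1) / (order * (order + 1)) for k in range(1, order + 1)]


def wd_kernel(x: str, y: str, betas: Sequence[float]) -> float:
    """Weighted degree kernel: sum over k of beta_k times the k-mer match count."""
    counts = match_counts(x, y, len(betas))
    return sum(beta * count for beta, count in zip(betas, counts))


if __name__ == "__main__":
    x, y = "AACGT", "AACTT"  # toy sequences, not taken from the paper
    print(match_counts(x, y, 3))            # per-order match counts
    print(wd_kernel(x, y, default_betas(3)))
```

For the toy pair above, the sequences agree at four single positions, two 2-mers and one 3-mer, so the kernel value is 4β_1 + 2β_2 + β_3, the same form as the expression in the Fig. 1 caption.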
