Bioinformatics. 2008 Jul 1;24(13):i6-14.

doi: 10.1093/bioinformatics/btn170.

POIMs: positional oligomer importance matrices--understanding support vector machine-based signal detectors

Sören Sonnenburg¹, Alexander Zien, Petra Philips, Gunnar Rätsch

PMID: 18586746
PMCID: PMC2718648
DOI: 10.1093/bioinformatics/btn170

Sören Sonnenburg et al. Bioinformatics. 2008.

Affiliation

¹ Fraunhofer Institute FIRST, Department IDA, Kekuléstr. 7, 12489 Berlin, Germany. Soeren.Sonnenburg@first.fraunhofer.de

Abstract

Motivation: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a major shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts.

Results: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the underlying correlation structure of k-mer features induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant for the investigated biological phenomena.
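The idea of accounting for overlaps can be illustrated with a small brute-force sketch. It assumes that a POIM entry for a k-mer z at position j is the expected classifier score over sequences that contain z at j, minus the unconditional expected score, with sequences drawn uniformly and independently per position. The toy scorer, the `weights` layout and the function names are hypothetical and chosen for illustration; the exhaustive enumeration below is not the efficient algorithm the abstract refers to.

```python
from itertools import product

ALPHABET = "ACGT"


def score(x, weights):
    """Toy linear scorer (hypothetical): sum the weights of all positional k-mers present in x.

    `weights` maps (kmer, position) to a real-valued weight, mimicking raw SVM k-mer weights.
    """
    return sum(w for (kmer, pos), w in weights.items()
               if x[pos:pos + len(kmer)] == kmer)


def poim_entry(z, j, length, weights):
    """Brute-force importance of k-mer z at position j (assumed definition, see lead-in).

    Enumerates all 4**length sequences, so it is only usable for very short toy sequences.
    """
    all_seqs = ["".join(s) for s in product(ALPHABET, repeat=length)]
    with_z = [s for s in all_seqs if s[j:j + len(z)] == z]
    mean_all = sum(score(s, weights) for s in all_seqs) / len(all_seqs)
    mean_with_z = sum(score(s, weights) for s in with_z) / len(with_z)
    return mean_with_z - mean_all


if __name__ == "__main__":
    # A single weighted feature: the 2-mer "AC" starting at position 1.
    toy_weights = {("AC", 1): 1.0}
    for z, j in [("AC", 1), ("A", 1), ("C", 2), ("G", 0), ("CA", 1)]:
        print(z, j, poim_entry(z, j, 4, toy_weights))
```

In this toy setting the 1-mer A at position 1 receives a positive importance although the scorer assigns it no weight of its own, because it overlaps the weighted 2-mer. A raw weight table would show zero for it.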

Availability: All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Example of WD kernel of order K=3. In the shown case, k(x, x′)=21β₁+8β₂+3β₃.
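The coefficients in the caption above can be read as counts of position-wise matches: 21 matching 1-mers, 8 matching 2-mers and 3 matching 3-mers between the two sequences. A minimal sketch of that computation follows. The function names are hypothetical, and the default weighting β_k = 2(K−k+1)/(K(K+1)) is an assumption taken from the common formulation of the weighted degree kernel; it is not stated in this record.

```python
def wd_match_counts(x, y, K):
    """For each order k = 1..K, count the positions where the k-mers of x and y agree."""
    if len(x) != len(y):
        raise ValueError("the WD kernel compares sequences of equal length")
    counts = []
    for k in range(1, K + 1):
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1))
        counts.append(matches)
    return counts


def wd_kernel(x, y, K, betas=None):
    """Weighted sum of the per-order match counts.

    If no weights are given, uses the assumed default beta_k = 2(K - k + 1) / (K(K + 1)).
    """
    if betas is None:
        betas = [2 * (K - k + 1) / (K * (K + 1)) for k in range(1, K + 1)]
    return sum(b * c for b, c in zip(betas, wd_match_counts(x, y, K)))


if __name__ == "__main__":
    # Two toy sequences that differ at a single position.
    print(wd_match_counts("GATTACA", "GATTTCA", 3))  # [6, 4, 2]
    print(wd_kernel("GATTACA", "GATTTCA", 3))
```

For the toy pair, the kernel value is 6β₁+4β₂+2β₃, in the same form as the expression in the caption.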

**Fig. 2.**
Substrings, superstrings, left partial overlaps and right partial overlaps: definition and examples for the string AATACGTAC.

**Fig. 3.**
The theorem which enables efficient POIM computation for zeroth-order Markov chains (Zien *et al.*, 2007).

**Fig. 4.**
Two (k−1)-mers are covered by a k-mer.

**Fig. 5.**
Comparison of different visualization techniques for the fixed-position-motifs experiment. Motifs GATTACA and AGTAGTG were inserted at positions 10 and 30, respectively, with a growing level of mutation (i.e. number of nucleotides randomly substituted in the motifs) from left to right. SVM classifiers were trained to distinguish random sequences from sequences with the (mutated) motifs GATTACA and AGTAGTG inserted. (**A–D**) We computed Differential POIMs [Equation (9)] for up to 8mers, from a WD-kernel SVM of order 20. Each panel displays the importance of k-mer lengths (y-axis) for k=1 … 8 at each position (x-axis, i=1 … 50) as a heat map. Red and yellow denote relevant motifs; dark blue corresponds to motifs not conveying information about the problem. 1mers are at the bottom of the plot, 8mers at the top. (**E–H**) The k-mer scoring overview (SVM-w) was computed using the same setup as for differential POIMs and is again displayed as a heat map. (**I–L**) The k-mer weighting was obtained using MKL (averaged weighting obtained using 100 bootstrap runs; Rätsch *et al.*, 2006). Again the result is displayed as a heat map, but for 1- to 7mers only. For a more detailed discussion see text.
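The data setup described in this caption can be sketched as follows. This is an illustrative generator under stated assumptions and not the authors' code: positions are treated as 0-based indices, background nucleotides are drawn uniformly, and each substitution always changes the nucleotide. All function names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutate(motif, n_mutations, rng):
    """Substitute n_mutations distinct positions of the motif with a different nucleotide."""
    chars = list(motif)
    for idx in rng.sample(range(len(chars)), n_mutations):
        chars[idx] = rng.choice([c for c in ALPHABET if c != chars[idx]])
    return "".join(chars)


def make_negative(length, rng):
    """Uniformly random background sequence."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def make_positive(length, motifs, n_mutations, rng):
    """Random sequence with a (mutated) copy of each motif written at its position.

    `motifs` maps a 0-based start position to the motif string (assumed convention).
    """
    seq = list(make_negative(length, rng))
    for pos, motif in motifs.items():
        seq[pos:pos + len(motif)] = mutate(motif, n_mutations, rng)
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    motifs = {10: "GATTACA", 30: "AGTAGTG"}
    # One positive and one negative example per mutation level.
    for n_mut in (0, 2, 4):
        print(n_mut, make_positive(50, motifs, n_mut, rng), make_negative(50, rng))
```

Raising `n_mutations` corresponds to moving from left to right across the panels: the inserted motifs become progressively harder to tell apart from the random background.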

**Fig. 6.**
(A) versus sequence logo (B) for motifs GATTACA and AGTAGTG at positions 10 and 30, respectively, with 4-out-of-7 mutations in the motifs.

**Fig. 7.**
Comparison of different visualization techniques for the varying-positions-motif experiment. The mutated motif GATTACA was inserted at positions 0 ± 13 in uniformly distributed sequences. (**A–C**) show the Differential POIM matrices [cf. Equation (9)] as a heat map, the POIM weight mass for different k=1 … 8 and the POIM k-mer diversity for k=3 as a heat map; (**D–F**) show the SVM-w overview plot as a heat map, the SVM-w weight mass also for k=1 … 8 and the k-mer diversity for k=3 as a heat map; (**G**) sequence logo.

**Fig. 8.**
Comparison of different visualization techniques for the *C. elegans* splice data set based on (**A–C**) POIM matrices versus (**D–F**) weight matrices. Position 0 is the splice site.

**Fig. 9.**
POIM visualization for the TSSs of *D. melanogaster*. (**A–C**): differential POIMs, POIM weight mass for k=1 … 8 and POIMs for k=3 are displayed.

**Fig. 10.**
POIM visualization for the *trans*-splicing control elements of *C. elegans*. (**A, B**) (A) displays Differential POIMs and (B) the POIM weight mass for k=1 … 8. (**C, D**) POIMs for k=2 and POIM k-mer diversity.

Cited by

Exploring sequence characteristics related to high-level production of secreted proteins in Aspergillus niger.
van den Berg BA, Reinders MJ, Hulsman M, Wu L, Pel HJ, Roubos JA, de Ridder D. PLoS One. 2012;7(10):e45869. doi: 10.1371/journal.pone.0045869. Epub 2012 Oct 1. PMID: 23049690. Free PMC article.
KIRMES: kernel-based identification of regulatory modules in euchromatic sequences.
Schultheiss SJ, Busch W, Lohmann JU, Kohlbacher O, Rätsch G. Bioinformatics. 2009 Aug 15;25(16):2126-33. doi: 10.1093/bioinformatics/btp278. Epub 2009 Apr 23. PMID: 19389732. Free PMC article.
Estimation of diffusion coefficients from voltammetric signals by support vector and gaussian process regression.
Bogdan M, Brugger D, Rosenstiel W, Speiser B. J Cheminform. 2014 May 28;6:30. doi: 10.1186/1758-2946-6-30. eCollection 2014. PMID: 24987463. Free PMC article.
Improving HIV coreceptor usage prediction in the clinic using hints from next-generation sequencing data.
Pfeifer N, Lengauer T. Bioinformatics. 2012 Sep 15;28(18):i589-i595. doi: 10.1093/bioinformatics/bts373. PMID: 22962486. Free PMC article.
Interpretable machine learning for genomics.
Watson DS. Hum Genet. 2022 Sep;141(9):1499-1513. doi: 10.1007/s00439-021-02387-9. Epub 2021 Oct 20. PMID: 34669035. Free PMC article.

References

1. Arkhipova IR. Promoter elements in D. melanogaster revealed by sequence analysis. Genetics. 1995;139:1359–1369.
2. Barash Y, et al. Modeling depend. in protein-DNA binding sites. In Proceedings of the 7th International Conference in Computational Molecular Biology (RECOMB). 2003.
3. Ben-Gal I, et al. Identification of transcription factor binding sites with variable-order bayesian networks. Bioinformatics. 2005;21:2657–2666.
4. Burke T, Kadonaga T. The downstream core promoter element, DPE, is conserved from Drosophila to humans and is recognized by TAFII60 of Drosophila. Genes Dev. 1997;11:3020–3031.
5. Chen T-M, et al. Prediction of splice sites with dependency graphs and their expanded bayesian networks. Bioinformatics. 2005;21:471–482.
