Bioinformatics. 2009 Aug 15;25(16):2126-33.

doi: 10.1093/bioinformatics/btp278. Epub 2009 Apr 23.

KIRMES: kernel-based identification of regulatory modules in euchromatic sequences

Sebastian J Schultheiss¹, Wolfgang Busch, Jan U Lohmann, Oliver Kohlbacher, Gunnar Rätsch

Affiliation

¹ Friedrich Miescher Laboratory of the Max Planck Society, and Max Planck Institute for Developmental Biology, Tübingen, Germany. sebi@tuebingen.mpg.de

PMID: 19389732
PMCID: PMC2722996
DOI: 10.1093/bioinformatics/btp278

Abstract

Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms that use Gibbs sampling or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to identifying combinations of several degenerate binding sites, such as those often found in cis-regulatory modules.

Results: We propose a new algorithm that combines the benefits of existing motif finding with those of support vector machines (SVMs) to find degenerate motifs and thereby improve the modeling of regulatory modules. In experiments on microarray data from Arabidopsis thaliana, we were able to show that the newly developed strategy significantly improves the recognition of TF targets.

Availability: The Python source code (open source, licensed under the GPL), the data for the experiments and a Galaxy-based web service are available at http://www.fml.mpg.de/raetsch/suppl/kirmes/.


Figures

**Fig. 1.**
The idea behind the RM kernel: a motif finder is applied to the regulatory sequences in the input set (long, dark bars), which identifies overrepresented motifs (short, light bars). The best matching motifs (boxed) in every sequence serve as starting points, where we excise a window of 20 bp around the center of each motif occurrence for the WDSC kernel. Conservation information for these windows is looked up in a precomputed multiple genome alignment (*cf.* Section S.2 of the Supplementary Material for details on conservation data). Additionally, we construct an input vector for the RBF kernel of the pairwise motif distance, and distance to the transcription start (if available).
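
The window excision and distance features described in this caption can be sketched roughly as follows. This is a minimal illustration and not the KIRMES implementation: the function names `excise_window` and `distance_features`, the `N` padding at sequence edges, and the use of absolute distances are all assumptions made for the example.

```python
from itertools import combinations

WINDOW = 20  # window length in bp, as stated in the caption


def excise_window(seq, center, width=WINDOW, pad="N"):
    """Return `width` bases of `seq` centred on index `center`.

    Positions outside the sequence are filled with `pad`. The padding
    is an assumption; the caption does not say how edges are handled.
    """
    start = center - width // 2
    return "".join(
        seq[i] if 0 <= i < len(seq) else pad
        for i in range(start, start + width)
    )


def distance_features(centers, tss=None):
    """Pairwise distances between motif centres, followed by the distance
    of each centre to the transcription start site when `tss` is given."""
    feats = [abs(a - b) for a, b in combinations(centers, 2)]
    if tss is not None:
        feats += [abs(tss - c) for c in centers]
    return feats
```

For a hypothetical 40 bp sequence with motif centres at positions 5, 20 and 32, `excise_window` yields one 20 bp string per motif for the sequence kernel, and `distance_features` yields the numeric vector for the RBF kernel.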

**Fig. 2.**
A cartoon workflow of the KIRMES pipeline: the preprocessing step requires the genomic sequence and a set of regulatory sequences from genes that were determined to be co-expressed in microarray experiments, and ideally a negative set. KIRMES conducts a motif finding step, where it locates the positions of overrepresented motifs in FASTA files of the genes' regulatory regions. For the classification, we build an input vector with sequence sections of 20 bp, centered around the motif positions obtained during the motif finding step, and optional conservation information from related genome sequences for the WDSC kernel, as described in Section 2.3. The classifier is trained on the labeled dataset of positives and negatives and can then be applied repeatedly to unlabeled prediction datasets to classify genes as co-regulated by the same mechanism as the input dataset or not.
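
To make the kernel step of this workflow more concrete, here is a simplified sketch of how a similarity between two genes could be computed from their motif windows and distance features. It is not the WDSC kernel itself: shifts and conservation weighting are omitted, and summing the per-motif sequence kernels with the RBF kernel on the distance features (mentioned for Fig. 1) is an assumption of this example. The names `wd_kernel`, `rbf_kernel`, `combined_kernel` and the default `degree` and `sigma` values are hypothetical.

```python
import math


def wd_kernel(x, y, degree=3):
    """Plain weighted degree kernel: weighted count of k-mers (k <= degree)
    that match at the same position in two equal-length windows."""
    if len(x) != len(y):
        raise ValueError("windows must have equal length")
    total = 0.0
    for d in range(1, degree + 1):
        # Standard WD weighting: shorter k-mers receive larger weights.
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total


def rbf_kernel(u, v, sigma=10.0):
    """Gaussian RBF kernel on two numeric feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def combined_kernel(gene_a, gene_b, degree=3, sigma=10.0):
    """Sum of normalised WD kernels over corresponding motif windows,
    plus an RBF kernel on the distance vectors (assumed combination)."""
    windows_a, dists_a = gene_a
    windows_b, dists_b = gene_b
    k = 0.0
    for wa, wb in zip(windows_a, windows_b):
        norm = math.sqrt(wd_kernel(wa, wa, degree) * wd_kernel(wb, wb, degree))
        k += wd_kernel(wa, wb, degree) / norm
    return k + rbf_kernel(dists_a, dists_b, sigma)
```

A kernel matrix built by evaluating `combined_kernel` on all pairs of labeled genes could then be passed to any SVM implementation that accepts precomputed kernels.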

**Fig. 3.**
(A) Accuracy of the spectrum and WDS kernels: the prediction is rarely better than random guessing for these kernels, which are not well suited to this particular problem. The names of the gene sets are derived from the TAIR7 annotation by Swarbreck,D. *et al.* (2007) and are explained in detail in Section S.1 of the Supplementary Material. (B) Accuracy of variations of the KIRMES approach: this graph compares the basic kernels and the conservation kernels (C) combined with two different motif generation approaches, oligo-counting (Oligo) or Gibbs sampling (Gibbs). The average performance (μ) is given for each kernel variant. The first set is taken from a control experiment, where no overrepresented motifs should occur.

**Fig. 4.**
Comparison between the contributions of each feature type of the input vector. A dataset of 42 positive and 1562 negative genes from the *A. thaliana* experiments was used.

**Fig. 5.**
Comparison between the Gibbs sampler PRIORITY and the KIRMES approach for the task of identifying genes regulated by a TF. The box plot shows the average auROC on the 268 gene sets, giving the minimum, maximum, median, first and third quartile values (cf. Sections 3.4 and 4.3).
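
For readers unfamiliar with the metric, the area under the ROC curve (auROC) reported here can be computed from classifier scores and binary labels as the fraction of positive/negative pairs that are ranked correctly. The sketch below is a generic illustration of the metric, not the evaluation code used in the paper; the function name `au_roc` and the example scores are made up.

```python
def au_roc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    `labels` holds 1 for positives and anything else for negatives.
    Tied positive/negative scores count as half a correct ranking.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives. The pairwise loop is quadratic in the number of genes, which is acceptable for small gene sets; a rank-based formulation would scale better.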


