Bioinformatics. 2009 Aug 15;25(16):2126-33.
doi: 10.1093/bioinformatics/btp278. Epub 2009 Apr 23.

KIRMES: kernel-based identification of regulatory modules in euchromatic sequences


Sebastian J Schultheiss et al. Bioinformatics. 2009.

Abstract

Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached with position weight matrix-based motif identification algorithms that use Gibbs sampling or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail at identifying combinations of several degenerate binding sites, such as those often found in cis-regulatory modules.

Results: We propose a new algorithm that combines the benefits of existing motif finding with those of support vector machines (SVMs) to find degenerate motifs and thereby improve the modeling of regulatory modules. In experiments on microarray data from Arabidopsis thaliana, we were able to show that the newly developed strategy significantly improves the recognition of TF targets.

Availability: The Python source code (open source, licensed under the GPL), the data for the experiments and a Galaxy-based web service are available at http://www.fml.mpg.de/raetsch/suppl/kirmes/.


Figures

Fig. 1.
The idea behind the RM kernel: a motif finder is applied to the regulatory sequences in the input set (long, dark bars) and identifies overrepresented motifs (short, light bars). The best matching motifs (boxed) in every sequence serve as starting points, where we excise a window of 20 bp around the center of each motif occurrence for the WDSC kernel. Conservation information for these windows is looked up in a precomputed multiple genome alignment (cf. Section S.2 of the Supplementary Material for details on conservation data). Additionally, we construct an input vector for the RBF kernel from the pairwise motif distances and the distance to the transcription start (if available).
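The feature construction described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the KIRMES implementation: the function names, the toy sequence, the motif positions and the handling of windows near the sequence ends are all assumptions made for the example, and conservation lookup is left out.

```python
from itertools import combinations

WINDOW = 20  # window length in bp, as stated in the caption


def excise_window(sequence, motif_center, width=WINDOW):
    """Return a `width`-bp window centred on `motif_center` (0-based index).

    Assumption: a window that would run past either end of the sequence is
    shifted inward so that it keeps its full length. The caption does not
    say how such edge cases are handled.
    """
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the window")
    start = motif_center - width // 2
    start = max(0, min(start, len(sequence) - width))
    return sequence[start:start + width]


def distance_features(motif_centers, tss_position=None):
    """Pairwise distances between motif centres, followed by the distance
    of each motif to the transcription start if its position is known."""
    features = [abs(a - b) for a, b in combinations(motif_centers, 2)]
    if tss_position is not None:
        features += [tss_position - center for center in motif_centers]
    return features


# Toy data (hypothetical): a 100-bp regulatory sequence with three motif hits.
sequence = "ACGT" * 25
centers = [12, 40, 77]

windows = [excise_window(sequence, c) for c in centers]   # string-kernel input
features = distance_features(centers, tss_position=100)  # RBF-kernel input
```

In this sketch the windows would feed a string kernel and the distance vector an RBF kernel; how the two kernels are combined is not specified in the caption and is not shown here.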
Fig. 2.
A cartoon workflow of the KIRMES pipeline: the preprocessing step requires the genomic sequence and a set of regulatory sequences from genes that were determined to be co-expressed in microarray experiments, and ideally a negative set. KIRMES conducts a motif finding step, where it locates the positions of overrepresented motifs in FASTA files of the genes' regulatory regions. For the classification, we build an input vector with sequence sections of 20 bp, centered around the motif positions obtained during the motif finding step, and optional conservation information from related genome sequences for the WDSC kernel, as described in Section 2.3. The classifier is trained on the labeled dataset of positives and negatives and can then be applied repeatedly to unlabeled prediction datasets to classify genes as co-regulated by the same mechanism as the input dataset or not.
Fig. 3.
(A) Accuracy of the spectrum and WDS kernels: the prediction is rarely better than random guessing for these kernels, which are not well suited to this particular problem. The names of the gene sets are derived from the TAIR7 annotation by Swarbreck, D. et al. (2007) and are explained in detail in Section S.1 of the Supplementary Material. (B) Accuracy of variations of the KIRMES approach: this graph compares the basic kernels and the conservation kernels (C), each combined with two different motif generation approaches: oligo-counting (Oligo) or Gibbs sampling (Gibbs). The average performance (μ) is given for each kernel variant. The first set is taken from a control experiment, where no overrepresented motifs should occur.
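For readers unfamiliar with the spectrum kernel used as a baseline in panel (A), the following Python sketch shows its standard textbook definition: the inner product of the k-mer count vectors of two sequences. The value of k and the absence of normalization are assumptions for illustration; the settings used in the experiments are not given on this page.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Unnormalized spectrum kernel: sum over shared k-mers of the product
    of their counts in the two sequences. k=3 is an arbitrary default."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


similarity = spectrum_kernel("ACGTACGT", "ACGT", k=2)
```

Because this kernel only counts k-mers and ignores where they occur, it cannot use the positional information that the motif-centered windows provide.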
Fig. 4.
Comparison between the contributions of each feature type of the input vector. A dataset of 42 positive and 1562 negative genes from the A. thaliana experiments was used.
Fig. 5.
Comparison between the Gibbs sampler PRIORITY and the KIRMES approach for the task of identifying genes regulated by a TF. The box plot shows the average auROC on the 268 gene sets, giving the minimum, maximum, median, first and third quartile values (cf. Sections 3.4 and 4.3).
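The auROC (area under the receiver operating characteristic curve) reported in this comparison can be computed from classifier scores as the fraction of positive/negative pairs in which the positive gene receives the higher score, with ties counted as one half. The Python sketch below uses that pairwise definition; the labels and scores are made-up toy values, and the evaluation protocol of the paper (for example, how scores are averaged over gene sets) is not reproduced.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for positive genes, anything else for negatives.
    scores: classifier outputs, higher meaning "more likely positive".
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores for two positive and two negative genes).
toy_auc = au_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The quadratic loop is fine for a sketch; a rank-based formulation would be preferable for large gene sets.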

