A study on the application of topic models to motif finding algorithms

Josep Basha Gutierrez^{1

2}, Kenta Nakai^{3

4}

Affiliations

¹ Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan.
² Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639, Tokyo, Japan.
³ Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan. knakai@ims.u-tokyo.ac.jp.
⁴ Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639, Tokyo, Japan. knakai@ims.u-tokyo.ac.jp.

PMID: 28155646
PMCID: PMC5259985
DOI: 10.1186/s12859-016-1364-3

A study on the application of topic models to motif finding algorithms

Josep Basha Gutierrez et al. BMC Bioinformatics. 2016.

. 2016 Dec 22;17(Suppl 19):502.

doi: 10.1186/s12859-016-1364-3.

Authors

Josep Basha Gutierrez^{1

2}, Kenta Nakai^{3

4}

Affiliations

¹ Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan.
² Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639, Tokyo, Japan.
³ Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan. knakai@ims.u-tokyo.ac.jp.
⁴ Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639, Tokyo, Japan. knakai@ims.u-tokyo.ac.jp.

PMID: 28155646
PMCID: PMC5259985
DOI: 10.1186/s12859-016-1364-3

Abstract

Background: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients.

Results: The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level.

Conclusions: The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences.

PubMed Disclaimer

Figures

**Fig. 1**
Representation of a topic model adapted to the motif finding problem. Representation of a topic model adapted to the motif finding problem. This figure shows the basic structure of a topic model (in this case, a LDA). The terms specific for the case of the motif finding problem are stated in red under the original ones in blue, showing that the motif finding problem can be represented by a topic model by describing the DNA sequences as documents, the instances of each given motif as words in those documents, and the motifs as clusters of words or topics

**Fig. 2**
Statistical GA algorithm workflow after including the use of topic models. This figure describes the updated flow of the Statistical GA algorithm after adding the perplexity measurement for the selection of solutions. There are four steps in this flow: the first one, in which the candidate instances are selected by the original Statistical GA; the second one, in which these instances are clustered attending to their similarity calculated by their hamming distance; the third one, which consists of building the CTM and measuring its perplexity, and the last step, which consists of reporting the motif if the perplexity calculated in the previous step is lower than 100

**Fig. 3**
Average statistical values for all 56 data sets. This figure shows the average scores obtained by each one of the tools studied for each one of seven different statistics for all the 56 data sets of the benchmark. The Statistical GA method is shown as GA approach, and the CTM method as CTM approach

**Fig. 4**
Average statistical values for each organism. This figure shows the average scores obtained by each one of the tools studied for each one of seven different statistics grouped by the four different species contained in the data sets. The Statistical GA method is shown as GA approach, and the CTM method as CTM approach

**Fig. 5**
Comparison of statistics for the Statistical GA method before and after filtering by perplexity. Here it is shown the improvement in the average scores of the Statistical GA method for seven different statistics obtained by the addition of the filtering of the results by their perplexity in the corresponding CTM

See this image and copyright information in PMC

References

1. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23(1):137–47. doi: 10.1038/nbt1053. - DOI - PubMed
1. Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinf. 2007;8(7):1. - PMC - PubMed
1. Blei DM. Probabilistic topic models. Commun. ACM. 2012;55(4):77–84. doi: 10.1145/2133806.2133826. - DOI
1. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J. Mach. Learn. Res. 2003;3(Jan):993–1022.
1. Gutierrez JB, Frith M, Nakai K. A Genetic Algorithm for Motif Finding Based on Statistical Significance. In: International Conference on Bioinformatics and Biomedical Engineering. Granada: Springer International Publishing; 2015. p. 438–49.

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

A study on the application of topic models to motif finding algorithms

Affiliations

A study on the application of topic models to motif finding algorithms

Authors

Affiliations

Abstract

Figures

References

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Other Literature Sources