Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Dec 22;17(Suppl 19):502.
doi: 10.1186/s12859-016-1364-3.

A study on the application of topic models to motif finding algorithms

Affiliations

A study on the application of topic models to motif finding algorithms

Josep Basha Gutierrez et al. BMC Bioinformatics. .

Abstract

Background: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients.

Results: The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level.

Conclusions: The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
Representation of a topic model adapted to the motif finding problem. Representation of a topic model adapted to the motif finding problem. This figure shows the basic structure of a topic model (in this case, a LDA). The terms specific for the case of the motif finding problem are stated in red under the original ones in blue, showing that the motif finding problem can be represented by a topic model by describing the DNA sequences as documents, the instances of each given motif as words in those documents, and the motifs as clusters of words or topics
Fig. 2
Fig. 2
Statistical GA algorithm workflow after including the use of topic models. This figure describes the updated flow of the Statistical GA algorithm after adding the perplexity measurement for the selection of solutions. There are four steps in this flow: the first one, in which the candidate instances are selected by the original Statistical GA; the second one, in which these instances are clustered attending to their similarity calculated by their hamming distance; the third one, which consists of building the CTM and measuring its perplexity, and the last step, which consists of reporting the motif if the perplexity calculated in the previous step is lower than 100
Fig. 3
Fig. 3
Average statistical values for all 56 data sets. This figure shows the average scores obtained by each one of the tools studied for each one of seven different statistics for all the 56 data sets of the benchmark. The Statistical GA method is shown as GA approach, and the CTM method as CTM approach
Fig. 4
Fig. 4
Average statistical values for each organism. This figure shows the average scores obtained by each one of the tools studied for each one of seven different statistics grouped by the four different species contained in the data sets. The Statistical GA method is shown as GA approach, and the CTM method as CTM approach
Fig. 5
Fig. 5
Comparison of statistics for the Statistical GA method before and after filtering by perplexity. Here it is shown the improvement in the average scores of the Statistical GA method for seven different statistics obtained by the addition of the filtering of the results by their perplexity in the corresponding CTM

References

    1. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23(1):137–47. doi: 10.1038/nbt1053. - DOI - PubMed
    1. Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinf. 2007;8(7):1. - PMC - PubMed
    1. Blei DM. Probabilistic topic models. Commun. ACM. 2012;55(4):77–84. doi: 10.1145/2133806.2133826. - DOI
    1. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J. Mach. Learn. Res. 2003;3(Jan):993–1022.
    1. Gutierrez JB, Frith M, Nakai K. A Genetic Algorithm for Motif Finding Based on Statistical Significance. In: International Conference on Bioinformatics and Biomedical Engineering. Granada: Springer International Publishing; 2015. p. 438–49.

Substances

LinkOut - more resources