SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets

Qiang Yu¹, Dingbang Wei¹, Hongwei Huo²

Affiliations

¹ School of Computer Science and Technology, Xidian University, Xi'an, 710071, China.
² School of Computer Science and Technology, Xidian University, Xi'an, 710071, China. hwhuo@mail.xidian.edu.cn.

PMID: 29914360
PMCID: PMC6006848
DOI: 10.1186/s12859-018-2242-y

Qiang Yu et al. BMC Bioinformatics. 2018 Jun 18;19(1):228. doi: 10.1186/s12859-018-2242-y.

Abstract

Background: Given a set of t DNA sequences, each of length n, a quorum q satisfying 0 < q ≤ 1, and integers l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds the l-length strings that occur in at least qt input sequences with up to d mismatches. It is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms can efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time-consuming for large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more.
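The qPMS definition above can be made concrete with a brute-force sketch. This is only an illustration of the problem statement, not one of the existing qPMS algorithms and not SamSelect: it enumerates all 4^l candidate strings, so it is usable only for very small l. The function names (`hamming`, `occurs`, `qpms_brute_force`) are hypothetical and chosen for this example.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, seq, d):
    """True if some l-length window of seq is within d mismatches of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def qpms_brute_force(seqs, l, d, q):
    """Return every l-length DNA string that occurs, with up to d mismatches,
    in at least q * t of the t input sequences (exhaustive search).

    Assumption: q is a float; pass fractions.Fraction instead if an exact
    threshold matters for a particular q and t.
    """
    threshold = q * len(seqs)
    found = []
    for cand in product("ACGT", repeat=l):
        motif = "".join(cand)
        support = sum(occurs(motif, s, d) for s in seqs)
        if support >= threshold:
            found.append(motif)
    return found


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACGATT", "GGGGGGGG", "CCACGTCC"]
    # ACGT occurs in three of the four sequences with at most one mismatch.
    print("ACGT" in qpms_brute_force(seqs, l=4, d=1, q=0.75))
```

Raising q toward 1 shrinks the set of reported strings, because each candidate must then be supported by more of the input sequences.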

Results: We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D.

Conclusions: We improve the ability of existing qPMS algorithms to process large DNA datasets by selecting high-quality sample sequence sets, so that the qPMS algorithms can find motifs quickly in the selected sample sequence set D' rather than take an infeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.

Keywords: Quorum planted motif search; Sample sequences; Transcription factor binding sites.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Illustration of word count with up to k mismatches.

**Fig. 2**
Illustration of the process of obtaining high-frequency substrings. N_r and N_m denote *count*_k(w-mer) for a background substring and for a motif instance, respectively, in the random case.
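A minimal sketch of a word count with mismatches may help in reading the two captions above. The captions do not define *count*_k precisely, so this example assumes that it is the number of w-length windows, over all sequences, that lie within Hamming distance k of the given w-mer. The function name `count_k` mirrors the caption's notation, but the implementation is a hypothetical illustration rather than the authors' code.

```python
def count_k(word, seqs, k):
    """Count the windows in seqs that match word with at most k mismatches.

    Assumed definition: every w-length window of every sequence is compared
    with word (w = len(word)) and counted if its Hamming distance is <= k.
    """
    w = len(word)
    total = 0
    for s in seqs:
        for i in range(len(s) - w + 1):
            mismatches = sum(a != b for a, b in zip(word, s[i:i + w]))
            if mismatches <= k:
                total += 1
    return total


if __name__ == "__main__":
    seqs = ["ACGTAC", "ACCT"]
    print(count_k("ACG", seqs, 0))  # exact occurrences only
    print(count_k("ACG", seqs, 1))  # also windows with one mismatch
```

Under this assumed definition, a w-mer taken from a motif instance would tend to have a higher count than a background w-mer, which is the contrast between N_m and N_r that the Fig. 2 caption draws.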

**Fig. 3**
Results on the eight *Homo sapiens* datasets selected from the ENCODE TF ChIP-seq data.

**Fig. 4**
Results on the 12 mouse datasets in the mESC data.

Cited by

Chen Q, Li Y, Lin C, Chen L, Luo H, Xia S, Liu C, Cheng X, Liu C, Li J, Dou D. Expanding the DNA-encoded library toolbox: identifying small molecules targeting RNA. Nucleic Acids Res. 2022 Jul 8;50(12):e67. doi: 10.1093/nar/gkac173. PMID: 35288754.
Sun CX, Yang Y, Wang H, Wang WH. A Clustering Approach for Motif Discovery in ChIP-Seq Dataset. Entropy (Basel). 2019 Aug 16;21(8):802. doi: 10.3390/e21080802. PMID: 33267515.
