An Efficient Exact Algorithm for Planted Motif Search on Large DNA Sequence Datasets
- PMID: 38801693
- DOI: 10.1109/TCBB.2024.3404136
An Efficient Exact Algorithm for Planted Motif Search on Large DNA Sequence Datasets
Abstract
DNA motif is the pattern shared by similar fragments in DNA sequences, which plays a key role in regulating gene expression, and DNA motif discovery has become a key research topic. Exact planted ( l, d )-motif search (PMS) is one of the motif discovery approaches, which aims to find from t sequences all the ( l, d )-motifs that are motifs of l length appearing in at least qt sequences with at most d mismatches. The existing exact PMS algorithms are only suitable for small datasets of DNA sequences. The development of high-throughput sequencing technology generates vast amount of DNA sequence data, which brings challenges to solving exact PMS problems efficiently. Therefore, we propose an efficient exact PMS algorithm called PMmotif for large datasets of DNA sequences, after analyzing the time complexity of the existing exact PMS algorithms. PMmotif finds ( l, d )-motifs with strategy by searching the branches on the pattern tree that may contain ( l, d )-motifs. It is verified by experiments that the running time ratio of some existing excellent PMS algorithms to PMmotif is between 14.83 and 58.94. In addition, for the first time, PMmotif can solve the ( 15,5 )and ( 17,6 ) challenge problem instances on large DNA sequence datasets (3000 sequences of length 200) within 24 hours.
Similar articles
-
SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.BMC Bioinformatics. 2018 Jun 18;19(1):228. doi: 10.1186/s12859-018-2242-y. BMC Bioinformatics. 2018. PMID: 29914360 Free PMC article.
-
Efficient sequential and parallel algorithms for planted motif search.BMC Bioinformatics. 2014 Jan 31;15:34. doi: 10.1186/1471-2105-15-34. BMC Bioinformatics. 2014. PMID: 24479443 Free PMC article.
-
Improved Exact Enumerative Algorithms for the Planted (l, d)-Motif Search Problem.IEEE/ACM Trans Comput Biol Bioinform. 2014 Mar-Apr;11(2):361-74. doi: 10.1109/TCBB.2014.2306842. IEEE/ACM Trans Comput Biol Bioinform. 2014. PMID: 26355783
-
An extension and novel solution to the (l,d)-motif challenge problem.Genome Inform. 2004;15(2):63-71. Genome Inform. 2004. PMID: 15706492 Review.
-
Discovering sequence motifs.Methods Mol Biol. 2008;452:231-51. doi: 10.1007/978-1-60327-159-2_12. Methods Mol Biol. 2008. PMID: 18566768 Review.
References
Publication types
MeSH terms
Substances
LinkOut - more resources
Full Text Sources