Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Sep-Oct;21(5):1542-1551.
doi: 10.1109/TCBB.2024.3404136. Epub 2024 Oct 9.

An Efficient Exact Algorithm for Planted Motif Search on Large DNA Sequence Datasets

An Efficient Exact Algorithm for Planted Motif Search on Large DNA Sequence Datasets

Qiang Yu et al. IEEE/ACM Trans Comput Biol Bioinform. 2024 Sep-Oct.

Abstract

DNA motif is the pattern shared by similar fragments in DNA sequences, which plays a key role in regulating gene expression, and DNA motif discovery has become a key research topic. Exact planted ( l, d )-motif search (PMS) is one of the motif discovery approaches, which aims to find from t sequences all the ( l, d )-motifs that are motifs of l length appearing in at least qt sequences with at most d mismatches. The existing exact PMS algorithms are only suitable for small datasets of DNA sequences. The development of high-throughput sequencing technology generates vast amount of DNA sequence data, which brings challenges to solving exact PMS problems efficiently. Therefore, we propose an efficient exact PMS algorithm called PMmotif for large datasets of DNA sequences, after analyzing the time complexity of the existing exact PMS algorithms. PMmotif finds ( l, d )-motifs with strategy by searching the branches on the pattern tree that may contain ( l, d )-motifs. It is verified by experiments that the running time ratio of some existing excellent PMS algorithms to PMmotif is between 14.83 and 58.94. In addition, for the first time, PMmotif can solve the ( 15,5 )and ( 17,6 ) challenge problem instances on large DNA sequence datasets (3000 sequences of length 200) within 24 hours.

PubMed Disclaimer

Similar articles

References

Publication types

MeSH terms