Bioinformatics. 2023 Apr 3;39(4):btad141.
doi: 10.1093/bioinformatics/btad141.

dipwmsearch: a Python package for searching di-PWM motifs

Marie Mille et al. Bioinformatics. 2023.

Abstract

Motivation: Searching for probabilistic motifs in a sequence is a common task when annotating putative transcription factor binding sites or other RNA/DNA binding sites. Useful motif representations include position weight matrices (PWMs), dinucleotide PWMs (di-PWMs), and hidden Markov models (HMMs). Di-PWMs keep the simplicity of PWMs (a matrix form and a cumulative scoring function) while also incorporating the dependency between adjacent positions in the motif, which PWMs disregard. For instance, the HOCOMOCO database provides di-PWM motifs derived from experimental data to represent binding sites. Currently, two programs, SPRy-SARUS and MOODS, can search for occurrences of di-PWMs in sequences.
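The cumulative scoring function mentioned above can be sketched as follows. This is only an illustration: the list-of-dictionaries layout, the names `score_word` and `toy_dipwm`, and the score values are assumptions made for the example, not the data model or API of any of the tools named in the abstract.

```python
from typing import Dict, List

# Assumed layout: a di-PWM for a motif of length m has m - 1 columns,
# and each column maps a dinucleotide (e.g. "AC") to a score.
DiPWM = List[Dict[str, float]]


def score_word(dipwm: DiPWM, word: str) -> float:
    """Sum the scores of the overlapping dinucleotides of `word`."""
    if len(word) != len(dipwm) + 1:
        raise ValueError("word length must equal number of columns + 1")
    return sum(column[word[i:i + 2]] for i, column in enumerate(dipwm))


def toy_dipwm() -> DiPWM:
    """Hypothetical 2-column di-PWM (motif length 3) that favours 'ACG'."""
    alphabet = "ACGT"
    col0 = {a + b: -1.0 for a in alphabet for b in alphabet}
    col1 = dict(col0)
    col0["AC"] = 2.0  # made-up scores, for illustration only
    col1["CG"] = 3.0
    return [col0, col1]
```

Because adjacent dinucleotides overlap by one letter, the score of position i depends on the letter chosen at position i - 1, which is the dependency a plain PWM cannot express.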

Results: We propose a Python package called dipwmsearch, which provides an original and efficient algorithm for this task: it first enumerates the words that match the di-PWM, and then searches for all of them at once in the sequence, even if the sequence contains IUPAC codes. The user benefits from easy installation via PyPI or conda, comprehensive documentation, and executable scripts that facilitate the use of di-PWMs.
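A rough sketch of the enumerate-then-search idea, under simplifying assumptions: matching words are enumerated by brute force, and the "all at once" search is replaced by a set lookup on each window, with IUPAC codes expanded into the bases they stand for. The function names, the toy matrix, and the `>=` treatment of the threshold are hypothetical; this is not the dipwmsearch API or its actual algorithm.

```python
from itertools import product
from typing import Dict, Iterator, List, Set, Tuple

DiPWM = List[Dict[str, float]]  # assumed: one dinucleotide -> score dict per column

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def toy_dipwm() -> DiPWM:
    """Hypothetical 2-column di-PWM (motif length 3) that favours 'ACG'."""
    col0 = {a + b: -1.0 for a in "ACGT" for b in "ACGT"}
    col1 = dict(col0)
    col0["AC"] = 2.0  # made-up scores
    col1["CG"] = 3.0
    return [col0, col1]


def valid_words(dipwm: DiPWM, threshold: float) -> Set[str]:
    """Brute-force enumeration of all words scoring at least `threshold`."""
    m = len(dipwm) + 1
    words = set()
    for letters in product("ACGT", repeat=m):
        word = "".join(letters)
        score = sum(col[word[i:i + 2]] for i, col in enumerate(dipwm))
        if score >= threshold:  # assumption: the threshold itself counts as a match
            words.add(word)
    return words


def search(sequence: str, words: Set[str], m: int) -> Iterator[Tuple[int, str]]:
    """Yield (position, word) for every window compatible with a valid word."""
    for pos in range(len(sequence) - m + 1):
        window = sequence[pos:pos + m].upper()
        # Expand IUPAC codes into every concrete word the window may denote.
        for letters in product(*(IUPAC[c] for c in window)):
            candidate = "".join(letters)
            if candidate in words:
                yield pos, candidate
```

For example, `list(search("TTACGNCG", valid_words(toy_dipwm(), 4.0), 3))` reports `ACG` at positions 2 and 5, the second match coming from the window `NCG`. Expanding every ambiguous window is exponential in the number of ambiguous letters, so this only illustrates the semantics, not an efficient search.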

Availability and implementation: dipwmsearch is available at https://pypi.org/project/dipwmsearch/ and https://gite.lirmm.fr/rivals/dipwmsearch/ under the CeCILL license.

Figure 1
(a) Enumeration and scanning strategy for a di-PWM. The left part shows how the scores of two words are computed by summing the scores of their five dinucleotides. If the score lies above the threshold, the word is valid and is added to the list for later search. In the right part, we build an Aho–Corasick automaton from all valid words in the list, then use the automaton to scan the sequence. (b) Illustration of the branch-and-bound strategy for the enumeration procedure. We build a trie for words starting with the letter A and explore it in a depth-first manner. As soon as a prefix cannot give rise to a valid word, which is determined using the LookAheadMatrix (LAM), we cut the corresponding branch. Only valid words generate a leaf in the trie. (c) Comparison of SPRy-SARUS and dipwmsearch for searching all human di-PWMs from HOCOMOCO on human chromosome 15. The violin plot shows the running times over all di-PWMs, and their median, for both tools.
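The strategy in panels (a) and (b) can be sketched in plain Python as below. Everything here is an assumption made for illustration: the layout of the look-ahead table (best achievable score of the remaining columns, given the last letter of the prefix), the function names, the toy matrix, and the `>=` treatment of the threshold. It is not the package's implementation or API.

```python
from collections import deque
from typing import Dict, List, Tuple

ALPHABET = "ACGT"
DiPWM = List[Dict[str, float]]  # assumed: one dinucleotide -> score dict per column

def toy_dipwm() -> DiPWM:
    """Hypothetical 2-column di-PWM (motif length 3) that favours 'ACG'."""
    col0 = {a + b: -1.0 for a in ALPHABET for b in ALPHABET}
    col1 = dict(col0)
    col0["AC"], col1["CG"] = 2.0, 3.0  # made-up scores
    return [col0, col1]

def look_ahead(dipwm: DiPWM) -> List[Dict[str, float]]:
    """lam[i][a]: best score of columns i onward, given letter a at position i."""
    m = len(dipwm) + 1
    lam = [dict.fromkeys(ALPHABET, 0.0) for _ in range(m)]
    for i in range(m - 2, -1, -1):
        for a in ALPHABET:
            lam[i][a] = max(dipwm[i][a + b] + lam[i + 1][b] for b in ALPHABET)
    return lam

def enumerate_valid_words(dipwm: DiPWM, threshold: float) -> List[str]:
    """Depth-first enumeration that cuts prefixes unable to reach `threshold`."""
    m, lam, found = len(dipwm) + 1, look_ahead(dipwm), []

    def dfs(prefix: str, score: float) -> None:
        i = len(prefix) - 1
        if score + lam[i][prefix[-1]] < threshold:
            return  # bound: even the best completion falls short
        if len(prefix) == m:
            found.append(prefix)  # leaf of the trie: a valid word
            return
        for b in ALPHABET:
            dfs(prefix + b, score + dipwm[i][prefix[-1] + b])

    for a in ALPHABET:
        dfs(a, 0.0)
    return found

def build_automaton(words: List[str]):
    """Aho-Corasick automaton: trie transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def scan(sequence: str, automaton) -> List[Tuple[int, str]]:
    """Report (start position, word) for every occurrence in one pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for end, ch in enumerate(sequence):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((end - len(w) + 1, w) for w in out[state])
    return hits
```

With the toy matrix and a threshold of 4.0, only `ACG` survives the enumeration, and `scan("TTACGACG", build_automaton(["ACG"]))` finds it at positions 2 and 5. This sketch handles plain A/C/G/T sequences only; IUPAC codes would need extra handling.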

References

    1. Aho A, Corasick M. Efficient string matching: an aid to bibliographic search. Commun ACM 1975;18:333–40.
    2. Beckstette M, Homann R, Giegerich R. et al. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 2006;7:389. doi: 10.1186/1471-2105-7-389.
    3. Korhonen J, Martinmäki P, Pizzi C. et al. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 2009;25:3181–2.
    4. Korhonen JH, Palin K, Taipale J. et al. Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics 2017;33:514–21.
    5. Kulakovskiy I, Levitsky V, Oshchepkov D. et al. From binding motifs in ChIP-seq data to improved models of transcription factor binding sites. J Bioinform Comput Biol 2013;11:1340004.
