Bioinformatics. 2021 Sep 29;37(18):2834-2840.

doi: 10.1093/bioinformatics/btab203.

STREME: accurate and versatile sequence motif discovery

Timothy L Bailey¹

PMID: 33760053
PMCID: PMC8479671
DOI: 10.1093/bioinformatics/btab203

Timothy L Bailey. Bioinformatics. 2021.

Affiliation

¹ Department of Pharmacology, University of Nevada, Reno, NV 89557, USA.

Abstract

Motivation: Sequence motif discovery algorithms can identify novel sequence patterns that perform biological functions in DNA, RNA and protein sequences, for example the binding site motifs of DNA- and RNA-binding proteins.

Results: The STREME algorithm presented here advances the state of the art in ab initio motif discovery in terms of both accuracy and versatility. Using in vivo DNA (ChIP-seq) and RNA (CLIP-seq) data, and validating motifs with reference motifs derived from in vitro data, we show that STREME is more accurate, sensitive and thorough than several widely used algorithms (DREME, HOMER, MEME, Peak-motifs) and two other representative algorithms (ProSampler and Weeder). STREME's capabilities include the ability to find motifs in datasets with hundreds of thousands of sequences, to find both short and long motifs (from 3 to 30 positions), to perform differential motif discovery in pairs of sequence datasets, and to find motifs in sequences over virtually any alphabet (DNA, RNA, protein and user-defined alphabets). Unlike most motif discovery algorithms, STREME reports a useful estimate of the statistical significance of each motif it discovers. STREME is easy to use individually via its web server or via the command line, and is completely integrated with the widely used MEME Suite of sequence analysis tools. The name STREME stands for 'Simple, Thorough, Rapid, Enriched Motif Elicitation'.
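To illustrate the command-line use mentioned above, the following minimal Python sketch assembles a STREME invocation for differential motif discovery without running it. The option names (`--p`, `--n`, `--oc`, `--minw`, `--maxw`, `--dna`, `--rna`, `--protein`) reflect the MEME Suite documentation as understood here and should be checked against the installed version. The helper `build_streme_command` and all file names are hypothetical.

```python
import shlex
from typing import List, Optional


def build_streme_command(
    primary: str,
    out_dir: str,
    control: Optional[str] = None,
    alphabet: str = "dna",
    min_width: int = 3,
    max_width: int = 30,
) -> List[str]:
    """Assemble (but do not run) a STREME command line.

    Assumption: the flags below match the installed STREME version.
    The 3..30 width range is the one stated in the abstract.
    """
    if alphabet not in {"dna", "rna", "protein"}:
        raise ValueError(f"unsupported alphabet: {alphabet}")
    if not 3 <= min_width <= max_width <= 30:
        raise ValueError("widths must satisfy 3 <= min_width <= max_width <= 30")

    cmd = [
        "streme",
        "--p", primary,            # primary (e.g. ChIP-seq peak) sequences
        f"--{alphabet}",           # sequence alphabet
        "--minw", str(min_width),  # shortest motif width to consider
        "--maxw", str(max_width),  # longest motif width to consider
        "--oc", out_dir,           # output directory (overwritten if present)
    ]
    if control is not None:
        # Supplying a control set makes this differential motif discovery.
        cmd += ["--n", control]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_streme_command("peaks.fa", "streme_out", control="background.fa")
    print(shlex.join(cmd))
```

The list returned by the helper could be passed to `subprocess.run` once STREME is installed and on the `PATH`.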

Availability and implementation: The STREME web server and source code are provided freely for non-commercial use at http://meme-suite.org.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Accuracy of motif discovery algorithms on ENCODE TF ChIP-seq datasets. (a) The curves show the percentage of ChIP-seq datasets (Y) where the best motif found by the named algorithm has motif similarity score $\geq X$, averaged over 40 ChIP-seq datasets. (b) The sequence logos and accuracies of the best motifs found by STREME (top) and HOMER (bottom) in an ENCODE ChIP-seq dataset for the CTCF TF (UtaK562Ctcf), aligned to the SELEX reference motif (center) from the Jolma *et al.* (2013) compendium (CTCF_full). Similar alignments are given for all 40 ChIP-seq datasets in Supplementary Material.

**Fig. 2.**
Sensitivity of motif discovery algorithms on ENCODE TF ChIP-seq datasets. Each point shows the percentage of times (Y) that the best motif found by the named algorithm has motif similarity score at least 5 (Tomtom P-value $\leq 10^{-5}$) when run on a primary dataset that has been diluted to a given purity (X), averaged over 40 ChIP-seq datasets. Note that the lowest dataset purity tested is 1%.
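The captions for Figs 1 and 2 pair a motif similarity score of 5 with a Tomtom P-value of $10^{-5}$, which suggests the score is $-\log_{10}$ of the P-value. The sketch below takes that relationship as an assumption and shows how one point on such a curve could be computed. The function names and the example P-values are hypothetical and are not taken from the paper.

```python
import math
from typing import Iterable


def similarity_score(tomtom_p_value: float) -> float:
    """Motif similarity score, assumed here to be -log10(Tomtom P-value)."""
    if not 0.0 < tomtom_p_value <= 1.0:
        raise ValueError("P-value must be in (0, 1]")
    return -math.log10(tomtom_p_value)


def percent_at_or_above(scores: Iterable[float], threshold: float) -> float:
    """Percentage of datasets whose best-motif score is >= threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    hits = sum(1 for s in scores if s >= threshold)
    return 100.0 * hits / len(scores)


if __name__ == "__main__":
    # Made-up best-motif P-values for four datasets (illustration only).
    p_values = [1e-7, 1e-3, 1e-6, 0.2]
    scores = [similarity_score(p) for p in p_values]
    print(percent_at_or_above(scores, threshold=5.0))  # prints 50.0
```

Sweeping `threshold` over a range of X values would trace out a curve of the kind shown in Fig. 1a.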

**Fig. 3.**
Thoroughness of motif discovery algorithms on combined ENCODE TF ChIP-seq datasets. The curves show the percentage of 21 reference motifs (Y) for which the named algorithm finds a motif matching it with given motif similarity score (X) or better, averaged over 20 combined datasets.

**Fig. 4.**
Speed of motif discovery algorithms on ENCODE TF ChIP-seq datasets. Each point represents the running time (Y) of the named algorithm on one of 40 ENCODE TF ChIP-seq datasets containing the given number of sequences (X). For ease of interpretation, the points for each algorithm have been fit with a smooth Bézier curve.

**Fig. 5.**
Accuracy of motif discovery algorithms on ‘small’ datasets. Each point shows the percentage of times (Y) the best motif found by the named algorithm in a (sampled) dataset with the given number of sequences (X) has motif similarity score at least 5 (Tomtom P-value $\leq 10^{-5}$), averaged over 40 TF ChIP-seq datasets.

**Fig. 6.**
Running time as a function of maximum motif width. Each point shows the running time in seconds per motif found (Y) when the named motif finder is run with the given maximum motif width (X) set, averaged over 25 ChIP-seq primary sequence datasets each containing 10 000 sequences of length 100 bp. Error bars show standard error. The points for a given motif discovery algorithm are connected with straight lines for ease of interpretation. The algorithms were run on a 3.2 GHz Intel Core i7 processor with 16 GB of memory.
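Each point in Fig. 6 is a mean over datasets with a standard-error bar. A minimal sketch of that summary follows, assuming the usual definition of standard error as the sample standard deviation divided by $\sqrt{n}$. The function name and the timing values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for a set of measurements."""
    if len(values) < 2:
        raise ValueError("need at least two measurements")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


if __name__ == "__main__":
    # Made-up per-motif running times in seconds (illustration only).
    seconds_per_motif = [1.0, 2.0, 3.0, 4.0, 5.0]
    mean, sem = mean_and_standard_error(seconds_per_motif)
    print(f"{mean:.2f} +/- {sem:.2f} s per motif")
```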

**Fig. 7.**
Accuracy of motif discovery algorithms on ENCODE RBP eCLIP datasets. The curves in the figure show the percentage of times (Y) the best motif found by the named algorithm has motif similarity score X or better, averaged over 20 eCLIP datasets.

**Fig. 8.**
Q–Q accuracy plot of the P-values reported by STREME for motifs with at least 10 sites in the hold-out set. Shown is the Q–Q plot for the P-values reported by STREME run on 10 000 datasets containing 10 000 random DNA sequences. Primary and control sequences are 100 characters long. Ideally, the points should lie along the line y = x.
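The idea behind such a Q–Q plot can be sketched as follows: on random data, accurate P-values should be uniformly distributed, so the sorted reported P-values should track the expected uniform quantiles. This is a minimal sketch, not the procedure used in the paper. The plotting positions $i/(n+1)$ are one common convention and are an assumption here, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def qq_points(p_values: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each sorted P-value with its expected uniform quantile.

    Returns (expected, observed) pairs; accurate P-values computed on
    random data should fall close to the line y = x.
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("P-values must lie in [0, 1]")
    observed = sorted(p_values)
    n = len(observed)
    # Assumed plotting positions: i / (n + 1) for i = 1..n.
    expected = [(i + 1) / (n + 1) for i in range(n)]
    return list(zip(expected, observed))


if __name__ == "__main__":
    # Simulated, perfectly calibrated P-values (illustration only).
    rng = random.Random(0)
    simulated = [rng.random() for _ in range(10_000)]
    points = qq_points(simulated)
    worst = max(abs(exp - obs) for exp, obs in points)
    print(f"largest deviation from y = x: {worst:.4f}")
```

Plotting both axes on a $-\log_{10}$ scale, as is common for P-values, would emphasize the small P-values that matter most in practice.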

References

1. Bailey T.L. (2011) DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics, 27, 1653–1659.
2. Bailey T.L., Elkan C. (1995) The value of prior knowledge in discovering motifs with MEME. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, July 16–19, 1995, Cambridge, UK, Vol. 3, pp. 21–29.
3. Fedotova A.A. et al. (2017) C2H2 zinc finger proteins: the largest but poorly explored family of higher eukaryotic transcription factors. Acta Nat., 9, 47–58.
4. Fisher R.A. (1922) On the interpretation of $\chi^{2}$ from contingency tables, and the calculation of P. J. R. Stat. Soc., 85, 87–94.
5. Gupta S. et al. (2007) Quantifying similarity between motifs. Genome Biol., 8, R24.

Grants and funding

R01 GM103544/GM/NIGMS NIH HHS/United States
