Bioinformatics. 2021 Sep 29;37(18):2834-2840.
doi: 10.1093/bioinformatics/btab203.

STREME: accurate and versatile sequence motif discovery

Timothy L Bailey. Bioinformatics.

Abstract

Motivation: Sequence motif discovery algorithms can identify novel sequence patterns that perform biological functions in DNA, RNA and protein sequences; for example, the binding site motifs of DNA- and RNA-binding proteins.

Results: The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility. Using in vivo DNA (ChIP-seq) and RNA (CLIP-seq) data, and validating motifs with reference motifs derived from in vitro data, we show that STREME is more accurate, sensitive and thorough than several widely used algorithms (DREME, HOMER, MEME, Peak-motifs) and two other representative algorithms (ProSampler and Weeder). STREME's capabilities include the ability to find motifs in datasets with hundreds of thousands of sequences, to find both short and long motifs (from 3 to 30 positions), to perform differential motif discovery in pairs of sequence datasets, and to find motifs in sequences over virtually any alphabet (DNA, RNA, protein and user-defined alphabets). Unlike most motif discovery algorithms, STREME reports a useful estimate of the statistical significance of each motif it discovers. STREME is easy to use individually via its web server or via the command line, and is completely integrated with the widely used MEME Suite of sequence analysis tools. The name STREME stands for 'Simple, Thorough, Rapid, Enriched Motif Elicitation'.
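
As an illustration of the idea behind differential motif discovery in a pair of datasets, the following is a minimal sketch that scores how strongly a fixed word is enriched in primary sequences relative to control sequences. It assumes exact word matching and a one-sided Fisher exact test; the abstract does not specify the statistic STREME uses, so this is not STREME's algorithm, and the function names are hypothetical.

```python
from math import comb


def fisher_enrichment_pvalue(pos_hits, pos_total, neg_hits, neg_total):
    """One-sided Fisher exact test (assumed statistic, not taken from the abstract).

    Returns the probability of observing at least `pos_hits` matching
    sequences in the primary set, given the fixed table margins.
    """
    total = pos_total + neg_total
    hits = pos_hits + neg_hits
    denom = comb(total, pos_total)
    upper = min(hits, pos_total)
    # Sum the hypergeometric probabilities of all tables at least as extreme.
    tail = sum(
        comb(hits, k) * comb(total - hits, pos_total - k)
        for k in range(pos_hits, upper + 1)
    )
    return tail / denom


def count_sequences_with(word, sequences):
    """Count sequences containing at least one exact occurrence of `word`."""
    return sum(1 for seq in sequences if word in seq)


def word_enrichment(word, primary, control):
    """Enrichment P-value of `word` in `primary` relative to `control`."""
    return fisher_enrichment_pvalue(
        count_sequences_with(word, primary), len(primary),
        count_sequences_with(word, control), len(control),
    )


# Toy data, invented for illustration only.
primary = ["ACGTGA", "TTACGT", "ACGTAC", "GGGGGG"]
control = ["TTTTTT", "CCCCCC", "GGGGGG", "ACGTTT"]
p_value = word_enrichment("ACGT", primary, control)
```

A real motif finder would score position-specific models rather than a single exact word, and would search over many candidates; the sketch only shows the primary-versus-control comparison.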

Availability and implementation: The STREME web server and source code are provided freely for non-commercial use at http://meme-suite.org.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Accuracy of motif discovery algorithms on ENCODE TF ChIP-seq datasets. The curves in (a) show the percentage of ChIP-seq datasets (Y) where the best motif found by the named algorithm has motif similarity score X, averaged over 40 ChIP-seq datasets. Panel (b) shows the sequence logos and accuracies of the best motifs found by STREME (top) and HOMER (bottom) in an ENCODE ChIP-seq dataset for the CTCF TF (UtaK562Ctcf), aligned to the SELEX reference motif (center) from the Jolma et al. (2013) compendium (CTCF_full). Similar alignments are given for all 40 ChIP-seq datasets in Supplementary Material.
Fig. 2.
Sensitivity of motif discovery algorithms on ENCODE TF ChIP-seq datasets. Each point shows the percentage of times (Y) that the best motif found by the named algorithm has motif similarity score at least 5 (Tomtom P-value ≤ 10⁻⁵) when run on a primary dataset that has been diluted to a given purity (X), averaged over 40 ChIP-seq datasets. Note that the lowest dataset purity tested is 1%.
Fig. 3.
Thoroughness of motif discovery algorithms on combined ENCODE TF ChIP-seq datasets. The curves show the percentage of 21 reference motifs (Y) for which the named algorithm finds a motif matching it with given motif similarity score (X) or better, averaged over 20 combined datasets.
Fig. 4.
Speed of motif discovery algorithms on ENCODE TF ChIP-seq datasets. Each point represents the running time (Y) of the named algorithm on one of 40 ENCODE TF ChIP-seq datasets containing the given number of sequences (X). For ease of interpretation, the points for each algorithm have been fit with a smooth Bezier curve.
Fig. 5.
Accuracy of motif discovery algorithms on ‘small’ datasets. Each point shows the percentage of times (Y) the best motif found by the named algorithm in a (sampled) dataset with the given number of sequences (X) has motif similarity score at least 5 (Tomtom P-value ≤ 10⁻⁵), averaged over each of 40 TF ChIP-seq datasets.
Fig. 6.
Running time as a function of maximum motif width. Each point shows the running time in seconds per motif found (Y) when the named motif finder is run with the given maximum motif width (X) set, averaged over 25 ChIP-seq primary sequence datasets each containing 10 000 sequences of length 100 bp. Error bars show standard error. The points for a given motif discovery algorithm are connected with straight lines for ease of interpretation. The algorithms were run on a 3.2 GHz Intel Core i7 processor with 16 GB of memory.
Fig. 7.
Accuracy of motif discovery algorithms on ENCODE RBP eCLIP datasets. The curves in the figure show the percentage of times (Y) the best motif found by the named algorithm has motif similarity score X or better, averaged over 20 eCLIP datasets.
Fig. 8.
Q–Q accuracy plot of the P-values reported by STREME for motifs with at least 10 sites in the hold-out set. Shown is the Q–Q plot for the P-values reported by STREME run on 10 000 datasets containing 10 000 random DNA sequences. Primary and control sequences are 100 characters long. Ideally, the points should lie along the line y = x.
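
The Q–Q check described for Fig. 8 can be sketched as follows: if reported P-values are accurate on random data, their sorted values should track the quantiles of a uniform distribution, so the points fall near y = x. This is a minimal sketch under assumptions of our own (the plotting-position formula rank/(n+1) and the helper names are hypothetical, and the random P-values are simulated rather than produced by STREME).

```python
import random


def qq_points(p_values):
    """Pair each sorted P-value with its expected uniform quantile.

    Uses the plotting position rank / (n + 1), an assumed convention.
    Returns a list of (expected, observed) tuples.
    """
    observed = sorted(p_values)
    n = len(observed)
    expected = [rank / (n + 1) for rank in range(1, n + 1)]
    return list(zip(expected, observed))


def max_deviation(points):
    """Largest vertical distance of any point from the line y = x."""
    return max(abs(expected - obs) for expected, obs in points)


# Simulated, well-calibrated P-values (uniform on [0, 1]); illustration only.
rng = random.Random(0)
simulated = [rng.random() for _ in range(10_000)]
points = qq_points(simulated)
deviation = max_deviation(points)
```

A small `deviation` is what a well-calibrated method would produce on random sequences; systematically observed values below the expected ones would indicate P-values that are too optimistic.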
