PLoS One. 2007 Oct 3;2(10):e967.

doi: 10.1371/journal.pone.0000967.

SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins

Richard J Edwards¹, Norman E Davey, Denis C Shields

PMID: 17912346
PMCID: PMC1989135
DOI: 10.1371/journal.pone.0000967

Richard J Edwards et al. PLoS One. 2007.

Affiliation

¹ University College Dublin Complex and Adaptive Systems Laboratory, University College Dublin, Dublin, Ireland. r.edwards@soton.ac.uk

Abstract

Background: Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular, it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates the matter.

Methodology/principal findings: In this paper we present two algorithms. SLiMBuild identifies convergently evolved, short motifs in a dataset of proteins. Motifs are built by combining dimers into longer patterns, retaining only those motifs occurring in a sufficient number of unrelated proteins. Motifs with fixed amino acid positions are identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. The algorithm is computationally efficient compared to alternatives, particularly when datasets include homologous proteins, and provides great flexibility in the nature of motifs returned. The SLiMChance algorithm estimates the probability of returned motifs arising by chance, correcting for the size and composition of the dataset, and assigns a significance value to each motif. These algorithms are implemented in a software package, SLiMFinder. SLiMFinder default settings identify known SLiMs with 100% specificity, and have a low false discovery rate on random test data.

Conclusions/significance: The efficiency of SLiMBuild and the low false discovery rate of SLiMChance make SLiMFinder well suited to both high-throughput motif discovery and individual high-quality analyses. Examples of such analyses on real biological data, and how SLiMFinder results can help direct future discoveries, are provided. SLiMFinder is freely available for download under a GNU license from http://bioinformatics.ucd.ie/shields/software/slimfinder/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Overview of SLiMFinder.**
An input dataset is first clustered into unrelated protein clusters (UPC) using a treatment of BLAST results to identify evolutionary relationships. The dataset is also masked according to user choices, masking out predicted ordered regions, selected UniProt features, low complexity regions and/or N-terminal methionines. This (masked) dataset is then processed by the SLiMBuild algorithm to identify motifs that are shared by unrelated proteins. A TEIRESIAS-style output of all motifs can be produced at this point. Amino acid frequencies are calculated for each cluster of unrelated proteins, either before or after masking, and may be retained as cluster-specific frequencies or averaged over all clusters. Alternatively, amino acid frequencies may be given from an external source. These frequencies are combined with data from SLiMBuild on the motif composition of the dataset and processed by the SLiMChance algorithm, which identifies significantly over-represented motifs. These motifs and additional dataset information are then output into results files.
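The final step of this pipeline, scoring a motif against amino acid frequencies, can be illustrated with a deliberately simplified sketch. It is not the published SLiMChance calculation: it assumes a fixed-position motif, independent sites, the same number of candidate sites in every unrelated protein cluster, and it applies no correction for the number of motifs tested. All function names and the example frequencies are hypothetical.

```python
from math import comb


def motif_position_probability(motif, freqs):
    """Chance that `motif` matches at one given site ('.' matches any residue)."""
    p = 1.0
    for symbol in motif:
        if symbol != ".":
            p *= freqs[symbol]
    return p


def sequence_probability(p_site, n_sites):
    """Chance of at least one match among `n_sites` independent sites."""
    return 1.0 - (1.0 - p_site) ** n_sites


def binomial_tail(k, n, p):
    """Chance of k or more successes in n independent trials of probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def support_probability(motif, freqs, n_sites, n_clusters, support):
    """Chance that `support` or more of `n_clusters` clusters contain the motif.

    Assumption: every cluster offers the same number of candidate sites.
    """
    p_site = motif_position_probability(motif, freqs)
    p_cluster = sequence_probability(p_site, n_sites)
    return binomial_tail(support, n_clusters, p_cluster)


# Hypothetical frequencies, for illustration only.
example_freqs = {"P": 0.06, "K": 0.06, "L": 0.10}
p_value = support_probability("P.KL", example_freqs, n_sites=300,
                              n_clusters=10, support=5)
```

The sketch shows only why dataset size (`n_clusters`, `n_sites`) and composition (`freqs`) both enter the estimate; a real analysis should rely on SLiMFinder itself.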

**Figure 2. SLiMBuild construction of motifs.**
A. Dimer construction. For each position in a sequence, each possible wildcard length x is used to find possible “i-x-j” dimers. Dimers containing masked (“X”) residues are ignored (greyed dimers). Note that the N-terminal “^” marker is treated as any other amino acid. B. Motif extension. Longer SLiMs are constructed during the SLiMBuild process by matching the occurrences of shorter SLiMs with the relevant i-x-j dimers. At each stage, only SLiMs with sufficient unrelated protein support are retained, making the algorithm very efficient.
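The dimer construction in panel A can be sketched as follows. This is a minimal illustration rather than the SLiMFinder source: the function names, the `max_wildcard` and `min_support` parameters, and the choice to check only the two defined positions for masking are assumptions, and each input sequence is treated as its own unrelated protein cluster.

```python
from collections import defaultdict


def find_dimers(sequence, max_wildcard=2):
    """Map each (i, x, j) dimer to the start positions where it occurs.

    i and j are defined residues separated by x wildcard positions.
    Dimers whose defined residues are masked ("X") are skipped; the
    N-terminal "^" marker is handled like any other residue.
    """
    dimers = defaultdict(list)
    for pos, first in enumerate(sequence):
        if first == "X":
            continue
        for x in range(max_wildcard + 1):
            partner = pos + x + 1
            if partner >= len(sequence):
                break  # longer wildcards would also run off the end
            second = sequence[partner]
            if second == "X":
                continue
            dimers[(first, x, second)].append(pos)
    return dict(dimers)


def supported_dimers(sequences, min_support=2, max_wildcard=2):
    """Keep only dimers found in at least `min_support` sequences."""
    support = defaultdict(set)
    for index, sequence in enumerate(sequences):
        for dimer in find_dimers(sequence, max_wildcard):
            support[dimer].add(index)
    return {d: s for d, s in support.items() if len(s) >= min_support}
```

For example, `find_dimers("^MXA", max_wildcard=1)` yields the dimers `("^", 0, "M")` and `("M", 1, "A")`, while every dimer touching the masked `X` is dropped. Filtering by support before any extension is what keeps the number of candidate motifs small.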

**Figure 3. SLiMBuild Ambiguity.**
A. Wildcard ambiguity. Ambiguity is added in a multi-stage process. First, the motif is broken up into its component parts, consisting of alternate defined and wildcard positions. These are then replaced by the appropriate equivalency group, which in the case of wildcards is the full range of wildcard lengths from 0 up to the maximum length allowed. These equivalencies are then expanded to all possible variants. Any variants that do not themselves meet the minimum support requirement used previously for motif extension are not considered (shown in grey). Variants are only combined when the UPC support for the ambiguous motif is greater than for the individual variants. Variants that would not increase the UPC support of the original motif are therefore also removed (shown in red). The remaining variants are ranked (see text) and the best variant combined with the original motif (blue). The remaining variants are re-assessed for increasing UPC support and any failing to do so are again removed. If any remain, the ranking and combining cycle repeats. If not, the finished degenerate motif is returned. B. Amino acid ambiguities. These are handled in the same way as wildcard ambiguities, except that this time equivalencies are defined by the given equivalency list. If a given amino acid belongs to multiple equivalency groups, such as serine ([AGS] and [ST]), then all possible combinations of these equivalency groups (four in this case) are considered separately, so multiple ambiguous SLiMs can potentially be produced. (Expansion of these combinations has been truncated in the figure.)
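The expansion step described in this caption can be sketched in a few lines. The code below covers only the enumeration of variants, not the support filtering, ranking or combining cycle. The function names are hypothetical, and reading "four in this case" as every subset of the groups containing the residue is an assumption, not something the caption states explicitly.

```python
from itertools import combinations, product


def wildcard_variants(residues, max_wildcard=2):
    """List every motif formed by placing 0..max_wildcard wildcards ('.')
    between consecutive defined residues."""
    gap_choices = [range(max_wildcard + 1)] * (len(residues) - 1)
    variants = []
    for gaps in product(*gap_choices):
        motif = residues[0]
        for gap, residue in zip(gaps, residues[1:]):
            motif += "." * gap + residue
        variants.append(motif)
    return variants


def equivalency_options(residue, groups):
    """List the ambiguity sets for `residue`: one per subset of the
    equivalency groups that contain it (assumed interpretation)."""
    member_of = [group for group in groups if residue in group]
    options = []
    for size in range(len(member_of) + 1):
        for chosen in combinations(member_of, size):
            letters = set(residue).union(*chosen)
            options.append("".join(sorted(letters)))
    return options
```

With `max_wildcard=1`, the defined residues P, K and L expand to `PKL`, `PK.L`, `P.KL` and `P.K.L`. Under the stated assumption, serine with groups `AGS` and `ST` gives four options: `S`, `AGS`, `ST` and `AGST`.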

**Figure 4. SLiMFinder results on random datasets.**
A. Cumulative frequency of the most significant motifs returned by SLiMFinder for random datasets. Very little difference is observed between datasets produced using human amino acid frequencies and datasets of actual human protein sequences, implying that there is little or no bias introduced by regional compositional biases within real protein sequences. B. Box plots of most significant results returned by all random datasets for different dataset sizes (UPC). Although there is a slight trend for larger datasets to return smaller p-values, the difference is primarily restricted to the non-significant motifs. Variation between datasets of the same size is considerably greater than variation between different sized datasets.
