PLoS One. 2014 Sep 10;9(9):e106081.

doi: 10.1371/journal.pone.0106081. eCollection 2014.

Fast and accurate discovery of degenerate linear motifs in protein sequences

Abdellali Kelil¹, Benjamin Dubreuil¹, Emmanuel D Levy¹, Stephen W Michnick¹

Affiliation

¹ Département de Biochimie and Centre Robert-Cedergren, Bio-Informatique et Génomique, Université de Montréal, Succursale Centre-Ville, Montreal, Quebec, Canada.

PMID: 25207816
PMCID: PMC4160167
DOI: 10.1371/journal.pone.0106081

Abdellali Kelil et al. PLoS One. 2014.

Abstract

Linear motifs mediate a wide variety of cellular functions, which makes their characterization in protein sequences crucial to understanding cellular systems. However, the short length and degenerate nature of linear motifs make their discovery a difficult problem. Here, we introduce MotifHound, an algorithm particularly suited for the discovery of small and degenerate linear motifs. MotifHound performs an exact and exhaustive enumeration of all motifs present in proteins of interest, including all of their degenerate forms, and scores the overrepresentation of each motif based on its occurrence in proteins of interest relative to a background (e.g., proteome) using the hypergeometric distribution. To assess MotifHound, we benchmarked it together with state-of-the-art algorithms. The benchmark consists of 11,880 sets of proteins from S. cerevisiae; in each set, we artificially spiked in one motif varying in terms of three key parameters: (i) number of occurrences, (ii) length, and (iii) number of degenerate or "wildcard" positions. The benchmark enabled the evaluation of the impact of these three properties on the performance of the different algorithms. The results showed that MotifHound and SLiMFinder were the most accurate in detecting degenerate linear motifs. Interestingly, MotifHound was 15 to 20 times faster at comparable accuracy and performed best in the discovery of highly degenerate motifs. We complemented the benchmark with an analysis of proteins experimentally shown to bind the FUS1 SH3 domain from S. cerevisiae. Using the full-length protein partners as sole information, MotifHound recapitulated most experimentally determined motifs binding to the FUS1 SH3 domain. Moreover, these motifs exhibited properties typical of SH3 binding peptides, e.g., high intrinsic disorder and evolutionary conservation, despite the fact that none of these properties were used as prior information. MotifHound is available (http://michnick.bcm.umontreal.ca or http://tinyurl.com/motifhound) together with the benchmark that can be used as a reference to assess future developments in motif discovery.
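
The enumeration step described in the abstract can be illustrated with a small sketch. It is not the MotifHound implementation: the function names, the `.` wildcard symbol, the rule that the first and last positions always stay defined, the minimum of three defined positions, and the short `max_len` default are all assumptions chosen to keep the example small.

```python
from collections import defaultdict
from itertools import combinations


def degenerate_forms(kmer, min_fixed=3):
    """Yield the k-mer and its variants with interior positions set to '.'.

    Assumptions of this sketch: the first and last residues are never
    wildcards, and at least `min_fixed` positions stay defined.
    """
    n = len(kmer)
    interior = range(1, n - 1)
    max_wild = n - min_fixed
    for n_wild in range(max_wild + 1):
        for wild_positions in combinations(interior, n_wild):
            chars = list(kmer)
            for i in wild_positions:
                chars[i] = "."
            yield "".join(chars)


def enumerate_motifs(seqs, min_len=3, max_len=5, min_support=3):
    """Map each motif to the number of sequences that contain it.

    Motifs found in fewer than `min_support` sequences are discarded.
    """
    support = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for n in range(min_len, max_len + 1):
            for start in range(len(seq) - n + 1):
                for motif in degenerate_forms(seq[start:start + n]):
                    support[motif].add(idx)
    return {m: len(ids) for m, ids in support.items() if len(ids) >= min_support}


# Toy query set (hypothetical sequences, not from the paper).
toy_query = ["MKRSTSLA", "GGRATSLP", "RQTSLKK", "AAAAAA"]
toy_motifs = enumerate_motifs(toy_query)
```

Counting sequences rather than raw occurrences keeps a motif repeated within one protein from inflating its support. The number of degenerate forms grows quickly with motif length, which is why the sketch caps `max_len` well below the 12 residues mentioned in Figure 1.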

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Schematic description of the processing steps involved in motif discovery with MotifHound.**
(1) Given a set of protein sequences, (2) we enumerate all possible n-mers or motifs (3≤n≤12) present in this query dataset, (3) we then enumerate all degenerate forms of each motif, and we discard those present in fewer than 3 sequences of the query set. (4) All the motifs retained are counted in the proteome used as background. Note that query sequences are necessarily part of the proteome and are colored in red. (5) The statistics of each motif are k: number of occurrences in the query, l: number of occurrences in the proteome, p: number of sequences in the query, b: number of sequences in the proteome. These are written in a tabulated file used to evaluate the overrepresentation of each motif (6). The *P-value* reflecting the overrepresentation in the query set relative to the background is calculated by the cumulative hypergeometric distribution (see Materials and Methods).
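
The scoring in step (6) can be sketched as an upper-tail hypergeometric probability using the symbols k, l, p and b from the caption. The sketch assumes that k and l count the sequences containing the motif (at most one per sequence), so that the query is treated as p draws without replacement from b proteome sequences; the exact definition used by MotifHound is given in the paper's methods and may differ. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, l, p, b):
    """P(X >= k) for X ~ Hypergeometric(population b, successes l, draws p).

    k: query sequences containing the motif
    l: proteome sequences containing the motif
    p: number of query sequences
    b: number of proteome sequences
    """
    upper = min(l, p)
    favourable = sum(comb(l, i) * comb(b - l, p - i) for i in range(k, upper + 1))
    return favourable / comb(b, p)


# Small worked case: 10 sequences, 3 carry the motif, 4 are drawn.
# P(X >= 2) = (C(3,2)*C(7,2) + C(3,3)*C(7,1)) / C(10,4) = 70 / 210
p_value = hypergeom_upper_tail(k=2, l=3, p=4, b=10)
```

Exact integer binomials avoid rounding problems for small inputs; for proteome-scale numbers a library routine such as a survival function from a statistics package would be a reasonable substitute.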

**Figure 2. Design of the benchmark datasets.**
The benchmark is composed of 11880 sets of 100 sequences, each set containing a specific motif spiked-in. The motifs spiked-in vary in terms of the three following parameters: their length, n, varying from 3 to 10 residues (8 values); their number of non-wildcard positions, D, varying from 3 to n (n-2 values per length, 36 values in total); and their number of occurrences in the set, N, equal to 3, 4, 5, 6, 7, 8, 10, 12, 15, 20 or 30 (11 values). For each combination of n, D and N, we created 30 replicates varying in the motif being spiked-in. Altogether, 1080 motifs were created for the benchmark (30 replicates × 36 masks), resulting in 11880 sets of 100 sequences (1080 motifs × 11 occurrence counts). A. We first create masks for each motif length in order to assign the wildcards and non-wildcard positions. ‘Ones’ indicate non-wildcard positions and ‘zeros’ indicate wildcard positions. The first and last positions are always non-wildcard, thus, n-2 masks are created for each length n, yielding 8+7+6+5+4+3+2+1 = 36 masks for lengths 10 to 3. B. In the second step, each mask is used to derive 30 unique motifs, by shuffling all positions (except the first and last) and replacing all non-wildcard positions by amino acids with frequencies drawn from the yeast proteome. In this example, 30 unique motifs are generated from a mask containing D = 6 non-wildcard positions. C. Finally, each motif so obtained is spiked-in once in N sequences from a set, each composed of one hundred yeast protein sequences randomly sampled. The orange rectangles symbolize the motifs spiked in, and the blue lines represent sequences. In this example, the motif has been inserted either 3, 4, 5, 6, 7, 8, 10, 12, 15, 20 or 30 times in the same dataset of 100 sequences.
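
The counts in the benchmark design (36 masks, 1080 motifs, 11880 sets) can be checked with a short sketch of panels A and B. All names are hypothetical, and amino acids are drawn uniformly here as a stand-in assumption; the caption states that the benchmark used frequencies from the yeast proteome, which could be supplied through the `weights` argument.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
OCCURRENCES = [3, 4, 5, 6, 7, 8, 10, 12, 15, 20, 30]


def masks_for_length(n):
    """One mask per D in 3..n; first and last positions are always 1."""
    return [[1] + [1] * (d - 2) + [0] * (n - d) + [1] for d in range(3, n + 1)]


def motif_from_mask(mask, rng, weights=None):
    """Shuffle the interior of the mask, then fill defined positions."""
    interior = mask[1:-1]
    rng.shuffle(interior)
    shuffled = [1] + interior + [1]
    residues = rng.choices(AMINO_ACIDS, weights=weights, k=len(shuffled))
    return "".join(r if bit else "." for r, bit in zip(residues, shuffled))


def build_motifs(replicates=30, seed=0):
    """Return {(n, D): sorted list of `replicates` unique motifs}."""
    rng = random.Random(seed)
    motifs = {}
    for n in range(3, 11):
        for mask in masks_for_length(n):
            unique = set()
            while len(unique) < replicates:
                unique.add(motif_from_mask(mask, rng))
            motifs[(n, sum(mask))] = sorted(unique)
    return motifs


benchmark_motifs = build_motifs()
n_masks = len(benchmark_motifs)                             # 36
n_motifs = sum(len(v) for v in benchmark_motifs.values())   # 30 * 36 = 1080
n_sets = n_motifs * len(OCCURRENCES)                        # 1080 * 11 = 11880
```

Panel C (spiking each motif into N of 100 sampled yeast sequences) is left out because it depends on the proteome file, which this sketch does not assume.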

**Figure 3. Comparative analysis of different algorithms in the discovery of degenerate linear motifs.**
A. Each graph shows the motif detection accuracy (y-axis) as a function of the number of sequences N where the motif was spiked-in (x-axis; N = 12 and N = 20 are not shown). Each graph shows the results for a motif of length n (columns) and with 3 to 10 non-wildcard positions (*i.e.* 0 to 7 wildcards) (rows). Colors correspond to the methods tested (red: MotifHound, blue: MEME, green: SLiMFinder and purple: TEIRESIAS). The accuracy is assessed by the capacity of each method to recover the motifs spiked-in the datasets. For each combination of parameters (length, number of wildcards, number of occurrences), there are 30 unique motifs inserted, each in a unique set of 100 sequences. Thus, each motif correctly identified increases the accuracy by 3.33%, and an accuracy of 100% means that the motifs in all 30 replicates were correctly identified as being the most significant. B. Statistics on length and number of wildcard positions for biological motifs extracted from the Human Protein Reference Database (HPRD), the Eukaryotic Linear Motif database (ELM) and the MiniMotif resource. We present the statistics for 5 groups of motifs defined according to the number D of non-wildcard positions that they contain (D<3; 3≤D≤5; D = 6; 7≤D≤10; D>10). C. Barplot of running times in seconds for three scenarios representing different levels of motif-search complexity. The best scenario is a short well-defined motif (length 4 and 1 wildcard), the worst scenario is a long and degenerate motif (length 10 and 6 wildcards), and an intermediate scenario corresponds to a long, weakly degenerate motif (length 10 and 3 wildcards). Each color corresponds to a method, as in (A).
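
The accuracy measure in panel A reduces to a simple proportion over the 30 replicates, which the following sketch makes explicit. The function name and the toy inputs are hypothetical, and motif matching is simplified to exact string equality, which is an assumption; the paper's criterion for a "correctly identified" motif may be more permissive.

```python
def detection_accuracy(top_ranked, spiked):
    """Percentage of replicates whose top-ranked motif equals the spiked motif."""
    if len(top_ranked) != len(spiked):
        raise ValueError("one top-ranked motif is expected per replicate")
    correct = sum(1 for found, truth in zip(top_ranked, spiked) if found == truth)
    return 100.0 * correct / len(spiked)


# Toy case: 30 replicates, only the first one recovered.
spiked_motifs = ["R.TSL"] * 30
reported = ["R.TSL"] + ["P..P"] * 29
one_hit = detection_accuracy(reported, spiked_motifs)  # one of 30 replicates
```

With 30 replicates, each recovered motif contributes 100/30, about 3.33 percentage points, matching the step size stated in the caption.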

**Figure 4. Detection of linear motifs in 22 protein targets known to bind the FUS1 SH3 domain.**
A. FUS1 is a yeast protein involved in mating. The SH3 domain of FUS1 is known to bind 22 proteins, through 25 binding sites that cover 2.15% of their total sequence length. We blindly submitted the 22 protein sequences to several algorithms for them to detect the binding sites. Two types of algorithms were considered: motif-based algorithms, which detect specific overrepresented motifs, and region-based algorithms, which detect regions predicted to encode any linear motif. We then considered the top motifs (or regions) returned by each algorithm, such that they covered at most 2.15% of the total sequence length. For each algorithm tested, we plot the coverage of the motifs identified, as well as the corresponding coverage of known SH3 binding sites identified. B. Among motifs experimentally characterized to mediate the recognition between FUS1 and its partners, some motifs correspond to the known consensus R[ST][ST]SL and others do not. Here we plot the fraction of coverage for both types, showing that MotifHound is particularly able to identify non-consensus sequences. Together, the results in panels A and B show that the measure of “*motif enrichment*” introduced with MotifHound enables the accurate detection of functional linear motifs, and is in fact the best in this case. C. Linear motifs tend to exhibit specific biological signatures: they tend to be conserved, they tend to appear in solvent-accessible as well as in disordered regions, and in the case of SH3 recognition motifs they should exhibit favourable free energies of association with the SH3 domain. We compared these properties for motifs identified using the different methods, showing that those returned by MotifHound consistently reflect these properties.

Cited by

FaSTPACE: a fast and scalable tool for peptide alignment and consensus extraction.
Kotb HM, Davey NE. NAR Genom Bioinform. 2024 Aug 21;6(3):lqae103. doi: 10.1093/nargab/lqae103. eCollection 2024 Sep. PMID: 39170861. Free PMC article.
High-throughput methods for identification of protein-protein interactions involving short linear motifs.
Blikstad C, Ivarsson Y. Cell Commun Signal. 2015 Aug 22;13:38. doi: 10.1186/s12964-015-0116-8. PMID: 26297553. Free PMC article.
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX).
Asgari E, McHardy AC, Mofrad MRK. Sci Rep. 2019 Mar 5;9(1):3577. doi: 10.1038/s41598-019-38746-w. PMID: 30837494. Free PMC article.
HH-MOTiF: de novo detection of short linear motifs in proteins by Hidden Markov Model comparisons.
Prytuliak R, Volkmer M, Meier M, Habermann BH. Nucleic Acids Res. 2017 Jul 3;45(W1):W470-W477. doi: 10.1093/nar/gkx341. PMID: 28460141. Free PMC article.
A bioinformatics pipeline to search functional motifs within whole-proteome data: a case study of poxviruses.
Sobhy H. Virus Genes. 2017 Apr;53(2):173-178. doi: 10.1007/s11262-016-1416-9. Epub 2016 Dec 20. PMID: 28000080. Free PMC article.
