PLoS One. 2014 Sep 10;9(9):e106081.
doi: 10.1371/journal.pone.0106081. eCollection 2014.

Fast and accurate discovery of degenerate linear motifs in protein sequences

Abdellali Kelil et al. PLoS One. 2014.

Abstract

Linear motifs mediate a wide variety of cellular functions, which makes their characterization in protein sequences crucial to understanding cellular systems. However, the short length and degenerate nature of linear motifs make their discovery a difficult problem. Here, we introduce MotifHound, an algorithm particularly suited for the discovery of small and degenerate linear motifs. MotifHound performs an exact and exhaustive enumeration of all motifs present in proteins of interest, including all of their degenerate forms, and scores the overrepresentation of each motif based on its occurrence in proteins of interest relative to a background (e.g., proteome) using the hypergeometric distribution. To assess MotifHound, we benchmarked it against state-of-the-art algorithms. The benchmark consists of 11,880 sets of proteins from S. cerevisiae; in each set, we artificially spiked in one motif varying in terms of three key parameters: (i) number of occurrences, (ii) length, and (iii) number of degenerate or "wildcard" positions. The benchmark enabled the evaluation of the impact of these three properties on the performance of the different algorithms. The results showed that MotifHound and SLiMFinder were the most accurate in detecting degenerate linear motifs. Interestingly, MotifHound was 15 to 20 times faster at comparable accuracy and performed best in the discovery of highly degenerate motifs. We complemented the benchmark with an analysis of proteins experimentally shown to bind the FUS1 SH3 domain from S. cerevisiae. Using only the full-length sequences of the protein partners as input, MotifHound recapitulated most experimentally determined motifs binding to the FUS1 SH3 domain. Moreover, these motifs exhibited properties typical of SH3 binding peptides, e.g., high intrinsic disorder and evolutionary conservation, despite the fact that none of these properties were used as prior information.
MotifHound is available (http://michnick.bcm.umontreal.ca or http://tinyurl.com/motifhound) together with the benchmark that can be used as a reference to assess future developments in motif discovery.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic description of the processing steps involved in motif discovery with MotifHound.
(1) Given a set of protein sequences, (2) we enumerate all possible n-mers or motifs (3≤n≤12) present in this query dataset, (3) we then enumerate all degenerate forms of each motif and discard those present in fewer than 3 sequences of the query set. (4) All retained motifs are counted in the proteome used as background. Note that query sequences are necessarily part of the proteome and are colored in red. (5) The statistics of each motif are k: number of occurrences in the query, l: number of occurrences in the proteome, p: number of sequences in the query, b: number of sequences in the proteome. (6) These statistics are written to a tabulated file used to evaluate the overrepresentation of each motif. The P-value reflecting the overrepresentation in the query set relative to the background is calculated from the cumulative hypergeometric distribution (see Materials and Methods).
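The six steps in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MotifHound implementation: the function names, the use of "." as the wildcard symbol, and the choice to keep the first and last positions of every motif fixed are all hypothetical, and k and l are treated here as the number of sequences containing the motif at least once. The P-value is the upper tail of the hypergeometric distribution, P(X ≥ k), for p draws from b sequences of which l contain the motif.

```python
import re
from itertools import combinations
from math import comb

WILDCARD = "."  # assumed wildcard symbol; also matches any residue as a regex


def degenerate_forms(nmer):
    """Yield the n-mer and every form with interior positions wildcarded.

    Assumption: the first and last residues always stay fixed.
    """
    interior = range(1, len(nmer) - 1)
    for r in range(len(interior) + 1):
        for positions in combinations(interior, r):
            chars = list(nmer)
            for i in positions:
                chars[i] = WILDCARD
            yield "".join(chars)


def count_sequences_with_motifs(sequences, min_len=3, max_len=12):
    """Map each motif (exact or degenerate) to the number of sequences containing it."""
    counts = {}
    for seq in sequences:
        seen = set()
        for n in range(min_len, max_len + 1):
            for start in range(len(seq) - n + 1):
                seen.update(degenerate_forms(seq[start:start + n]))
        for motif in seen:
            counts[motif] = counts.get(motif, 0) + 1
    return counts


def hypergeom_pvalue(k, l, p, b):
    """P(X >= k) when drawing p sequences from b, of which l contain the motif."""
    tail = sum(comb(l, i) * comb(b - l, p - i) for i in range(k, min(l, p) + 1))
    return tail / comb(b, p)


def enriched_motifs(query, proteome, min_query_hits=3):
    """Return (P-value, motif, k, l) tuples, most significant first.

    Assumes sequences contain only amino-acid letters, so that each motif
    string is also a valid regular expression.
    """
    p, b = len(query), len(proteome)
    results = []
    for motif, k in count_sequences_with_motifs(query).items():
        if k < min_query_hits:
            continue  # step 3: discard motifs seen in fewer than 3 query sequences
        l = sum(1 for seq in proteome if re.search(motif, seq))  # step 4
        results.append((hypergeom_pvalue(k, l, p, b), motif, k, l))
    return sorted(results)
```

Because the query sequences are part of the proteome, l is always at least k, and a motif found only in the query set receives the smallest P-value available for that k.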
Figure 2. Design of the benchmark datasets.
The benchmark is composed of 11880 sets of 100 sequences, each set containing a specific spiked-in motif. The spiked-in motifs vary in terms of the three following parameters: their length, n, varying from 3 to 10 residues (8 values); their number of non-wildcard positions, D, varying from 3 to n (n-2 values per length, 36 values in total); and their number of occurrences in the set, N, equal to 3, 4, 5, 6, 7, 8, 10, 12, 15, 20 or 30 (11 values). For each combination of n, D and N, we created 30 replicates that differ in the motif being spiked in. Altogether, 1080 motifs were created for the benchmark (30 replicates × 36 masks), resulting in 11880 sets of 100 sequences (1080 motifs × 11 numbers of occurrences). A. We first create masks for each motif length in order to assign the wildcard and non-wildcard positions. ‘Ones’ indicate non-wildcard positions and ‘zeros’ indicate wildcard positions. The first and last positions are always non-wildcard; thus, n-2 masks are created for each length n, yielding 8+7+6+5+4+3+2+1 = 36 masks for lengths 10 to 3. B. In the second step, each mask is used to derive 30 unique motifs, by shuffling all positions (except the first and last) and replacing all non-wildcard positions by amino acids with frequencies drawn from the yeast proteome. In this example, 30 unique motifs are generated from a mask containing D = 6 non-wildcard positions. C. Finally, each motif so obtained is spiked in once in each of N sequences from a set composed of one hundred randomly sampled yeast protein sequences. The orange rectangles symbolize the spiked-in motifs, and the blue lines represent sequences. In this example, the motif has been inserted either 3, 4, 5, 6, 7, 8, 10, 12, 15, 20 or 30 times in the same dataset of 100 sequences.
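The benchmark construction in panels A to C can be mimicked with a short Python sketch, which also checks the counts quoted in the caption (36 masks, 1080 motifs, 11880 sets). Several details are assumptions made for illustration only: the function names are hypothetical, amino acids are drawn uniformly unless `weights` is supplied (the caption uses yeast proteome frequencies), "." stands for a wildcard, and the motif is written over an existing window of the sequence rather than inserted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LENGTHS = range(3, 11)                                 # n = 3..10
OCCURRENCES = (3, 4, 5, 6, 7, 8, 10, 12, 15, 20, 30)   # N
REPLICATES = 30


def make_masks():
    """Panel A: one mask per (n, D); '1' = non-wildcard, '0' = wildcard."""
    masks = []
    for n in LENGTHS:
        for d in range(3, n + 1):
            interior = "1" * (d - 2) + "0" * (n - d)
            masks.append("1" + interior + "1")  # first and last always fixed
    return masks


def motif_from_mask(mask, rng, weights=None):
    """Panel B: shuffle interior positions, then fill the ones with residues."""
    interior = list(mask[1:-1])
    rng.shuffle(interior)
    shuffled = "1" + "".join(interior) + "1"
    return "".join(
        rng.choices(AMINO_ACIDS, weights=weights)[0] if bit == "1" else "."
        for bit in shuffled
    )


def motifs_for_mask(mask, rng, count=REPLICATES):
    """Draw motifs until `count` distinct ones have been collected."""
    motifs = set()
    while len(motifs) < count:
        motifs.add(motif_from_mask(mask, rng))
    return sorted(motifs)


def spike_in(sequences, motif, n_occurrences, rng):
    """Panel C: place the motif once in each of N randomly chosen sequences.

    Assumption: non-wildcard positions overwrite the residues of a random
    window; wildcard positions keep the original residue.
    """
    spiked = list(sequences)
    for idx in rng.sample(range(len(spiked)), n_occurrences):
        seq = spiked[idx]
        start = rng.randrange(len(seq) - len(motif) + 1)
        window = seq[start:start + len(motif)]
        merged = "".join(m if m != "." else s for m, s in zip(motif, window))
        spiked[idx] = seq[:start] + merged + seq[start + len(motif):]
    return spiked


masks = make_masks()
n_motifs = len(masks) * REPLICATES
n_sets = n_motifs * len(OCCURRENCES)
```

With these constants, `len(masks)`, `n_motifs` and `n_sets` reproduce the 36, 1080 and 11880 given in the caption.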
Figure 3. Comparative analysis of different algorithms in the discovery of degenerate linear motifs.
A. Each graph shows the motif detection accuracy (y-axis) as a function of the number of sequences N where the motif was spiked in (x-axis; N = 12 and N = 20 are not shown). Each graph shows the results for a motif of length n (columns) and with 3 to 10 non-wildcard positions (i.e. 0 to 7 wildcards) (rows). Colors correspond to the methods tested (red: MotifHound, blue: MEME, green: SLiMFinder and purple: TEIRESIAS). The accuracy is assessed by the capacity of each method to recover the motifs spiked into the datasets. For each combination of parameters (length, number of wildcards, number of occurrences), there are 30 unique motifs inserted, each in a unique set of 100 sequences. Thus, each motif correctly identified increases the accuracy by 3.33%, and an accuracy of 100% means that the motifs in all 30 replicates were correctly identified as being the most significant. B. Statistics on length and number of wildcard positions for biological motifs extracted from the Human Protein Reference Database (HPRD), the Eukaryotic Linear Motif database (ELM) and the MiniMotif resource. We present the statistics for 5 groups of motifs defined according to the number D of non-wildcard positions that they contain (D<3; 3≤D≤5; D = 6; 7≤D≤10; D>10). C. Barplot of running times in seconds for three scenarios representing different levels of motif-search complexity. The best scenario is a short, well-defined motif (length 4 and 1 wildcard), the worst scenario is a long and degenerate motif (length 10 and 6 wildcards), and an intermediate scenario corresponds to a long and slightly degenerate motif (length 10 and 3 wildcards). Each color corresponds to a method, as in (A).
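The accuracy measure of panel A reduces to a simple fraction, shown here as a hedged sketch. The function name is hypothetical, and an exact string match between the spiked-in motif and the top-ranked motif is an assumption; the caption only states that a motif counts as recovered when it is identified as the most significant.

```python
def detection_accuracy(spiked_motifs, top_ranked_motifs):
    """Percentage of replicates whose top-ranked motif is the spiked-in one."""
    if len(spiked_motifs) != len(top_ranked_motifs):
        raise ValueError("one top-ranked motif is expected per replicate")
    hits = sum(s == t for s, t in zip(spiked_motifs, top_ranked_motifs))
    return 100.0 * hits / len(spiked_motifs)
```

With 30 replicates per parameter combination, each additional hit adds 100/30 ≈ 3.33 percentage points, which is the step size mentioned in the caption.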
Figure 4. Detection of linear motifs in 22 protein targets known to bind the FUS1 SH3 domain.
A. FUS1 is a yeast protein involved in mating. The SH3 domain of FUS1 is known to bind 22 proteins, through 25 binding sites that cover 2.15% of their total sequence length. We blindly submitted the 22 protein sequences to several algorithms for them to detect the binding sites. Two types of algorithms were considered: motif-based algorithms, which detect specific overrepresented motifs, and region-based algorithms, which detect regions predicted to encode any linear motif. We then considered the top motifs (or regions) returned by each algorithm, such that they covered at most 2.15% of the total sequence length. For each algorithm tested, we plot the coverage of the motifs identified, as well as the corresponding coverage of known SH3 binding sites identified. B. Among motifs experimentally characterized to mediate the recognition between FUS1 and its partners, some motifs correspond to the known consensus R[ST][ST]SL and others do not. Here we plot the fraction of coverage for both types, showing that MotifHound is particularly able to identify non-consensus sequences. Together, the results in panels A and B show that the measure of “motif enrichment” introduced with MotifHound enables the accurate detection of functional linear motifs, and is in fact the best in this case. C. Linear motifs tend to exhibit specific biological signatures: they tend to be conserved, they tend to appear in solvent-accessible as well as in disordered regions, and in the case of SH3 recognition motifs they should exhibit favourable free energies of association with the SH3 domain. We compared these properties for motifs identified using the different methods, showing that those returned by MotifHound consistently reflect these properties.
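One way to read the coverage cap in panel A is as a budget on covered residues: predictions are accepted in rank order until adding the next one would exceed 2.15% of the total sequence length. The sketch below illustrates that reading only; the exact selection procedure is not given in the caption, and the function names and the (protein, start, end) interval representation with an exclusive end are hypothetical.

```python
def select_within_coverage(ranked_hits, total_length, max_fraction=0.0215):
    """Accept ranked predictions while covered residues stay within the budget.

    ranked_hits: list (best first) of lists of (protein_id, start, end)
    intervals, end exclusive. Returns the accepted ranks and covered residues.
    """
    budget = max_fraction * total_length
    covered = set()
    selected = []
    for rank, intervals in enumerate(ranked_hits):
        candidate = covered | {
            (pid, pos) for pid, start, end in intervals for pos in range(start, end)
        }
        if len(candidate) > budget:
            break  # the next prediction would exceed the coverage cap
        covered = candidate
        selected.append(rank)
    return selected, covered


def site_coverage(covered, known_sites):
    """Fraction of known binding-site residues that fall in covered positions."""
    site_positions = {
        (pid, pos) for pid, start, end in known_sites for pos in range(start, end)
    }
    return len(site_positions & covered) / len(site_positions)
```

Capping every method at the same coverage makes the comparison fair, since a method cannot improve its recovery of known sites simply by predicting more residues.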
