PLoS One. 2011;6(5):e20025.
doi: 10.1371/journal.pone.0020025. Epub 2011 May 25.

Discovery of protein phosphorylation motifs through exploratory data analysis

Yi-Cheng Chen et al. PLoS One. 2011.

Abstract

Background: With the rapid expansion of proteomic sequence databases and the growing volume of data on phosphorylation sites, efficient algorithms for uncovering biologically relevant phosphorylation motifs have become increasingly important. Here we present a novel unsupervised method, called Motif Finder (F-Motif for short), for identifying phosphorylation motifs. F-Motif clusters sequences represented by numerical features that exploit the statistical information hidden in the foreground data. The identified motifs are then filtered to retain only the "actual" motifs, those with statistically significant motif scores.

Results and discussion: We applied F-Motif to several new and existing data sets and compared its performance with two well-known state-of-the-art methods. In almost all cases F-Motif identified all statistically significant motifs extracted by the state-of-the-art methods. More importantly, F-Motif also uncovered several additional motifs. Using clues from the literature, we have demonstrated that most of these new motifs discovered by F-Motif are indeed novel. We also found some interesting phenomena. For example, for CK2 kinase, the conserved sites appear only on the right side of S, whereas for CDK kinase, the site immediately to the right of S is conserved with residue P. In addition, three different encoding methods were used, including a novel position contrast matrix (PCM) and the simplest binary coding, and the ability of F-Motif to discover motifs remained quite robust with respect to the encoding scheme.

Conclusions: The iterative algorithm proposed here uses exploratory data analysis to discover motifs from phosphorylation data. The effectiveness of F-Motif has been demonstrated on several real data sets as well as on a synthetic data set. The method is quite general in nature and can also be used to find other types of motifs. We have also provided a server for F-Motif at http://f-motif.classcloud.org/, http://bio.classcloud.org/f-motif/ or http://ymu.classcloud.org/f-motif/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. An illustration of the extraction of different consensus sequence motifs by the clustering process.
A set of fixed-length sequences is represented by a sequence logo. Sequence logo (a) represents all of the PKA kinase substrates. The sequences in (a) are split into several clusters, and each cluster is then represented by its own sequence logo. Sequence logos (b) to (g) represent PKA kinase substrates in different clusters.
Figure 2
Figure 2. Overview of motif finding steps.
In Step 1, PCM encoding uses both the background data and the foreground data; PWM encoding uses the entire Phospho.ELM database in place of the background data; binary encoding uses neither the foreground nor the background data. In Step 2, the k-means clustering algorithm is run repeatedly to generate a composite motif list (CML). The final list of motifs is then built from the CML in a stepwise manner subject to two conditions: the motif is statistically significant under a binomial-distribution-based model, and it occurs at least M times in the current foreground data.
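The two filtering conditions in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the exact binomial model is not given here, so the sketch assumes a one-sided upper-tail test in which the per-sequence match probability is estimated from the background data. The names `count_matches`, `binomial_tail`, `keep_motif`, `min_count` (standing in for M) and `alpha` are hypothetical, and motifs are assumed to be written with `.` as a wildcard for non-conserved positions.

```python
import re
from math import comb


def count_matches(motif, sequences):
    """Count sequences matching the motif ('.' matches any residue)."""
    pattern = re.compile(motif)
    return sum(1 for seq in sequences if pattern.fullmatch(seq))


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def keep_motif(motif, foreground, background, min_count, alpha):
    """Apply both conditions: at least min_count foreground matches, and
    a binomial tail probability no larger than alpha (assumed test)."""
    if not background:
        raise ValueError("background must not be empty")
    k = count_matches(motif, foreground)
    if k < min_count:
        return False
    # Assumption: background match rate estimates the chance of a match.
    p = count_matches(motif, background) / len(background)
    return binomial_tail(k, len(foreground), p) <= alpha


# Toy data, invented purely for illustration.
foreground = ["RRASV", "RRGSL", "KRASV", "RRPSV"]
background = ["AAASA", "RRASV", "GGGSG", "LLLSL",
              "KKKSK", "PPPSP", "VVVSV", "TTTST"]
print(keep_motif("RR.S.", foreground, background, min_count=3, alpha=0.05))
```

On this toy data the motif matches three of four foreground sequences and one of eight background sequences, so it passes both conditions; raising `min_count` to 4 rejects it on frequency alone.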
Figure 3
Figure 3. An illustration of how a potential motif is extracted from a cluster.
First, for every position the frequency of each residue is counted. Then, for each position, the residue with the highest frequency is noted. If more than one residue has the same highest frequency, one of them is chosen at random. At the next stage of the process, sites whose most frequent residue has frequency ≥T are considered conserved sites and are used to generate a potential motif.
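The procedure in this caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `potential_motif` is hypothetical, T is assumed to be an absolute count rather than a fraction, non-conserved sites are written as `.`, and ties are broken alphabetically instead of at random so that the result is reproducible.

```python
from collections import Counter


def potential_motif(cluster, threshold):
    """Build a potential motif from equal-length sequences in one cluster.

    A position is kept as conserved when its most frequent residue occurs
    at least `threshold` times (assumed to be a count); other positions
    are written as '.'.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one sequence")
    length = len(cluster[0])
    if any(len(seq) != length for seq in cluster):
        raise ValueError("all sequences must have the same length")

    motif = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in cluster)
        # Sorting first makes max() return the alphabetically smallest
        # residue among ties (the caption describes a random choice).
        residue, freq = max(sorted(counts.items()), key=lambda kv: kv[1])
        motif.append(residue if freq >= threshold else ".")
    return "".join(motif)


# Toy cluster, invented purely for illustration.
cluster = ["RRASV", "RRGSL", "KRASV", "RRPSV"]
print(potential_motif(cluster, threshold=3))  # RR.SV
```

With a threshold of 3, the third position (A twice, G and P once each) is not conserved, while the other four positions are.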
