Discovery of protein phosphorylation motifs through exploratory data analysis

Yi-Cheng Chen¹, Kripamoy Aguan, Chu-Wen Yang, Yao-Tsung Wang, Nikhil R Pal, I-Fang Chung

PMID: 21647451
PMCID: PMC3102080
DOI: 10.1371/journal.pone.0020025

Yi-Cheng Chen et al. PLoS One. 2011;6(5):e20025. doi: 10.1371/journal.pone.0020025. Epub 2011 May 25.

Affiliation

¹ Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.

Abstract

Background: The rapid expansion of proteomic sequence databases, together with a wealth of new information on phosphorylation sites, has made efficient algorithms for uncovering biologically relevant phosphorylation motifs increasingly important. Here we present a novel unsupervised method, called Motif Finder (in short, F-Motif), for identifying phosphorylation motifs. F-Motif clusters sequences represented by numerical features that exploit the statistical information hidden in foreground data. The motifs identified in this way are then filtered to retain only the "actual" motifs, those with statistically significant motif scores.

Results and discussion: We applied F-Motif to several new and existing data sets and compared its performance with two well-known state-of-the-art methods. In almost all cases, F-Motif identified all statistically significant motifs extracted by the state-of-the-art methods. More importantly, F-Motif also uncovers several novel motifs. Using clues from the literature, we show that most of these new motifs discovered by F-Motif are indeed novel. We also observed some interesting phenomena. For example, for CK2 kinase, the conserved sites appear only on the right side of S, whereas for CDK kinase, the site adjacent to S on the right is conserved with residue P. In addition, three different encoding methods were used, including a novel position contrast matrix (PCM) and the simplest binary coding, and the ability of F-Motif to discover motifs remains quite robust with respect to the encoding scheme.

Conclusions: The iterative algorithm proposed here uses exploratory data analysis to discover motifs from phosphorylation data. The effectiveness of F-Motif has been demonstrated on several real data sets as well as on a synthetic data set. The method is quite general and can also be used to find other types of motifs. We also provide a server for F-Motif at http://f-motif.classcloud.org/, http://bio.classcloud.org/f-motif/ or http://ymu.classcloud.org/f-motif/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. An illustration of extraction of different consensus sequence motifs by clustering process.**
A set of fixed-length sequences is represented by a sequence logo. Sequence logo (a) represents all of the PKA kinase substrates. The sequences in (a) are split into several clusters, and each cluster is then represented by its own sequence logo. Sequence logos (b)–(g) represent PKA kinase substrates in different clusters.
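The splitting of fixed-length sequences into clusters can be illustrated with a minimal sketch. It assumes the simplest binary (one-hot) encoding and a single run of plain k-means, the algorithm named in the "Overview of motif finding steps" caption. The peptides, the function names, and the deterministic farthest-point initialization are illustrative assumptions, not details taken from F-Motif.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Binary-encode a peptide: 20 indicator values per position."""
    return [1.0 if aa == res else 0.0 for res in seq for aa in AMINO_ACIDS]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=50):
    """Plain k-means; returns one cluster label per point.

    Initialization is farthest-point from points[0] (an assumption made
    here so the sketch is deterministic).
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy peptides with S at the fourth position in most sequences.
peptides = ["RRASV", "RRGSV", "RRASL", "EDSSP", "EDTSP", "EDSTP"]
labels = kmeans([one_hot(p) for p in peptides], k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each resulting cluster could then be summarized separately, which is what the per-cluster sequence logos in the figure depict.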

**Figure 2. Overview of motif finding steps.**
In Step 1, PCM encoding uses both the background data and the foreground data; PWM encoding uses the entire Phospho.ELM database in place of the background data; and binary encoding uses neither the foreground nor the background data. In Step 2, the k-means clustering algorithm is applied repeatedly to generate a composite motif list (*CML*). This *CML* is then used to generate the final list of motifs in a stepwise manner that enforces two criteria: the motif is statistically significant under a binomial distribution-based model, and the motif occurs at least M times in the present foreground data.
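The two filtering criteria can be sketched as follows. The caption does not give the exact scoring model, so this example makes an assumption: the motif's match rate in the background data is taken as the success probability of a binomial distribution, and the p-value is the upper tail probability of seeing at least the observed number of foreground matches. The names `keep_motif`, `min_count` (standing in for M), and `alpha`, as well as the toy sequences, are hypothetical.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def matches(motif, seq):
    """True if seq agrees with motif at every non-wildcard ('.') position."""
    return all(m == "." or m == s for m, s in zip(motif, seq))

def keep_motif(motif, foreground, background, min_count, alpha):
    """Apply both criteria: significance and minimum foreground frequency."""
    fg_hits = sum(matches(motif, s) for s in foreground)
    # Assumption: background match rate is the binomial success probability.
    bg_rate = sum(matches(motif, s) for s in background) / len(background)
    p_value = binomial_tail(len(foreground), fg_hits, bg_rate)
    return fg_hits >= min_count and p_value <= alpha, fg_hits, p_value

# Hypothetical toy data: 5 foreground and 10 background peptides.
foreground = ["RRASV", "RRGSL", "KRASV", "RRLSI", "AAGSP"]
background = ["RRTSA", "AGLSP", "KKESD", "PLDSE", "GGASQ",
              "TTPSP", "EEDSD", "LLKSA", "VVRSG", "QQNSN"]
keep, hits, p_value = keep_motif("RR.S.", foreground, background,
                                 min_count=3, alpha=0.01)
print(keep, hits, round(p_value, 5))
```

In this toy case the motif matches 3 of 5 foreground sequences against a background rate of 0.1, so it passes both checks at the chosen thresholds.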

**Figure 3. An illustration of how a potential motif is extracted from a cluster.**
First, for every position the frequency of each residue is counted. Then for each position the residue with the highest frequency is noted. If more than one residue has the same highest frequency, one of them is chosen at random. In the next stage, sites whose top residue has frequency ≥T are considered conserved sites and are used to generate a potential motif.
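The extraction procedure described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: T is treated as a raw count (it could equally be a proportion), non-conserved sites are written as `.`, and the cluster contents are hypothetical toy peptides.

```python
from collections import Counter
import random

def extract_motif(cluster, threshold, rng=None):
    """Return a potential motif for the equal-length sequences in `cluster`.

    A site is conserved when its most frequent residue occurs at least
    `threshold` times (T as a raw count is an assumption); other sites
    are written as '.'.
    """
    rng = rng or random.Random(0)
    motif = []
    for pos in range(len(cluster[0])):
        counts = Counter(seq[pos] for seq in cluster)
        top = max(counts.values())
        # Ties for the highest frequency are broken at random.
        residue = rng.choice(sorted(r for r, c in counts.items() if c == top))
        motif.append(residue if top >= threshold else ".")
    return "".join(motif)

# Hypothetical cluster of four peptides.
cluster = ["RRASV", "RRGSL", "KRASV", "RRLSI"]
print(extract_motif(cluster, threshold=3))  # RR.S.
```

Lowering the threshold makes more sites count as conserved: with `threshold=2` the same cluster yields `RRASV`.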


