Identifying human kinase-specific protein phosphorylation sites by integrating heterogeneous information from various sources

Tingting Li¹, Pufeng Du, Nanfang Xu

PMID: 21085571
PMCID: PMC2981550
DOI: 10.1371/journal.pone.0015411

Tingting Li et al. PLoS One. 2010 Nov 15;5(11):e15411.

Affiliation

¹ Department of Biomedical Informatics, Peking University Health Science Center, Beijing, China. litt@hsc.pku.edu.cn

Abstract

Phosphorylation is an important type of protein post-translational modification. Identification of possible phosphorylation sites of a protein is important for understanding its functions. Unbiased screening for phosphorylation sites by in vitro or in vivo experiments is time consuming and expensive; in silico prediction can provide functional candidates and help narrow down the experimental efforts. Most of the existing prediction algorithms take only the polypeptide sequence around the phosphorylation sites into consideration. However, protein phosphorylation is a very complex biological process in vivo. The polypeptide sequences around the potential sites are not sufficient to determine the phosphorylation status of those residues. In the current work, we integrated various data sources such as protein functional domains, protein subcellular location and protein-protein interactions, along with the polypeptide sequences to predict protein phosphorylation sites. The heterogeneous information significantly boosted the prediction accuracy for some kinase families. To demonstrate potential application of our method, we scanned a set of human proteins and predicted putative phosphorylation sites for Cyclin-dependent kinases, Casein kinase 2, Glycogen synthase kinase 3, Mitogen-activated protein kinases, protein kinase A, and protein kinase C families (available at http://cmbi.bjmu.edu.cn/huphospho). The predicted phosphorylation sites can serve as candidates for further experimental validation. Our strategy may also be applicable for the in silico identification of other post-translational modification substrates.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Background protein set (white) and known phosphorylation substrate (grey) score distributions for a) CDK and b) MAPK kinase families.**
The horizontal axis is the log-odds ratio score and the vertical axis is the percentage of proteins with corresponding scores.

**Figure 2. Workflow of the cross-validation test for each kinase family.**
Before cross-validation, known phosphorylation sequences with more than 70% sequence identity are removed. Then 4/5 of the positive samples are used as training data and the remaining 1/5 as testing data. Functional features that are over-represented or under-represented among the substrates of each kinase are identified using the hypergeometric distribution, based only on the training data. The negative samples are randomly selected from the background set. To avoid high sequence similarity within the negative set, a randomly selected sequence is discarded if it shares more than 70% sequence identity with any previously selected sequence. The negative sample size equals the positive sample size, and the negative samples are likewise split 4/5 for training and 1/5 for testing. Finally, for the same sample sets, the different feature groups are each integrated and trained/tested one at a time. Here “sequence” represents sequence and structure features; “KEGG” represents sequence, structure and significant KEGG features; “GO BP” represents sequence, structure and significant GO Biological Process features; “GO CC” represents sequence, structure and significant GO Cellular Component features; “GO MF” represents sequence, structure and significant GO Molecular Function features; “PFAM” represents sequence, structure and significant Pfam domain features; “IPR” represents sequence, structure and significant InterPro domain features; “STRING” represents sequence, structure and significant STRING PPI features; “ALL” represents an integration of all the above features.
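
The feature-selection and splitting steps in this workflow can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric tail probability is used to call a feature over-represented (upper tail) or under-represented (lower tail) among training substrates, and that the 4/5 and 1/5 partition is a simple random split. The function names, the random seed and the example counts are hypothetical and are not taken from the paper, which does not state a significance threshold here.

```python
import random
from math import comb


def hypergeom_upper_tail(population, annotated, drawn, hits):
    """P(X >= hits): chance of seeing at least `hits` annotated proteins when
    `drawn` proteins are sampled without replacement from `population`
    proteins, of which `annotated` carry the feature (over-representation)."""
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(hits, min(annotated, drawn) + 1)
    )
    return favourable / total


def hypergeom_lower_tail(population, annotated, drawn, hits):
    """P(X <= hits): the matching tail for under-representation."""
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(0, min(hits, annotated, drawn) + 1)
    )
    return favourable / total


def split_train_test(samples, seed=0):
    """Shuffle the samples and return (training 4/5, testing 1/5)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (4 * len(shuffled)) // 5
    return shuffled[:cut], shuffled[cut:]


# Hypothetical counts: 10 background proteins, 4 carry the feature,
# 3 training substrates, 2 of which carry the feature.
p_over = hypergeom_upper_tail(population=10, annotated=4, drawn=3, hits=2)
train, test = split_train_test(range(10))
```

In this sketch the enrichment test would be run only on the training portion returned by `split_train_test`, mirroring the caption's statement that features are selected from the training data alone.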
