PLoS One. 2010 Nov 15;5(11):e15411.
doi: 10.1371/journal.pone.0015411.

Identifying human kinase-specific protein phosphorylation sites by integrating heterogeneous information from various sources

Tingting Li et al. PLoS One. 2010.

Abstract

Phosphorylation is an important type of protein post-translational modification. Identification of possible phosphorylation sites of a protein is important for understanding its functions. Unbiased screening for phosphorylation sites by in vitro or in vivo experiments is time-consuming and expensive; in silico prediction can provide functional candidates and help narrow down the experimental efforts. Most of the existing prediction algorithms take only the polypeptide sequence around the phosphorylation sites into consideration. However, protein phosphorylation is a very complex biological process in vivo. The polypeptide sequences around the potential sites are not sufficient to determine the phosphorylation status of those residues. In the current work, we integrated various data sources such as protein functional domains, protein subcellular location and protein-protein interactions, along with the polypeptide sequences, to predict protein phosphorylation sites. The heterogeneous information significantly boosted the prediction accuracy for some kinase families. To demonstrate a potential application of our method, we scanned a set of human proteins and predicted putative phosphorylation sites for the Cyclin-dependent kinase, Casein kinase 2, Glycogen synthase kinase 3, Mitogen-activated protein kinase, protein kinase A, and protein kinase C families (available at http://cmbi.bjmu.edu.cn/huphospho). The predicted phosphorylation sites can serve as candidates for further experimental validation. Our strategy may also be applicable to the in silico identification of other post-translational modification substrates.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Background protein set (white) and known phosphorylation substrate (grey) score distributions for a) CDK and b) MAPK kinase families.
The horizontal axis is the log-odds ratio score and the vertical axis is the percentage of proteins with corresponding scores.
Figure 2. Workflow of the cross-validation test for each kinase family.
Before cross-validation, known phosphorylation sequences with more than 70% sequence identity are removed. Then 4/5 of the positive samples are used as training data and the remaining 1/5 as testing data. Over-represented or under-represented functional features for the substrates of each kinase are identified with hypergeometric distributions, using only the training data. The negative samples are randomly selected from the background set. To avoid high sequence similarity in the negative set, a randomly selected sequence is discarded if it shares more than 70% sequence identity with any previously selected sequence. The negative sample sizes are the same as the positive sample sizes, and the negative samples are likewise split 4/5 for training and 1/5 for testing. Finally, for the same sample sets, different feature groups are integrated and trained/tested one at a time. Here “sequence” represents sequence and structure features; “KEGG” represents sequence, structure and significant KEGG features; “GO BP” represents sequence, structure and significant GO Biological Process features; “GO CC” represents sequence, structure and significant GO Cellular Component features; “GO MF” represents sequence, structure and significant GO Molecular Function features; “PFAM” represents sequence, structure and significant Pfam domain features; “IPR” represents sequence, structure and significant InterPro domain features; “STRING” represents sequence, structure and significant STRING PPI features; “ALL” represents an integration of all the above features.
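
The 4/5 versus 1/5 split and the hypergeometric feature test in the Figure 2 workflow can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the one-sided upper-tail form of the test, the choice of background as the reference population, and the example counts are all assumptions made for illustration. The record above says only that hypergeometric distributions were applied to the training data.

```python
import random
from math import comb


def split_four_fifths(samples, seed=0):
    """Shuffle samples and return (training 4/5, testing 1/5).

    Hypothetical helper; the seed is only there to make the sketch repeatable.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (4 * len(shuffled)) // 5
    return shuffled[:cut], shuffled[cut:]


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the reference population (assumed: the background set)
    K: population proteins annotated with the feature
    n: training substrates drawn from that population
    k: training substrates annotated with the feature
    A small value suggests over-representation; under-representation
    would use the lower tail P(X <= k) instead.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    train, test = split_four_fifths(range(50))
    print(len(train), len(test))
    p = hypergeom_upper_tail(k=12, K=100, n=len(train), N=1000)
    print(f"over-representation p-value: {p:.3g}")
```

Computing the test on the training portion alone, as the workflow describes, keeps the feature selection from seeing the held-out 1/5 and so avoids inflating the cross-validation estimate.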
