PLoS One. 2010 Sep 15;5(9):e12460.

doi: 10.1371/journal.pone.0012460.

Classification of protein kinases on the basis of both kinase and non-kinase regions

Juliette Martin¹, Krishanpal Anamika, Narayanaswamy Srinivasan

PMID: 20856812
PMCID: PMC2939887
DOI: 10.1371/journal.pone.0012460

Juliette Martin et al. PLoS One. 2010.

Affiliation

¹ Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India. juliette.martin@ibcp.fr

Abstract

Background: Protein phosphorylation is a generic way to regulate signal transduction pathways in all kingdoms of life. In many organisms, it is achieved by the large family of Ser/Thr/Tyr protein kinases which are traditionally classified into groups and subfamilies on the basis of the amino acid sequence of their catalytic domains. Many protein kinases are multi-domain in nature but the diversity of the accessory domains and their organization are usually not taken into account while classifying kinases into groups or subfamilies.

Methodology: Here, we present an approach which considers amino acid sequences of complete gene products, in order to suggest refinements in sets of pre-classified sequences. The strategy is based on alignment-free similarity scores and iterative Area Under the Curve (AUC) computation. Similarity scores are computed by detecting common patterns between two sequences and scoring them using a substitution matrix, with a consistent normalization scheme. This allows us to handle full-length sequences, and implicitly takes into account domain diversity and domain shuffling. We quantitatively validate our approach on a subset of 212 human protein kinases. We then employ it on the complete repertoire of human protein kinases and suggest a few qualitative refinements in the subfamily assignment stored in the KinG database, which is based on catalytic domains only. Based on our new measure, we delineate 37 cases of potential hybrid kinases: sequences for which classical classification based entirely on catalytic domains is inconsistent with the full-length similarity scores computed here, which implicitly consider multi-domain nature and regions outside the catalytic kinase domain. We also provide some examples of hybrid kinases of the protozoan parasite Entamoeba histolytica.
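The idea of an alignment-free, normalized similarity score can be illustrated with a minimal Python sketch. This is not the Local Matching Score used in the paper: it is a simplified stand-in that treats exact shared k-mers as the "common patterns", weights each by its self-substitution score, and normalizes by the self-similarities of the two sequences. The function names, the choice k = 3, and the `SELF_SCORE` table (diagonal values in the style of BLOSUM62) are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Self-substitution scores per residue, in the style of the BLOSUM62 diagonal.
# Illustrative values only; the paper's actual scoring scheme may differ.
SELF_SCORE = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6, "H": 8,
    "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4, "T": 5, "W": 11,
    "Y": 7, "V": 4,
}

def kmer_counts(seq, k=3):
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def raw_score(a, b, k=3):
    """Sum the scores of all k-mers shared by a and b, weighted by their counts."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(
        ca[w] * cb[w] * sum(SELF_SCORE.get(r, 1) for r in w)
        for w in ca.keys() & cb.keys()
    )

def similarity(a, b, k=3):
    """Normalized score: 1.0 for identical sequences, 0.0 when no k-mer is shared."""
    denom = sqrt(raw_score(a, a, k) * raw_score(b, b, k))
    return raw_score(a, b, k) / denom if denom else 0.0

def distance(a, b, k=3):
    """Distance derived from the normalized similarity."""
    return 1.0 - similarity(a, b, k)
```

Because no alignment is required, two sequences that share a catalytic region but carry different accessory regions receive a lower score than two sequences that share the catalytic region alone, which is the behavior the full-length comparison relies on.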

Conclusions: The implicit consideration of multi-domain architectures is a valuable inclusion to complement other classification schemes. The proposed algorithm may also be employed to classify other families of enzymes with multi-domain architecture.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Sequence alignment of selected protein pairs.**
A: Sequences ENSP00000266970 and ENSP00000293215. B: Sequences ENSP00000281821 and ENSP00000350896. Identities are indicated by black background. Pfam domains are indicated by colored boxes: red = catalytic domains, magenta = domains detected in both proteins, blue = domain detected in only one protein. Abbreviations used: Ephrin_lbd = Ephrin receptor ligand binding domain, GCC = GCC2 and GCC3 domain, fn3 = fibronectin type III domain, Pkinase = protein kinase domain, SAM = sterile alpha motif domain (type SAM_1 is detected in ENSP00000350896 and type SAM_2 is detected in ENSP00000281821). Global sequence alignment is obtained using the Needleman-Wunsch algorithm. For each pair of sequences, the different distances are indicated at the bottom of the alignment. Image generated using ESPript software.

**Figure 2. Comparison between the different distances computed between protein sequences of the validation data set.**
LMScat: LMS distances between catalytic domains, LMSfull: LMS distances between full-length sequences, IDcat: identity distances between catalytic domains, BLOSUMcat: BLOSUM distances between catalytic domains, BLOSUMfull: BLOSUM distances between full-length sequences. The lower panel reports the Spearman rank correlation coefficients between different distances.
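As an illustration of how two distance measures can be compared by Spearman rank correlation, the following Python sketch ranks two lists of pairwise distances (with tied values sharing their average rank) and computes the Pearson correlation of the ranks. The function names are hypothetical and the code is not taken from the paper.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In practice, `x` and `y` would hold the same sequence pairs measured with two different distances, for example the upper triangles of two distance matrices flattened in the same order.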

**Figure 3. Assessment of different distances to detect homogeneous clusters in the validation data set.**
A: each distance matrix is used as input to hierarchical clustering; clusters are extracted from the resulting trees and assessed by the biological homogeneity index (BHI); B: evolution of BHI according to the number of clusters. LMScat: LMS distances computed from catalytic domains, LMSfull: LMS distances computed from full-length sequences, BLOSUMcat: BLOSUM distances computed from catalytic domains, BLOSUMfull: BLOSUM distances computed from full-length sequences, IDcat: identity distances computed from catalytic domains. Horizontal and vertical lines indicate respectively BHI = 1 and number of clusters equal to 17.
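A homogeneity index of this kind can be sketched in Python as the fraction of within-cluster sequence pairs that share the same annotated class, averaged over clusters. The caption does not give the exact BHI formula, so the definition below is an assumption; in particular, skipping singleton clusters is a choice made here for illustration, and `homogeneity_index` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

def homogeneity_index(cluster_ids, class_labels):
    """Mean, over clusters, of the fraction of member pairs with equal class labels."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)

    per_cluster = []
    for labels in members.values():
        if len(labels) < 2:
            continue  # assumption: a singleton cluster has no pairs to assess
        pairs = list(combinations(labels, 2))
        per_cluster.append(sum(a == b for a, b in pairs) / len(pairs))

    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0
```

Under this definition, a value of 1 means that every cluster contains sequences of a single class, which matches the horizontal reference line at BHI = 1.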

**Figure 4. AUC distributions obtained on the human kinome.**
A: AUC obtained using the iterative procedure starting from BLOSUM full-length distances, B: AUC obtained using the iterative procedure starting from LMS full-length distances. The vertical red line indicates the cut-off for the detection of hybrid kinases.

**Figure 5. Computation of Local Matching Score (LMS) between two sequences without alignment.**

**Figure 6. Detection of outliers in a pre-classified data set.**
1: the distance matrix and initial weights are used to compute AUC values for each sequence using equation 6; 2: sequence weights are updated using equation 7; 3: the procedure is iterated until convergence; 4: the final AUC values are used to compute a histogram; 5: the histogram shape is used to detect outliers.
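The loop described in steps 1 to 3 can be sketched in Python as follows. Equations 6 and 7 are not reproduced in this text, so both formulas below are assumptions: the AUC of a sequence is taken as the weighted fraction of (same-class, other-class) pairs in which the same-class sequence is closer, and the weight update simply sets each weight to the sequence's current AUC. The function names, the tolerance, and the iteration cap are likewise hypothetical.

```python
def weighted_auc(i, dist, labels, weights):
    """Weighted AUC of sequence i: same-class sequences should rank closer than others."""
    num = den = 0.0
    for p in range(len(labels)):
        if p == i or labels[p] != labels[i]:
            continue  # p must be another member of i's class
        for n in range(len(labels)):
            if labels[n] == labels[i]:
                continue  # n must belong to a different class
            pair_weight = weights[p] * weights[n]
            den += pair_weight
            if dist[i][p] < dist[i][n]:
                num += pair_weight
            elif dist[i][p] == dist[i][n]:
                num += 0.5 * pair_weight
    return num / den if den > 0 else 0.5  # no usable pairs: uninformative value

def iterate_auc(dist, labels, tol=1e-6, max_iter=100):
    """Recompute AUCs and weights until the values stop changing (or max_iter is hit)."""
    weights = [1.0] * len(labels)
    aucs = list(weights)
    for _ in range(max_iter):
        aucs = [weighted_auc(i, dist, labels, weights) for i in range(len(labels))]
        if max(abs(a - w) for a, w in zip(aucs, weights)) < tol:
            break
        weights = aucs  # assumed update rule: weight = current AUC
    return aucs
```

Down-weighting sequences with a low AUC limits the influence of putatively misclassified sequences on the AUC of their neighbors. For steps 4 and 5, a fixed cut-off on the final values, for example `[i for i, a in enumerate(aucs) if a < cutoff]` with a hypothetical `cutoff`, could stand in for reading the threshold off the histogram shape.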

**Figure 7. Examples of classification curves.**
A: a putatively well classified sequence, B: a putatively misclassified sequence. AUC denotes the area under the curve.
