Development of an accurate classification system of proteins into structured and unstructured regions that uncovers novel structural domains: its application to human transcription factors

Satoshi Fukuchi¹, Keiichi Homma, Yoshiaki Minezaki, Takashi Gojobori, Ken Nishikawa

PMID: 19402914
PMCID: PMC2687452
DOI: 10.1186/1472-6807-9-26

Satoshi Fukuchi et al. BMC Struct Biol. 2009 Apr 30;9:26. doi: 10.1186/1472-6807-9-26.

Affiliation

¹ Center for Information Biology & DNA Data Bank of Japan, National Institute of Genetics, Mishima, Shizuoka, Japan. sfukuchi@genes.nig.ac.jp

Abstract

Background: In addition to structural domains, most eukaryotic proteins possess intrinsically disordered (ID) regions. Although ID regions often play important functional roles, their accurate identification is difficult. As human transcription factors (TFs) constitute a typical group of proteins with long ID regions, we regarded them as a model of all proteins and attempted to accurately classify TFs into structural domains and ID regions. Although our previous investigation detected an extremely high fraction of ID regions in human TFs, in addition to DNA-binding and/or other domains, 20% of the residues were left unassigned. In this report, we exploit the generally higher sequence divergence in ID regions than in structural regions to completely divide proteins into structural domains and ID regions.

Results: The new dichotomic system first identifies domains of known structure, and then assigns structural domains and ID regions with a combination of pre-existing tools and a newly developed program based on sequence divergence, taking unaligned regions into account. The system was found to be highly accurate: its application to a set of proteins with experimentally verified ID regions yielded an error rate as low as 2%. Application of this system to human TFs (401 proteins) showed that 38% of the residues were in structural domains, while 62% were in ID regions. This preponderance of ID regions contrasts sharply with the TFs of Escherichia coli (229 proteins), in which only 5% of the residues fell in ID regions. The method also revealed that 4.0% and 11.8% of the total length in human and E. coli TFs, respectively, consist of structural domains whose structures have not been determined.
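
The dichotomic idea (every residue ends up either in a structural domain or in an ID region, with domains of known structure assigned first) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `label_residues` and `pooled_fractions`, the interval inputs, and the label strings are hypothetical, and the real system derives its intervals from homology searches and disorder predictors rather than taking them as given.

```python
from collections import Counter

def label_residues(length, known_domains, predicted_domains):
    """Return one label per residue: 'known', 'cryptic', or 'ID'.

    Intervals are 1-based inclusive (start, end) pairs (an assumed input
    format). Domains of known structure take priority over predicted
    ones, and any residue not covered by a domain is labeled ID, so no
    residue is left unassigned.
    """
    labels = ["ID"] * length
    for start, end in predicted_domains:
        for i in range(max(start, 1) - 1, min(end, length)):
            labels[i] = "cryptic"
    # Known-structure domains are written last so they override overlaps.
    for start, end in known_domains:
        for i in range(max(start, 1) - 1, min(end, length)):
            labels[i] = "known"
    return labels

def pooled_fractions(label_lists):
    """Pool residues over all proteins and return the fraction per label."""
    counts = Counter()
    for labels in label_lists:
        counts.update(labels)
    total = sum(counts.values())
    return {key: counts[key] / total for key in ("known", "cryptic", "ID")}

# Toy protein of 10 residues: known domain at 1-3, predicted domain at 3-5.
toy = label_residues(10, known_domains=[(1, 3)], predicted_domains=[(3, 5)])
print(pooled_fractions([toy]))  # {'known': 0.3, 'cryptic': 0.2, 'ID': 0.5}
```

Pooling residues across proteins, rather than averaging per-protein fractions, matches the way the percentages above are stated (fractions of residues or of total length).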

Conclusion: The present system verifies that sequence divergence, including information on unaligned regions, is a good indicator of ID regions. The system for the first time estimates the complete partition of human TFs into structured and unstructured regions, also revealing structural domains without homology to known structures. These predicted novel structural domains are good targets for structural genomics. When applied to other proteins, the system is expected to uncover more novel structural domains.

Figures

**Figure 1**
**Sequence alignment pattern of a protein with a long ID region**. At the top, the domain structure, the sequence of the human androgen receptor (hAR), and a residue number scale are presented. Below them, some of the high-scoring homologues found by a BLAST search conducted with hAR as query against Swiss-Prot are presented, where the solid bars represent aligned segments and the dotted ones represent unaligned segments. The stars mark mammalian orthologues; the N-terminal sections of paralogues, such as the progesterone receptor and the glucocorticoid receptor, cannot be aligned to hAR. The bar representation output of the BLAST server was modified.
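
The alignment pattern described in this caption can be summarized per residue as the fraction of homologues whose aligned segments cover that position. The sketch below is only an illustration of that idea under assumed inputs; it is not the score used by the authors' program, and the function name `aligned_fraction` and the segment format are hypothetical.

```python
def aligned_fraction(query_length, hits):
    """Fraction of homologues aligned at each query position.

    `hits` holds one list per homologue, each containing 1-based
    inclusive (start, end) segments on the query that were aligned
    (the solid bars); positions outside every segment count as
    unaligned (the dotted bars).
    """
    counts = [0] * query_length
    for segments in hits:
        covered = set()
        for start, end in segments:
            # Clip to the query so out-of-range segments cannot overflow.
            covered.update(range(max(start, 1) - 1, min(end, query_length)))
        for i in covered:
            counts[i] += 1
    n = len(hits)
    if n == 0:
        return [0.0] * query_length
    return [c / n for c in counts]

# Toy query of 6 residues: one homologue aligns over the full length,
# the other only over the C-terminal half.
print(aligned_fraction(6, [[(1, 6)], [(4, 6)]]))
# [0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
```

Low values at the N-terminus in such a profile correspond to the situation shown for hAR, where paralogues cannot be aligned to the N-terminal section.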

**Figure 2**
**Schematic illustration of the DICHOT system**. Structural domain and ID region assignments by different methods are presented at the top, the status boxes are displayed in the middle to illustrate the classifications after the corresponding steps, and a flow chart is shown in the lower half. Data processing proceeds from left to right. In the upper-most four rows, results of trans-membrane assignments, structural domain (SD) searches, DISOPRED2 prediction, and CLADIST prediction of a hypothetical query sequence are depicted, with the vertical dotted lines marking the N- and C- termini of the query. The blue, green, and purple bars respectively represent a trans-membrane region, regions structurally aligned by homology searches, and ID regions predicted by DISOPRED2, while the alternating purple and light green segments signify the ID regions and structural domains predicted by CLADIST, respectively. The red and gray bars stand for known domains and ID regions, respectively, while the orange section denotes a cryptic domain.

**Figure 3**
**Fractions of human and *E. coli* TFs occupied by structural domains and ID regions**. a) Overall statistics of structural domains and ID regions in human and *E. coli* TFs. The red, orange, and gray sectors represent the fractions of residues in known structural domains, cryptic domains, and ID regions, respectively. b) Histograms of TFs sorted according to fraction ranges occupied by structural domains and cumulative frequencies. The red bars show the frequency, while the black lines connecting dots represent the cumulative frequencies. The fractions of structural domains are plotted along the x axis. The scale on the left is for the number of TFs, while the right scale is for the cumulative frequency.

**Figure 4**
**Examples of structural domain and ID region assignments**. Structural domain and ID region assignments to four human TFs are presented. From top to bottom, each diagram consists of a scale with the total number of amino acid residues, assignments in the previous report, assignments in this study, and domain architecture from the literature. In the previous assignments, structural domains, ID regions, and un-assigned sections are presented in green, gray, and white, respectively. In the present assignments, domains of known structure, cryptic structural domains, and ID regions are respectively colored in red, orange, and dark gray. In the domain architecture derived from the literature, pink boxes represent DBD, while open rectangles and thick lines with letters stand for functional domains, which do not necessarily correspond to structural domains.
