BMC Struct Biol. 2009 Apr 30;9:26.
doi: 10.1186/1472-6807-9-26.

Development of an accurate classification system of proteins into structured and unstructured regions that uncovers novel structural domains: its application to human transcription factors

Satoshi Fukuchi et al. BMC Struct Biol.

Abstract

Background: In addition to structural domains, most eukaryotic proteins possess intrinsically disordered (ID) regions. Although ID regions often play important functional roles, their accurate identification is difficult. As human transcription factors (TFs) constitute a typical group of proteins with long ID regions, we treated them as a model for proteins in general and attempted to accurately classify TFs into structural domains and ID regions. Our previous investigation detected an extremely high fraction of ID regions in human TFs outside the DNA-binding and other domains, but left 20% of the residues unassigned. In this report, we exploit the generally higher sequence divergence in ID regions than in structural regions to completely divide proteins into structural domains and ID regions.

Results: The new dichotomic system first identifies domains of known structure, then assigns structural domains and ID regions using a combination of pre-existing tools and a newly developed program based on sequence divergence that takes unaligned regions into consideration. The system was found to be highly accurate: when applied to a set of proteins with experimentally verified ID regions, its error rate was as low as 2%. Application of this system to human TFs (401 proteins) showed that 38% of the residues were in structural domains, while 62% were in ID regions. This preponderance of ID regions contrasts sharply with the TFs of Escherichia coli (229 proteins), in which only 5% of the residues fell in ID regions. The method also revealed that 4.0% and 11.8% of the total length of human and E. coli TFs, respectively, consists of structural domains whose structures have not been determined.
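The residue-level fractions reported above can be illustrated with a minimal sketch, assuming each protein comes with a length and a list of structural domain intervals. The `Protein` class, the `residue_fractions` function, and the example coordinates are hypothetical and are not taken from the paper; the sketch only shows the dichotomic bookkeeping, in which every residue not covered by a structural domain is counted as ID.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive residue range


@dataclass
class Protein:
    """Hypothetical container for one protein's assignments."""
    length: int
    known: List[Interval] = field(default_factory=list)    # domains of known structure
    cryptic: List[Interval] = field(default_factory=list)  # predicted domains, structure unknown


def residue_fractions(proteins: List[Protein]) -> Dict[str, float]:
    """Pool residues over all proteins and return the fraction in each class."""
    totals = {"known": 0, "cryptic": 0, "id": 0}
    for p in proteins:
        # Dichotomic default: a residue is ID unless a domain covers it.
        state = ["id"] * p.length
        # Known domains are written last so they win over cryptic ones on overlap.
        for label, intervals in (("cryptic", p.cryptic), ("known", p.known)):
            for start, end in intervals:
                if not 1 <= start <= end <= p.length:
                    raise ValueError(f"interval {(start, end)} outside 1..{p.length}")
                for i in range(start - 1, end):
                    state[i] = label
        for label in state:
            totals[label] += 1
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no residues to count")
    return {label: count / n for label, count in totals.items()}


# Made-up example: two proteins of 100 residues each.
example = [
    Protein(length=100, known=[(1, 30)], cryptic=[(41, 50)]),
    Protein(length=100, known=[(51, 70)]),
]
fractions = residue_fractions(example)
```

Pooling residues over all proteins, rather than averaging per-protein fractions, matches the way the percentages above are phrased ("% of the residues"), though the paper's exact procedure is not given in this abstract.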

Conclusion: The present system verifies that sequence divergence, including information on unaligned regions, is a good indicator of ID regions. For the first time, the system provides a complete estimate of the fractions of structured and unstructured regions in human TFs, and it also reveals structural domains without homology to known structures. These predicted novel structural domains are good targets for structural genomics. When applied to other proteins, the system is expected to uncover more novel structural domains.
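The idea that sequence divergence, with unaligned regions counted as divergent, can flag ID regions may be sketched as follows. This is a toy illustration only and is not the authors' program: the function names, the scoring rule (fraction of homologues that differ from the query or are unaligned at a position), the window size, and the threshold are all assumptions chosen for clarity.

```python
from typing import List


def divergence_profile(query: str, homologues: List[str]) -> List[float]:
    """Per-residue fraction of homologues that differ from the query.

    Each homologue row must have the query's length; '-' marks positions
    that could not be aligned, and these count as divergent.
    """
    if not homologues:
        raise ValueError("at least one homologue row is required")
    for row in homologues:
        if len(row) != len(query):
            raise ValueError("every row must have the same length as the query")
    profile = []
    for i, residue in enumerate(query):
        differing = sum(1 for row in homologues if row[i] != residue)
        profile.append(differing / len(homologues))
    return profile


def label_regions(profile: List[float], threshold: float = 0.5, window: int = 3) -> str:
    """Smooth with a centred moving average; 'D' = divergent (ID-like), 'S' = conserved."""
    half = window // 2
    labels = []
    for i in range(len(profile)):
        lo, hi = max(0, i - half), min(len(profile), i + half + 1)
        mean = sum(profile[lo:hi]) / (hi - lo)
        labels.append("D" if mean > threshold else "S")
    return "".join(labels)


# Made-up query and homologue rows; the second row is unaligned after position 5.
query = "MKTAYIAKQR"
rows = ["MKTAYLSRNE", "MKTAY-----", "MKSAYVGHQR"]
labels = label_regions(divergence_profile(query, rows))
```

With these made-up rows, the conserved N-terminal half is labelled `S` and the divergent, partly unaligned C-terminal half is labelled `D`. A real classifier would need substitution-aware scoring and calibrated thresholds, which this sketch does not attempt.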

Figures

Figure 1
Sequence alignment pattern of a protein with a long ID region. At the top, the domain structure, the sequence of the human androgen receptor (hAR), and a residue number scale are presented. Below them, some of the high-scoring homologues found by a BLAST search of Swiss-Prot with hAR as the query are presented; the solid bars represent aligned segments and the dotted bars represent unaligned segments. The star marks mammalian orthologues; the N-terminal sections of paralogues, such as the progesterone receptor and the glucocorticoid receptor, cannot be aligned to hAR. The figure is a modified version of the bar representation output by the BLAST server.
Figure 2
Schematic illustration of the DICHOT system. Structural domain and ID region assignments by different methods are presented at the top, the status boxes in the middle illustrate the classifications after the corresponding steps, and a flow chart is shown in the lower half. Data processing proceeds from left to right. The uppermost four rows depict the results of trans-membrane assignments, structural domain (SD) searches, DISOPRED2 prediction, and CLADIST prediction for a hypothetical query sequence, with the vertical dotted lines marking the N- and C-termini of the query. The blue, green, and purple bars respectively represent a trans-membrane region, regions structurally aligned by homology searches, and ID regions predicted by DISOPRED2, while the alternating purple and light green segments signify the ID regions and structural domains predicted by CLADIST, respectively. The red and gray bars stand for known domains and ID regions, respectively, while the orange section denotes a cryptic domain.
Figure 3
Fractions of human and E. coli TFs occupied by structural domains and ID regions. a) Overall statistics of structural domains and ID regions in human and E. coli TFs. The red, orange, and gray sectors represent the fractions of residues in known structural domains, cryptic domains, and ID regions, respectively. b) Histograms of TFs sorted according to fraction ranges occupied by structural domains and cumulative frequencies. The red bars show the frequency, while the black lines connecting dots represent the cumulative frequencies. The fractions of structural domains are plotted along the x axis. The scale on the left is for the number of TFs, while the right scale is for the cumulative frequency.
Figure 4
Examples of structural domain and ID region assignments. Structural domain and ID region assignments to four human TFs are presented. From top to bottom, each diagram consists of a scale with the total number of amino acid residues, assignments in the previous report, assignments in this study, and domain architecture from the literature. In the previous assignments, structural domains, ID regions, and un-assigned sections are presented in green, gray, and white, respectively. In the present assignments, domains of known structure, cryptic structural domains, and ID regions are respectively colored in red, orange, and dark gray. In the domain architecture derived from the literature, pink boxes represent DBD, while open rectangles and thick lines with letters stand for functional domains, which do not necessarily correspond to structural domains.
