Mol Syst Biol. 2018 May 14;14(5):e8190.
doi: 10.15252/msb.20188190.

High-throughput discovery of functional disordered regions: investigation of transactivation domains


Charles NJ Ravarani et al. Mol Syst Biol. 2018.

Abstract

Over 40% of the proteins encoded by any eukaryotic genome contain intrinsically disordered regions (IDRs) that do not adopt defined tertiary structures. Certain IDRs perform critical functions, but discovering them is non-trivial because their function depends on the biological context. We present IDR-Screen, a framework to discover functional IDRs in a high-throughput manner by simultaneously assaying large numbers of DNA sequences that code for short disordered sequences. Functionality-conferring patterns in their protein sequence are inferred through statistical learning. Using a yeast HSF1 transcription factor-based assay, we discovered IDRs that function as transactivation domains (TADs) by screening a random sequence library and a designed library consisting of variants of 13 diverse TADs. Using machine learning, we find that segments devoid of positively charged residues but with redundant short sequence patterns of negatively charged and aromatic residues are a generic feature for TAD functionality. We anticipate that investigating defined sequence libraries using IDR-Screen for specific functions can facilitate the discovery of novel functional regions of the disordered proteome and help to understand the impact of natural and disease variants in disordered segments.
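
To make the sequence pattern described in the abstract concrete, the following is a minimal sketch of how charged and aromatic residues in a peptide could be counted. It is not the feature set used by IDR-Screen: the residue classes (K/R as positive, D/E as negative, F/W/Y as aromatic), the function names, the `max_gap` parameter, and the example peptide are all assumptions made for illustration.

```python
# Illustrative residue classes. Treating only K/R as positive and F/W/Y as
# aromatic is an assumption of this sketch, not the paper's feature set.
POSITIVE = set("KR")
NEGATIVE = set("DE")
AROMATIC = set("FWY")


def composition_features(seq: str) -> dict:
    """Count charged and aromatic residues in a peptide sequence."""
    seq = seq.upper()
    n_pos = sum(aa in POSITIVE for aa in seq)
    n_neg = sum(aa in NEGATIVE for aa in seq)
    n_aro = sum(aa in AROMATIC for aa in seq)
    return {
        "length": len(seq),
        "positive": n_pos,
        "negative": n_neg,
        "aromatic": n_aro,
        "net_charge": n_pos - n_neg,
    }


def acidic_aromatic_pairs(seq: str, max_gap: int = 3) -> int:
    """Count negative/aromatic residue pairs at most max_gap positions apart."""
    seq = seq.upper()
    count = 0
    for i, a in enumerate(seq):
        for j in range(i + 1, min(i + max_gap + 1, len(seq))):
            b = seq[j]
            if (a in NEGATIVE and b in AROMATIC) or (a in AROMATIC and b in NEGATIVE):
                count += 1
    return count


example = "DDFDWEEYDL"  # made-up peptide, not taken from the study
print(composition_features(example))
print(acidic_aromatic_pairs(example))
```

A sequence with no positive residues and many closely spaced acidic/aromatic pairs would match the qualitative description above; the counts themselves carry no threshold for functionality.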

Keywords: high‐throughput screen; intrinsically disordered protein; machine learning; mutational scanning; transactivation domain.

Figures

Figure 1. Outline of IDR‐Screen
IDR‐Screen consists of a modular set of stages that can broadly be grouped into experimental and computational phases. A library of random or designed sequences is transformed into a cell population and expressed as part of a protein that is used for selection (survival or other readouts such as fluorescence). In this manner, the library is screened to discover sequences that are functional or non‐functional in the designed assay. After data processing, this dataset of experimentally validated functional and non‐functional sequences is analyzed with machine‐learning (ML) approaches to learn the rules of functionality.
Figure 2. Analysis of functional and non‐functional sequences from the random library
  1. (A) Enrichment and depletion of different amino acids in the random library (log2 of the ratio of amino acid frequencies in functional over non‐functional sequences).

  2. (B–G) Boxplots of the distributions of length (B), pI (C), hydrophobicity (D), disorder content (E), and helicity (F) for sequences that are functional (green) and non‐functional (red). In the boxplots, the central line shows the median. Statistical significance was assessed using the Wilcoxon test; n values (sample size) and P‐values are provided on the right. (G) Enrichment of the 9‐aa TAD motif in functional versus non‐functional sequences: the ratio of sequences with to sequences without the 9‐aa TAD motif, in functional relative to non‐functional sequences, is (219/520)/(13384/50001).
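
The two quantities in this caption can be reproduced with a few lines of arithmetic. The sketch below assumes that the per-residue enrichment is the log2 ratio of pooled amino acid frequencies and takes the four counts for panel G directly from the caption; the function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def aa_frequencies(seqs):
    """Relative frequency of each amino acid across a list of sequences."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def log2_enrichment(functional, non_functional):
    """log2(freq in functional / freq in non-functional) per amino acid.

    Amino acids absent from either set are skipped to avoid log of zero.
    """
    f = aa_frequencies(functional)
    n = aa_frequencies(non_functional)
    return {aa: math.log2(f[aa] / n[aa]) for aa in f if aa in n}


# Toy sequences, invented for illustration only.
print(log2_enrichment(["DDF"], ["DKF"]))

# Panel G arithmetic, using the counts quoted in the caption.
motif_ratio = (219 / 520) / (13384 / 50001)
print(round(motif_ratio, 2))  # 1.57
```

A ratio above 1 means that sequences containing the motif are relatively more common among functional than among non‐functional sequences.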

Figure 3. The top 10 most important features of the machine‐learning models trained on the random library
Schematic describing the sequence space explored by the random library (left). Table listing the top 10 most important features. The relative feature importance is given as relative percentages in the last four columns. The size of the circles is scaled per method (lasso, ridge, xgboost, stacked). The direction column denotes the direction of enrichment of the given feature for functional sequences compared to non‐functional sequences (up, positive direction and down, negative direction). This figure provides a simplified description of the actual features, which are available in Table EV3.
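
As a rough illustration of what "relative percentages" and "direction" could mean for a single model, the sketch below rescales absolute importance scores to sum to 100 and reads the direction from the sign of a coefficient. This is an assumption about one plausible convention, not the procedure used in the study; the sign-based direction only makes sense for linear models such as lasso or ridge, and the feature names and coefficient values are made up.

```python
def relative_importance(scores):
    """Scale absolute importance scores so they sum to 100 percent."""
    total = sum(abs(v) for v in scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: 100.0 * abs(v) / total for name, v in scores.items()}


def direction(score):
    """Sign of a linear-model coefficient as an enrichment direction."""
    return "up" if score > 0 else "down" if score < 0 else "none"


# Made-up coefficients for three hypothetical features (not from Table EV3).
lasso_coefs = {"negative_charge": 1.5, "positive_charge": -2.0, "aromatic": 0.5}
for name, pct in relative_importance(lasso_coefs).items():
    print(name, round(pct, 1), direction(lasso_coefs[name]))
```
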
Figure 4. Mutational scanning of naturally occurring TADs and the top 10 features of the machine‐learning models trained on the design library
  1. Heatmap of the tolerance to amino acid substitutions in WT transactivation domain sequences. The tolerance of a mutation is defined as the fraction of functional sequences among all sequences carrying that specific substitution. The columns (the amino acid in a WT TAD that is substituted) are ordered by decreasing tolerance from left to right, and the rows (the amino acid that replaces a residue in the WT TAD) are ordered by decreasing tolerance from bottom to top. The cells are colored on a green‐to‐red gradient for high to low tolerance, respectively. Empty tiles represent data points not detected in the library.

  2. Schematic describing the sequence space explored by the design library (left). Table listing the top 10 most important features. The relative feature importance is given as relative percentages in the last four columns. The size of the circles is scaled per method (lasso, ridge, xgboost, stacked). The direction column denotes the direction of enrichment of the given feature for functional sequences compared to non‐functional sequences (up, positive direction and down, negative direction). This figure provides a simplified description of the actual features, which are available in Table EV7.
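
The tolerance defined for the heatmap is a simple fraction, which can be sketched as follows. The input format (tuples of wild-type residue, substituted residue, and a functional flag), the function name, and the toy observations are assumptions made for illustration and do not reflect the study's data files.

```python
from collections import defaultdict


def substitution_tolerance(observations):
    """Fraction of functional variants for each (wild-type, mutant) residue pair.

    observations: iterable of (wt_aa, mutant_aa, is_functional) tuples.
    Pairs never observed are simply absent from the result, mirroring the
    empty tiles in a heatmap.
    """
    functional = defaultdict(int)
    total = defaultdict(int)
    for wt, mut, ok in observations:
        total[(wt, mut)] += 1
        if ok:
            functional[(wt, mut)] += 1
    return {pair: functional[pair] / total[pair] for pair in total}


# Toy observations, invented for illustration only.
obs = [("F", "A", False), ("F", "A", False), ("F", "W", True),
       ("D", "E", True), ("D", "E", False)]
tol = substitution_tolerance(obs)
print(tol[("F", "A")], tol[("F", "W")], tol[("D", "E")])  # 0.0 1.0 0.5
```
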

Figure 5. The top 10 most important features of the machine‐learning models trained on the combined library
Schematic describing the sequence space explored by the combined library (left). Table listing the top 10 most important features. The relative feature importance is given as relative percentages in the last four columns. The size of the circles is scaled per method (lasso, ridge, xgboost, stacked). The direction column denotes the direction of enrichment of the given feature for functional sequences compared to non‐functional sequences (up, positive direction and down, negative direction). This figure provides a simplified description of the actual features, which are available in Table EV8.
Figure 6. Spot‐dilution assay of designed sequences
Spot assay of designed TAD constructs. The spot assay was performed at 30°C (left panel) and during heat shock at 37°C (right panel).
Figure 7. A mechanistic model for TAD function based on findings from this study
Transcription factors (TFs) interact with DNA via the DNA binding domain (DBD, blue triangle) and with their interaction partners (gray hexagon) via their transactivation domains (TAD, green rectangle). The enrichment for negatively charged residues leads to a local extended conformation of the TAD via intra‐chain repulsions, providing the appropriate context for aromatic residues to be exposed and to bind to their interaction partners in a hydrophobic binding pocket (circular inset). The aromatic residues could fit the pocket in a stochastic manner, binding in different configurations. This would result in the formation of a “fuzzy” complex. In this case, the negative charges could further contribute to binding affinity, given that the TAD‐interaction surfaces often expose positively charged patches. The absence of positively charged residues, the compositional bias, and the particular spacing of negatively charged and aromatic residues could hence be considered as giving rise to a collection of short mini‐motifs that collectively contribute to TAD functionality.
