J Mol Biol. 2022 Jan 30;434(2):167373.

doi: 10.1016/j.jmb.2021.167373. Epub 2021 Dec 1.

Uncovering Non-random Binary Patterns Within Sequences of Intrinsically Disordered Proteins

Megan C Cohan¹, Min Kyung Shinn¹, Jared M Lalmansingh², Rohit V Pappu³

Affiliations

¹ Department of Biomedical Engineering and Center for Science & Engineering of Living Systems (CSELS), Washington University in St. Louis, MO 63130, USA.
² Department of Physics, Washington University in St. Louis, MO 63130, USA.
³ Department of Biomedical Engineering and Center for Science & Engineering of Living Systems (CSELS), Washington University in St. Louis, MO 63130, USA. Electronic address: pappu@wustl.edu.

PMID: 34863777
PMCID: PMC10178624
DOI: 10.1016/j.jmb.2021.167373

Megan C Cohan et al. J Mol Biol. 2022.

Abstract

Sequence-ensemble relationships of intrinsically disordered proteins (IDPs) are governed by binary patterns such as the linear clustering or mixing of specific residues or residue types with respect to one another. To enable the discovery of potentially important, shared patterns across sequence families, we describe a computational method referred to as NARDINI for Non-random Arrangement of Residues in Disordered Regions Inferred using Numerical Intermixing. This work was partially motivated by the observation that parameters that are currently in use for describing different binary patterns are not interoperable across IDPs of different amino acid compositions and lengths. In NARDINI, we generate an ensemble of scrambled sequences to set up a composition-specific null model for the patterning parameters of interest. We then compute a series of pattern-specific z-scores to quantify how each pattern deviates from the null model for the IDP of interest. The z-scores help in identifying putative non-random linear sequence patterns within an IDP. We demonstrate the use of NARDINI-derived z-scores by identifying sequence patterns in three well-studied IDP systems. We also demonstrate how NARDINI can be deployed to study archetypal IDPs across homologs and orthologs. Overall, NARDINI is likely to aid in designing novel IDPs with a view toward engineering new sequence-function relationships or uncovering cryptic ones. We further propose that the z-scores introduced here are likely to be useful for theoretical and computational descriptions of sequence-ensemble relationships across IDPs of different compositions and lengths.
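The scramble-and-compare idea in the abstract can be sketched as follows. This is a minimal illustration, not the NARDINI implementation: `local_asymmetry` is a hypothetical, simplified stand-in for the patterning parameters used in the paper, the window size of 5 and the default of 1000 scrambles are assumptions chosen for speed, and the two example sequences are made up.

```python
import random
import statistics

def local_asymmetry(seq, group, window=5):
    """Hypothetical stand-in statistic: mean squared deviation of the
    windowed fraction of `group` residues from the whole-sequence fraction."""
    overall = sum(r in group for r in seq) / len(seq)
    devs = []
    for i in range(len(seq) - window + 1):
        local = sum(r in group for r in seq[i:i + window]) / window
        devs.append((local - overall) ** 2)
    return statistics.fmean(devs)

def pattern_z_score(seq, group, n_scrambles=1000, seed=0):
    """z-score of the observed statistic against a null model built from
    random shuffles of the same sequence (same composition and length)."""
    rng = random.Random(seed)
    observed = local_asymmetry(seq, group)
    residues = list(seq)
    null = []
    for _ in range(n_scrambles):
        rng.shuffle(residues)  # composition-preserving scramble
        null.append(local_asymmetry(residues, group))
    mean = statistics.fmean(null)
    sd = statistics.stdev(null)
    return (observed - mean) / sd if sd > 0 else 0.0

# Made-up sequences: one fully segregated, one perfectly alternating.
clustered = "G" * 20 + "S" * 20
mixed = "GS" * 20
print(pattern_z_score(clustered, {"G"}))  # positive: more clustered than the null
print(pattern_z_score(mixed, {"G"}))      # negative: more evenly mixed than the null
```

Because the null model is rebuilt from shuffles of each sequence, the resulting z-score is expressed relative to that sequence's own composition and length, which is the property the abstract highlights.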

Keywords: CIDER; NARDINI; binary patterns; intrinsically disordered proteins/regions.


Conflict of interest statement

Declaration of interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Figure 1. Plot of how the most likely value of $κ_{+ -}$, measured as the mean of the gamma distribution ($m_{κ_{+ -}}$), depends on the fraction of charged residues (FCR) for sequences of 50 and 100 residues.**
The null-scramble expectations, i.e., the mean values of the gamma-distributed $κ_{+ -}$ values, are shown for sequences that are 50 and 100 residues long as a function of FCR. The mean value of $κ_{+ -}$ depends on FCR at low values of FCR (< 0.3), and this dependence is also manifest for different sequence lengths.

**Figure 2. The mean value of SCD, quantified from the gamma distribution of SCD values for 10⁵ scrambled sequences, depends on the sequence composition and length.**
The null-scramble expectations of SCD are plotted for sequences of 50 and 100 residues as a function of the fraction of residue x that is positively charged and y that is negatively charged ($SCD_{xy}$). The expectation of SCD is dependent on the sequence length.

**Figure 3. The mean value for the Ω parameter, as extracted from the gamma distribution obtained using 10⁵ randomly shuffled sequences, depends on amino acid composition.**
The plot shows the null-scramble expectations of the most likely value for $Ω_{x}$ for sequences of 100 residues as a function of the fraction of residue x.

**Figure 4. Workflow for the calculation of z-score matrices for various binary patterns.**
The process includes generating the null-scramble model (“null model”) and calculating the deviation of the observed value from the null model as z-scores.

**Figure 5. Elements of a typical, sequence-specific z-score matrix.**
Each cell quantifies the z-score of a specific patterning parameter. The diagonal elements are z-scores for $Ω$ parameters, whereas the off-diagonal elements are z-scores for the $δ$ parameters. Residues are grouped into eight categories: polar (μ) {S, T, N, Q, C, H}; hydrophobic (h) {I, L, M, V}; positive (+) {R, K}; negative (-) {E, D}; aromatic (π) {F, W, Y}; alanine {A}; proline {P}; and glycine {G}.
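The eight residue categories and the layout of the matrix described in this caption can be written down directly. The sketch below only encodes the grouping and labels each cell as an Ω (diagonal) or δ (off-diagonal) entry; the ordering of the categories, the ASCII "-" used for the negative label, and the names `RESIDUE_TYPES`, `residue_type`, and `matrix_cells` are assumptions made for illustration.

```python
# Residue categories as listed in the Figure 5 caption.
RESIDUE_TYPES = {
    "μ": set("STNQCH"),  # polar
    "h": set("ILMV"),    # hydrophobic
    "+": set("RK"),      # positive
    "-": set("ED"),      # negative
    "π": set("FWY"),     # aromatic
    "A": {"A"},
    "P": {"P"},
    "G": {"G"},
}
TYPE_ORDER = list(RESIDUE_TYPES)  # assumed ordering, not taken from the paper

def residue_type(residue):
    """Return the category label for a one-letter amino acid code."""
    for label, members in RESIDUE_TYPES.items():
        if residue in members:
            return label
    raise ValueError(f"unrecognized residue: {residue!r}")

def matrix_cells():
    """List (type_a, type_b, parameter) for the diagonal and upper triangle:
    Ω on the diagonal, δ for each distinct pair of categories."""
    cells = []
    for i, a in enumerate(TYPE_ORDER):
        for b in TYPE_ORDER[i:]:
            cells.append((a, b, "Ω" if a == b else "δ"))
    return cells

cells = matrix_cells()
print(len(cells))  # 8 diagonal cells plus 28 distinct pairs
print([residue_type(r) for r in "GSYRD"])
```

With eight categories there are 8 Ω entries and 28 distinct δ entries, so each sequence is summarized by 36 z-scores.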

**Figure 6. The z-score matrix of A1-LCD.**
A1-LCD shows non-random segregation of polar and glycine residues from one another and from other residues. This sequence also features non-random dispersion of aromatic residues. White squares on the checkerboard plot imply that the associated z-scores are $\approx 0$.

**Figure 7. Elements of the z-score matrices for 849 homologs of A1-LCD.**
The color bar provides quantitative annotation for the heat map. This analysis reveals the following statistically significant binary patterns across homologs: (i) pronounced segregation of polar and Gly residues from one another ($μ G$); (ii) uniform dispersion of aromatic residues with respect to one another ($π π$); and (iii) segregation of Gly residues into clusters (GG).

**Figure 8. Direct comparison of z-score matrices of RNase E from (a) C. crescentus and (b) E. coli.**
Patterns associated with charged residues in *C. crescentus* RNase E (left) are > +2.4 standard deviations away from the null-scramble model in the positive direction. *E. coli* RNase E shows non-random segregation of positive residues and hydrophobic residues as well as from other residues, and hydrophobic residues also contribute to non-random patterns. Unlike the *C. crescentus* RNase E, patterns involving negative residues in *E. coli* RNase E do not significantly deviate from the null model.

**Figure 9. Analysis of z-score matrices across CTDs from 1084 RNase E orthologs.**
Each row below the row labeled Class denotes the z-score for a distinct binary pattern. The color bar provides annotation of the z-scores. Positive z-scores, shown in red, quantify the extent of linear clustering of residue types within a sequence. Within the alphaproteobacterial class (black), there is a clear preference for the segregation of positively charged residues with respect to all other residue types, leading to tracts of basic residues. This preference is weakened for CTDs of RNase E from the betaproteobacterial class (white). The class-specific preferences are illustrated using a dendrogram shown at the top of the figure. The dendrogram was generated using the Frobenius norms of z-score matrices, where the norms were used as Euclidean distances and Ward’s clustering was used to generate the dendrogram.
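The distance underlying such a dendrogram can be sketched as below. The caption does not spell out the exact construction, so this assumes the distance between two sequences is the Frobenius norm of the difference between their z-score matrices; the function names and the toy 2 × 2 matrices are hypothetical.

```python
import math

def frobenius_distance(a, b):
    """Frobenius norm of (a - b) for two equally shaped matrices,
    given as lists of rows."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

def pairwise_distances(matrices):
    """Symmetric matrix of Frobenius distances between z-score matrices."""
    n = len(matrices)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = frobenius_distance(matrices[i], matrices[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Toy z-score matrices (made-up values, 2 x 2 for brevity).
z_a = [[3.0, 0.0], [0.0, 0.0]]
z_b = [[0.0, 0.0], [0.0, 0.0]]
z_c = [[3.0, 4.0], [0.0, 0.0]]
for row in pairwise_distances([z_a, z_b, z_c]):
    print(row)
```

Under this assumption the Frobenius distance equals the Euclidean distance between the flattened matrices, so the flattened z-score vectors could be handed to a hierarchical clustering routine with Ward linkage, for example `scipy.cluster.hierarchy.linkage` with `method="ward"`.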

**Figure 10. The z-score matrix of E. coli SSB IDL.**
The positioning of glycine, proline, and polar residues along the sequence is significant. The statistically significant deviations from the null model include the linear clustering of Gly residues and the punctuation of these clusters by Pro residues giving the IDL.

**Figure 11. Heatmap of z-scores across IDLs of 1523 orthologous SSBs.**
Notice the positive z-scores for $Ω_{G}$, labeled as GG in the figure. Other statistically significant patterns include the segregation of polar residues from Gly ($μ G$) and acidic residues ($μ -$).

**Figure 12. The frequency of observing a non-random pattern for the SSB IDLs.**
Squares within the checkerboard plot that do not rise above the 10% threshold are shown in white. In all, we analyzed six phylum classes. These include actinobacteria (n = 190), bacilli, $γ$-proteobacteria (n = 359), $α$-proteobacteria (n = 122), $ε$-proteobacteria (n = 143), and spirochaetia (n = 101). The bar graph at the top of the matrix represents the relative frequencies of observing a non-random feature (z > +1.5) involving each residue / residue type.
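The frequency summarized in this figure can be illustrated with a short sketch: for each binary pattern, count the fraction of sequences whose z-score exceeds +1.5, and blank out patterns that do not rise above 10%. The function names and the z-score values below are made up for illustration; only the two thresholds come from the caption.

```python
def nonrandom_frequency(z_scores, z_cut=1.5):
    """Fraction of sequences whose z-score for one pattern exceeds z_cut."""
    if not z_scores:
        return 0.0
    return sum(z > z_cut for z in z_scores) / len(z_scores)

def frequency_matrix(z_by_pattern, z_cut=1.5, min_freq=0.10):
    """Map each pattern label to its frequency, or to None when the
    frequency does not rise above min_freq (a white square in the plot)."""
    result = {}
    for label, z_scores in z_by_pattern.items():
        freq = nonrandom_frequency(z_scores, z_cut)
        result[label] = freq if freq > min_freq else None
    return result

# Made-up z-scores for two patterns across a handful of sequences.
toy = {
    "GG": [2.0, 1.6, 0.3, 1.9],
    "μG": [0.1, -0.4, 1.7, 0.2, 0.0, 0.3, 0.9, 1.0, -1.2, 0.5, 0.4],
}
print(frequency_matrix(toy))
```

In this toy input, three of four sequences exceed the cutoff for GG, whereas only one of eleven does for μG, so the latter falls below the 10% threshold and is reported as `None`.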


