Protein Sci. 2023 Sep;32(9):e4739.

doi: 10.1002/pro.4739.

IFF: Identifying key residues in intrinsically disordered regions of proteins using machine learning

Wen-Lin Ho¹, Hsuan-Cheng Huang², Jie-Rong Huang¹ ² ³

Affiliations

¹ Institute of Biochemistry and Molecular Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan.
² Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan.
³ Department of Life Sciences and Institute of Genome Sciences, National Yang Ming Chiao Tung University, Taipei, Taiwan.

PMID: 37498545
PMCID: PMC10443345
DOI: 10.1002/pro.4739

Wen-Lin Ho et al. Protein Sci. 2023 Sep.

Abstract

Conserved residues in protein homolog sequence alignments are structurally or functionally important. For intrinsically disordered proteins or proteins with intrinsically disordered regions (IDRs), however, alignment often fails because they lack a steric structure to constrain evolution. Although sequences vary, the physicochemical features of IDRs may be preserved to maintain function. Therefore, a method to retrieve common IDR features may help identify functionally important residues. We applied unsupervised contrastive learning to train a model with self-attention neural networks on human IDR orthologs. The model parameters were trained so that sequences within an ortholog pair match each other but not other IDRs. The trained model successfully identifies previously reported critical residues from experimental studies, especially those with an overall pattern (e.g., multiple aromatic residues or charged blocks) rather than short motifs. This predictive model can be used to identify potentially important residues in other proteins, improving our understanding of their functions. The trained model can be run directly from the Jupyter Notebook in the GitHub repository using Binder (mybinder.org). The only required input is the primary sequence. The training scripts are available on GitHub (https://github.com/allmwh/IFF). The training datasets have been deposited in an Open Science Framework repository (https://osf.io/jk29b).

Keywords: intrinsically disordered proteins; liquid-liquid phase separation; unsupervised contrastive machine learning.
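
The abstract states that the model was trained to match sequences in ortholog pairs but not other IDRs. It does not name the loss function. As an illustration only, the sketch below uses an InfoNCE-style contrastive loss, a common choice for this kind of objective; the function names, the cosine similarity, and the `temperature` value are assumptions and are not taken from the IFF code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(human_emb, ortholog_emb, temperature=0.1):
    """InfoNCE-style loss over a batch of paired embeddings (assumed form).

    human_emb[i] and ortholog_emb[i] form a positive (ortholog) pair.
    ortholog_emb[j] with j != i plays the role of "other IDRs" (negatives).
    """
    n = len(human_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(human_emb[i], ortholog_emb[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the true ortholog as the target.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: matched pairs should give a much lower loss than swapped ones.
human = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(contrastive_loss(human, matched) < contrastive_loss(human, swapped))
```

Minimizing a loss of this shape pulls each human IDR embedding toward its own ortholog and pushes it away from the other sequences in the batch, which is the behavior the abstract describes.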

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**FIGURE 1**
Flowchart of the training scheme. (a) Schematic representation of how the training datasets were constructed from human sequences (orange lines) and orthologs (green lines). (b) A training batch made up of 50 randomly selected subgroups. (c) Embedding of the human sequence and one of its orthologs from the same subgroup (selection probability weighted by dissimilarity) to different dimensions (as a tensor for each sequence). (d) The architecture of the training model. The steps in panels (b)–(d) were repeated 580 times to cover all subgroups in the training set, and the whole process (a training epoch) was repeated 400 times.
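
The batching described in this caption can be sketched as follows. With 50 subgroups per batch and 580 batches per epoch, the training set holds at most 50 × 580 = 29,000 subgroups. The caption says orthologs are selected with probability weighted by dissimilarity but does not define the measure, so the `dissimilarity` function below (fraction of non-matching positions) is a hypothetical stand-in, as are all function names.

```python
import random

BATCH_SIZE = 50  # subgroups per batch, from panel (b)


def dissimilarity(a, b):
    """Hypothetical measure: 1 - (matching positions / longer length)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / max(len(a), len(b))


def pick_ortholog(human_seq, orthologs, rng):
    """Pick one ortholog, favoring those less similar to the human sequence."""
    weights = [dissimilarity(human_seq, o) for o in orthologs]
    if sum(weights) == 0:
        # All orthologs identical to the human sequence: fall back to uniform.
        return rng.choice(orthologs)
    return rng.choices(orthologs, weights=weights, k=1)[0]


def epoch_batches(subgroups, rng, batch_size=BATCH_SIZE):
    """Shuffle the subgroups and yield them in batches, covering each once."""
    order = list(subgroups)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


rng = random.Random(0)
batches = list(epoch_batches(range(29000), rng))
print(len(batches))  # 580 batches of 50 subgroups each
```

One pass over `epoch_batches` corresponds to a single training epoch in the caption; the authors repeat that pass 400 times.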

**FIGURE 2**
Results of the trained model for reference proteins and attention score distributions for individual amino acids. (a–d) Sequences and attention scores for the intrinsically disordered regions of (a) the RNA‐binding proteins TDP‐43, FUS, and hnRNP‐A1, (b) human and zebrafish galectin‐3, (c) NPM1, FMRP, and Caprin‐1, and (d) Pbp‐1. The attention scores appear as heatmaps from high (red) to low (gray) in the top row of each protein along with residue numbers. Amino acids with different physical properties are shown on separate rows as indicated in panel (a). Purple arrows indicate amino acids of known functional importance. (e) Half‐violin plots of the distribution of attention scores in human IDRs for each amino acid, sorted by median value from high (tryptophan, W) to low (alanine, A). IDRs, intrinsically disordered regions.

Cited by

SHARK: web server for alignment-free homology assessment for intrinsically disordered and unalignable protein regions.
Willis Chow CF, Scheremetjew M, Moon H, Ghosh S, Hadarovich A, Hersemann L, Toth-Petroczy A. Nucleic Acids Res. 2025 Jul 7;53(W1):W512-W519. doi: 10.1093/nar/gkaf408. PMID: 40396357. Free PMC article.

SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences.
Chow CFW, Ghosh S, Hadarovich A, Toth-Petroczy A. Proc Natl Acad Sci U S A. 2024 Oct 15;121(42):e2401622121. doi: 10.1073/pnas.2401622121. Epub 2024 Oct 9. PMID: 39383002. Free PMC article.

SHARK-capture identifies functional motifs in intrinsically disordered protein regions.
Chow CFW, Lenz S, Scheremetjew M, Ghosh S, Richter D, Jegers C, von Appen A, Alberti S, Toth-Petroczy A. Protein Sci. 2025 Apr;34(4):e70091. doi: 10.1002/pro.70091. PMID: 40100159. Free PMC article.
