Learning protein constitutive motifs from sequence data
- PMID: 30857591
- PMCID: PMC6436896
- DOI: 10.7554/eLife.39397
Abstract
Statistical analysis of evolutionarily related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. Here we apply RBM to 20 protein families and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helices and β-sheets), and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and 'turning up' or 'turning down' the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype-phenotype relationship for protein families.
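To make the idea of 'turning up' a mode concrete, the following is a minimal sketch of an RBM over aligned protein sequences, not the authors' implementation. It assumes binary (Bernoulli) hidden units, a 21-letter alphabet (20 amino acids plus a gap), and random, untrained parameters; the names `hidden_inputs`, `sample_hidden`, `sample_sequence`, `W`, and `g` are hypothetical and chosen only for illustration.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the alignment gap
Q = len(ALPHABET)


def hidden_inputs(seq, W):
    """Input to each hidden unit: I_mu = sum_i W[mu][i][a_i]."""
    return [sum(W_mu[i][ALPHABET.index(a)] for i, a in enumerate(seq)) for W_mu in W]


def sample_hidden(seq, W, rng):
    """Sample hidden units given a sequence (Bernoulli units are an assumption)."""
    return [1.0 if rng.random() < 1.0 / (1.0 + math.exp(-I)) else 0.0
            for I in hidden_inputs(seq, W)]


def sample_sequence(h, W, g, rng):
    """Sample each site from a softmax over g[i][a] + sum_mu h[mu] * W[mu][i][a]."""
    seq = []
    for i in range(len(g)):
        logits = [g[i][a] + sum(h[mu] * W[mu][i][a] for mu in range(len(W)))
                  for a in range(Q)]
        top = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(x - top) for x in logits]
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq)


# Toy usage with random, untrained parameters (illustration only).
rng = random.Random(0)
L, M = 8, 3  # sequence length and number of hidden units (arbitrary)
W = [[[rng.gauss(0.0, 0.1) for _ in range(Q)] for _ in range(L)] for _ in range(M)]
g = [[0.0] * Q for _ in range(L)]

h = sample_hidden("ACDEFGHI", W, rng)
h[0] = 1.0  # "turn up" hidden unit 0 by clamping it on
designed = sample_sequence(h, W, g, rng)
```

In a trained model, each hidden unit's weights `W[mu]` would play the role of one of the learned features, and clamping a unit before sampling the visible layer is one simple way to bias generated sequences toward that feature.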
Keywords: coevolution; computational biology; machine learning; physics of living systems; sequence analysis; systems biology.
© 2019, Tubiana et al.
Conflict of interest statement
JT, SC, RM: No competing interests declared.
Similar articles
- Learning Compositional Representations of Interacting Systems with Restricted Boltzmann Machines: Comparative Study of Lattice Proteins. Neural Comput. 2019 Aug;31(8):1671-1717. doi: 10.1162/neco_a_01210. Epub 2019 Jul 1. PMID: 31260391
- Generative Modeling of RNA Sequence Families with Restricted Boltzmann Machines. Methods Mol Biol. 2025;2847:163-175. doi: 10.1007/978-1-0716-4079-1_11. PMID: 39312143
- Proteomics analysis of two heat shock proteins in insects. J Biomol Struct Dyn. 2019 Jul;37(10):2652-2668. doi: 10.1080/07391102.2018.1494632. Epub 2018 Nov 17. PMID: 30052126
- Intra-molecular pathways of allosteric control in Hsp70s. Philos Trans R Soc Lond B Biol Sci. 2018 Jun 19;373(1749):20170183. doi: 10.1098/rstb.2017.0183. PMID: 29735737 Free PMC article. Review.
- Computational prediction of protein-protein interactions. Methods Mol Biol. 2004;261:445-68. doi: 10.1385/1-59259-762-9:445. PMID: 15064475 Review.
Cited by
- Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. bioRxiv [Preprint]. 2023 Oct 12:2023.10.10.561808. doi: 10.1101/2023.10.10.561808. Update in: Natl Sci Rev. 2023 Dec 28;10(12):nwad331. doi: 10.1093/nsr/nwad331. PMID: 37873334 Free PMC article. Updated. Preprint.
- Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun. 2023 Apr 19;14(1):2222. doi: 10.1038/s41467-023-37958-z. PMID: 37076519 Free PMC article.
- RBM-MHC: A Semi-Supervised Machine-Learning Method for Sample-Specific Prediction of Antigen Presentation by HLA-I Alleles. Cell Syst. 2021 Feb 17;12(2):195-202.e9. doi: 10.1016/j.cels.2020.11.005. Epub 2020 Dec 17. PMID: 33338400 Free PMC article.
- Bézier interpolation improves the inference of dynamical models from data. Phys Rev E. 2023 Feb;107(2-1):024116. doi: 10.1103/PhysRevE.107.024116. PMID: 36932614 Free PMC article.
- Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families. Mol Biol Evol. 2022 Apr 10;39(4):msac070. doi: 10.1093/molbev/msac070. PMID: 35353898 Free PMC article.