Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning
- PMID: 33212013
- PMCID: PMC7856229
- DOI: 10.1016/j.cels.2020.10.007
Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning
Abstract
Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
Keywords: deep mutational scanning; positive-unlabeled learning; protein engineering; protein sequence function relationships; statistical learning; supervised learning.
Copyright © 2020 Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of Interests The authors declare no competing interests.
Figures
References
-
- Alford RF, Leaver-Fay A, Jeliazkov JR, O’Meara MJ, DiMaio FP, Park H, Shapovalov MV, Renfrew PD, Mulligan VK, Kappel K, Labonte JW, Pacella MS, Bonneau R, Bradley P, Dunbrack RL, Das R, Baker D, Kuhlman B, Kortemme T & Gray JJ (2017), ‘The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design’, Journal of Chemical Theory and Computation 13(6), 3031–3048. - PMC - PubMed
-
- Alvizo O, Nguyen LJ, Savile CK, Bresson JA, Lakhapatri SL, Solis EOP, Fox RJ, Broering JM, Benoit MR, Zimmerman SA, Novick SJ, Liang J & Lalonde JJ (2014), ‘Directed evolution of an ultrastable carbonic anhydrase for highly efficient carbon capture from flue gas’, Proceedings of the National Academy of Sciences 111(46), 16436–16441. - PMC - PubMed
-
- Benjamini Y & Hochberg Y (1995), ‘Controlling the false discovery rate: A practical and powerful approach to multiple testing’, J. R. Stat. Soc. Series B Stat. Methodol 57(1), 289–300.
Publication types
MeSH terms
Substances
Grants and funding
LinkOut - more resources
Full Text Sources
Other Literature Sources
