Cell Syst. 2021 Jan 20;12(1):92-101.e8.

doi: 10.1016/j.cels.2020.10.007. Epub 2020 Nov 18.

Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning

Hyebin Song¹, Bennett J Bremer², Emily C Hinds², Garvesh Raskutti³, Philip A Romero⁴

Affiliations

¹ Department of Statistics, The Pennsylvania State University, State College, PA 16802, USA; Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA.
² Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA.
³ Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA.
⁴ Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA; Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA. Electronic address: promero2@wisc.edu.

PMID: 33212013
PMCID: PMC7856229
DOI: 10.1016/j.cels.2020.10.007

Hyebin Song et al. Cell Syst. 2021.

Abstract

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.

Keywords: deep mutational scanning; positive-unlabeled learning; protein engineering; protein sequence function relationships; statistical learning; supervised learning.

Conflict of interest statement

Declaration of Interests: The authors declare no competing interests.

Figures

**Figure 1**
Positive-unlabeled learning from deep mutational scanning (DMS) data. (a) Overview of a typical DMS experiment. DMS experiments start with a large library of gene variants that display a range of activities. The gene library is then expressed and passed through a high-throughput screen or selection that isolates the positive variants. The activity threshold to be categorized as positive will depend on the details of the particular high-throughput screen/selection. It is often difficult or impossible to experimentally isolate negative sequences. Genes from the initial library and the isolated positive variants are then extracted and analyzed using next-generation DNA sequencing. DMS experiments generate thousands to millions of sequence examples from both the initial and positive sets. (b) DMS experiments sample sequences from protein sequence space. The resulting data contain positive labeled sequence examples (Y = 1, Z = 1), and unlabeled sequence examples (Z = 0) that contain a mixture of active and inactive sequences. (c) The relationships between variables representing protein sequences (X), latent function (Y), and the observed labels (Z). Y is not directly observed in DMS experiments and must be inferred from X and Z. (d) PU learning models the true positive-negative (PN) response, while enrichment-based estimates capture the positive-unlabeled (PU) response. Modeling the PU response gives rise to a decision boundary that is shifted toward the positive class, resulting in positive sequences that are misclassified as negative. (e) PU learning estimates the conditional effect of a mutation, while site-wise enrichment estimates the marginal effect. Marginal estimates are biased and in extreme cases can result in a sign reversal phenomenon known as Simpson’s paradox. In the example, we consider amino acid substitutions A→B at two independent sites in a protein. 
If we observe sequences AA, BA, and BB, the marginal estimate will reverse the sign of substitution A→B at the first position. The marginal model will also misclassify sequence BA as positive, even though it was observed to be negative. In contrast, the conditional estimate correctly models the true protein function landscape.
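
The sign reversal described in panel (e) can be reproduced with a toy calculation. The sketch below is only an illustration: the three activity values are hypothetical (they are not taken from the paper), the model is assumed to be purely additive, and `marginal_effect` and `conditional_effects` are names introduced here for the example.

```python
# Hypothetical activities for the three observed sequences (illustrative values,
# not data from the paper). Each sequence has two sites, each carrying A or B.
observed = {"AA": 0.0, "BA": -1.0, "BB": 2.0}


def marginal_effect(site):
    """Mean activity with B at `site` minus mean activity with A at `site`."""
    with_b = [y for seq, y in observed.items() if seq[site] == "B"]
    with_a = [y for seq, y in observed.items() if seq[site] == "A"]
    return sum(with_b) / len(with_b) - sum(with_a) / len(with_a)


def conditional_effects():
    """Solve the additive model y = b0 + b1*[site 0 is B] + b2*[site 1 is B].

    With exactly these three sequences the system is triangular, so it can be
    solved by substitution instead of a general least-squares routine.
    """
    b0 = observed["AA"]
    b1 = observed["BA"] - b0
    b2 = observed["BB"] - b0 - b1
    return b1, b2


marginal_site0 = marginal_effect(0)              # 0.5: A->B looks beneficial
cond_site0, cond_site1 = conditional_effects()   # -1.0 and 3.0

print("marginal effect at site 0:   ", marginal_site0)
print("conditional effect at site 0:", cond_site0)
```

In this toy case the marginal estimate at the first site is positive only because the B-containing sequences also include the strongly beneficial substitution at the second site; the conditional estimate separates the two contributions and recovers the negative effect.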

**Figure 2**
Performance of the PU learning method across protein data sets. (a) Receiver operating characteristic (ROC) curves for the ten tested protein data sets. ROC curves were generated using 10-fold cross-validation and corrected to account for PU data (see Methods Details and Supplemental Figure 1). (b) The PU model’s corrected ROC-AUC values range from 0.68 to 0.98, and outperform structure-based (Rosetta) and unsupervised learning methods (EVmutation and DeepSequence). (c) A statistical comparison between PU model predictions and site-wise enrichment. The PU model outperformed enrichment on all ten tested data sets, with *p* < 10⁻⁹.
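
One common way to correct a ROC curve for PU data treats the unlabeled set as a mixture that contains a fraction α of positives. Under that assumption, the false positive rate measured against unlabeled examples equals α·TPR + (1 − α)·FPR, which can be inverted. The sketch below shows this standard mixture correction; it may differ from the procedure in the paper's Methods Details, and `alpha`, `corrected_fpr`, and `corrected_auc` are names assumed for the example. It also assumes the labeled positives are representative of all positives.

```python
def corrected_fpr(tpr, fpr_pu, alpha):
    """Recover the positive-negative FPR from the FPR measured on unlabeled data.

    Assumes unlabeled examples above the threshold occur at rate
    alpha * TPR + (1 - alpha) * FPR, where alpha is the (assumed known)
    fraction of positives in the unlabeled set.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (fpr_pu - alpha * tpr) / (1.0 - alpha)


def corrected_auc(auc_pu, alpha):
    """Integrate the same relation over the curve: AUC_PU = alpha/2 + (1 - alpha) * AUC."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (auc_pu - alpha / 2.0) / (1.0 - alpha)


# Hypothetical numbers: half of the unlabeled sequences are assumed positive.
example_fpr = corrected_fpr(tpr=0.8, fpr_pu=0.5, alpha=0.5)
example_auc = corrected_auc(auc_pu=0.7, alpha=0.5)
print(round(example_fpr, 3), round(example_auc, 3))
```

The correction always moves the AUC away from 0.5, which is why uncorrected PU curves understate how well a model separates true positives from true negatives.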

**Figure 3**
Model parameters relate to GB1 structure and function. (a) The distribution of model coefficients. Most coefficients have a relatively small magnitude, while a substantial fraction of coefficients have a large negative effect. (b) A heatmap of the GB1 model coefficients. The wild-type amino acid is depicted with a black dot. Buried and interface residues tend to have larger magnitude coefficients, indicating their important role in GB1 function. Buried and interface residues were determined from the protein G crystal structure (PDB ID: 1FCC). Buried residues were defined as having a relative solvent accessibility less than 0.1. Interface residues were defined as having a heavy atom within 4 Å of IgG. (c) The site-wise average model coefficients mapped onto the protein G crystal structure (PDB ID: 1FCC). The IgG binding partner is depicted as a grey surface. Residues in the protein core and binding interface tend to have the largest average coefficients.

**Figure 4**
Applying the PU model to design enhanced proteins. (a) A plot of model coefficients versus p-values. Sequences were designed to combine ten mutations with the largest coefficient values, smallest p-values, or largest enrichment scores. (b) The positions chosen by the three design methods are mapped onto the Bgl3 protein structure. The structure is based on the Bgl3 crystal structure (PDB ID: 1GNX) and missing termini/loops were built in using MODELLER (Sali & Blundell 1993). (c) Thermostability curves for wild-type Bgl3 and the three designed proteins. T₅₀ values were estimated by fitting a sigmoid function to the fraction of active enzyme. Note the curve for Bgl.en is shown in yellow and falls directly behind the orange Bgl.cf curve.
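
As a rough illustration of how a T₅₀ value can be read off a thermostability curve, the sketch below fits a two-parameter sigmoid, f(T) = 1 / (1 + exp(k·(T − T₅₀))), to the fraction of active enzyme. It is not the authors' fitting code: it swaps a nonlinear fit for a logit-linearized least-squares fit so that it needs only the standard library, and the function name `fit_t50`, the parameter `k`, and the data points are all assumptions made for the example.

```python
import math


def fit_t50(temps, fractions):
    """Estimate T50 and steepness k from fraction-active measurements.

    Uses logit(f) = -k * (T - T50), which is linear in T, and fits the line by
    ordinary least squares. Points with f <= 0 or f >= 1 are skipped because
    their logit is undefined.
    """
    xs, ys = [], []
    for t, f in zip(temps, fractions):
        if 0.0 < f < 1.0:
            xs.append(t)
            ys.append(math.log(f / (1.0 - f)))
    if len(xs) < 2:
        raise ValueError("need at least two points with 0 < fraction < 1")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                 # equals -k
    intercept = mean_y - slope * mean_x  # equals k * T50
    return -intercept / slope, -slope


# Synthetic, noise-free data generated from T50 = 50 and k = 0.5 (hypothetical).
temps = [40.0, 45.0, 50.0, 55.0, 60.0]
fractions = [1.0 / (1.0 + math.exp(0.5 * (t - 50.0))) for t in temps]
t50_est, k_est = fit_t50(temps, fractions)
print(round(t50_est, 3), round(k_est, 3))
```

With noisy measurements the logit transform overweights points near 0 and 1, so a direct nonlinear least-squares fit on the untransformed fractions is usually the better choice in practice.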

Cited by

Discovery of human ACE2 variants with altered recognition by the SARS-CoV-2 spike protein.
Heinzelman P, Romero PA. bioRxiv [Preprint]. 2020 Sep 17:2020.09.17.301861. doi: 10.1101/2020.09.17.301861. Update in: PLoS One. 2021 May 12;16(5):e0251585. doi: 10.1371/journal.pone.0251585. PMID: 32995796.

MBE: model-based enrichment estimation and prediction for differential sequencing data.
Busia A, Listgarten J. Genome Biol. 2023 Oct 2;24(1):218. doi: 10.1186/s13059-023-03058-w. PMID: 37784130.

QAFI: a novel method for quantitative estimation of missense variant impact using protein-specific predictors and ensemble learning.
Ozkan S, Padilla N, de la Cruz X. Hum Genet. 2025 Mar;144(2-3):191-208. doi: 10.1007/s00439-024-02692-z. Epub 2024 Jul 24. PMID: 39048855.

Deep neural networks for predicting the affinity landscape of protein-protein interactions.
Meiri R, Aharoni Lotati SL, Orenstein Y, Papo N. iScience. 2024 Aug 19;27(9):110772. doi: 10.1016/j.isci.2024.110772. eCollection 2024 Sep 20. PMID: 39310756.

Design of synthetic human gut microbiome assembly and butyrate production.
Clark RL, Connors BM, Stevenson DM, Hromada SE, Hamilton JJ, Amador-Noguez D, Venturelli OS. Nat Commun. 2021 May 31;12(1):3254. doi: 10.1038/s41467-021-22938-y. PMID: 34059668.
