Cell Syst. 2021 Jan 20;12(1):92-101.e8.
doi: 10.1016/j.cels.2020.10.007. Epub 2020 Nov 18.

Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning

Hyebin Song et al. Cell Syst. 2021.

Abstract

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.

Keywords: deep mutational scanning; positive-unlabeled learning; protein engineering; protein sequence function relationships; statistical learning; supervised learning.

Conflict of interest statement

Declaration of Interests: The authors declare no competing interests.

Figures

Figure 1
Positive-unlabeled learning from deep mutational scanning (DMS) data. (a) Overview of a typical DMS experiment. DMS experiments start with a large library of gene variants that display a range of activities. The gene library is then expressed and passed through a high-throughput screen or selection that isolates the positive variants. The activity threshold to be categorized as positive will depend on the details of the particular high-throughput screen/selection. It is often difficult or impossible to experimentally isolate negative sequences. Genes from the initial library and the isolated positive variants are then extracted and analyzed using next-generation DNA sequencing. DMS experiments generate thousands to millions of sequence examples from both the initial and positive sets. (b) DMS experiments sample sequences from protein sequence space. The resulting data contain positive labeled sequence examples (Y = 1, Z = 1), and unlabeled sequence examples (Z = 0) that contain a mixture of active and inactive sequences. (c) The relationships between variables representing protein sequences (X), latent function (Y), and the observed labels (Z). Y is not directly observed in DMS experiments and must be inferred from X and Z. (d) PU learning models the true positive-negative (PN) response, while enrichment-based estimates capture the positive-unlabeled (PU) response. Modeling the PU response gives rise to a decision boundary that is shifted toward the positive class, resulting in positive sequences that are misclassified as negative. (e) PU learning estimates the conditional effect of a mutation, while site-wise enrichment estimates the marginal effect. Marginal estimates are biased and in extreme cases can result in a sign reversal phenomenon known as Simpson’s paradox. In the example, we consider amino acid substitutions A→B at two independent sites in a protein. If we observe sequences AA, BA, and BB, the marginal estimate will reverse the sign of substitution A→B at the first position. The marginal model will also misclassify sequence BA as positive, even though it was observed to be negative. In contrast, the conditional estimate correctly models the true protein function landscape.
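The sign reversal described in panel (e) can be reproduced with a toy calculation. The sketch below is illustrative only: the three activity scores are hypothetical values chosen for this example (the figure's own labels are not reproduced in this text), and it uses continuous scores with a simple additive model rather than the PU likelihood. The function names are likewise assumptions.

```python
from statistics import mean

# Hypothetical activity scores for the three observed sequences (higher is better).
# By construction, A->B at site 1 costs 1 unit and A->B at site 2 gains 3 units.
observed = {"AA": 0.0, "BA": -1.0, "BB": 2.0}


def marginal_effect(data, site):
    """Mean score of sequences with B at `site` minus mean score of those with A."""
    with_b = [score for seq, score in data.items() if seq[site] == "B"]
    with_a = [score for seq, score in data.items() if seq[site] == "A"]
    return mean(with_b) - mean(with_a)


def conditional_effects(data):
    """Additive-model effects: compare sequences that differ at exactly one site."""
    site1 = data["BA"] - data["AA"]  # site 2 held fixed at A
    site2 = data["BB"] - data["BA"]  # site 1 held fixed at B
    return site1, site2


marginal_site1 = marginal_effect(observed, 0)
conditional_site1, conditional_site2 = conditional_effects(observed)

print("marginal effect, site 1:   ", marginal_site1)
print("conditional effect, site 1:", conditional_site1)
print("conditional effect, site 2:", conditional_site2)
```

In this toy data set the marginal estimate for site 1 is positive because the B-at-site-1 group includes BB, whose score is driven by the beneficial site 2 substitution, while the conditional estimate recovers the harmful effect that was built in.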
Figure 2
Performance of the PU learning method across protein data sets. (a) Receiver operating characteristic (ROC) curves for the ten tested protein data sets. ROC curves were generated using 10-fold cross-validation and corrected to account for PU data (see Methods Details and Supplemental Figure 1). (b) The PU model’s corrected ROC-AUC values range from 0.68 to 0.98 and outperform structure-based (Rosetta) and unsupervised learning methods (EVmutation and DeepSequence). (c) A statistical comparison between PU model predictions and site-wise enrichment. The PU model outperformed enrichment on all ten tested data sets, with p < 10⁻⁹.
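One standard way to correct an AUC computed on positive-versus-unlabeled data is sketched below. It may or may not match the correction in the paper's Methods Details. It assumes the fraction of positives in the unlabeled set (called `positive_fraction` here, a hypothetical name) is known, and uses the identity AUC_PU = π/2 + (1 − π)·AUC_PN, which holds because a labeled positive outranks an unlabeled positive half the time on average. The scores are made up for illustration.

```python
def auc(scores_a, scores_b):
    """Probability that a score from A exceeds a score from B; ties count one half."""
    wins = sum((a > b) + 0.5 * (a == b) for a in scores_a for b in scores_b)
    return wins / (len(scores_a) * len(scores_b))


def corrected_auc(auc_pu, positive_fraction):
    """Invert AUC_PU = pi/2 + (1 - pi) * AUC_PN to get the positive-vs-negative AUC."""
    return (auc_pu - 0.5 * positive_fraction) / (1.0 - positive_fraction)


# Made-up model scores. The unlabeled set mixes positives and negatives; here its
# positives are given the same scores as the labeled ones so the identity is exact.
labeled_pos = [0.9, 0.8, 0.6, 0.4]
true_neg = [0.7, 0.5, 0.3, 0.2, 0.1, 0.05]
unlabeled = labeled_pos + true_neg

positive_fraction = len(labeled_pos) / len(unlabeled)  # assumed known in practice
auc_pu = auc(labeled_pos, unlabeled)      # what PU data let us measure directly
auc_pn = corrected_auc(auc_pu, positive_fraction)
auc_true = auc(labeled_pos, true_neg)     # only computable when negatives are known

print(auc_pu, auc_pn, auc_true)
```

The uncorrected PU value understates performance here because every unlabeled positive is counted as if it were a negative.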
Figure 3
Model parameters relate to GB1 structure and function. (a) The distribution of model coefficients. Most coefficients have a relatively small magnitude, while a substantial fraction of coefficients have a large negative effect. (b) A heatmap of the GB1 model coefficients. The wild-type amino acid is depicted with a black dot. Buried and interface residues tend to have larger-magnitude coefficients, indicating their important role in GB1 function. Buried and interface residues were determined from the protein G crystal structure (PDB ID: 1FCC). Buried residues were defined as having a relative solvent accessibility less than 0.1. Interface residues were defined as having a heavy atom within 4 Å of IgG. (c) The site-wise average model coefficients mapped onto the protein G crystal structure (PDB ID: 1FCC). The IgG binding partner is depicted as a grey surface. Residues in the protein core and binding interface tend to have the largest average coefficients.
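The two residue definitions in panel (b) can be written as a small classifier. This is a sketch under stated assumptions: the function name, the input layout (one precomputed relative solvent accessibility value plus lists of heavy-atom coordinates in Å), and the choice to treat "within 4 Å" as inclusive are all hypothetical, and the coordinates in the usage lines are invented rather than taken from 1FCC.

```python
import math


def classify_residue(rel_solvent_accessibility, residue_atoms, partner_atoms,
                     rsa_cutoff=0.1, distance_cutoff=4.0):
    """Return the set of labels ("buried", "interface") that apply to one residue.

    residue_atoms / partner_atoms: iterables of (x, y, z) heavy-atom coordinates in Å.
    """
    labels = set()
    # Buried: relative solvent accessibility below the cutoff.
    if rel_solvent_accessibility < rsa_cutoff:
        labels.add("buried")
    # Interface: any heavy atom of the residue within the cutoff of any partner atom.
    if any(math.dist(atom, partner) <= distance_cutoff
           for atom in residue_atoms for partner in partner_atoms):
        labels.add("interface")
    return labels


# Invented coordinates, for illustration only.
core_residue = classify_residue(0.05, [(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)])
contact_residue = classify_residue(0.50, [(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)])
print(core_residue, contact_residue)
```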
Figure 4
Applying the PU model to design enhanced proteins. (a) A plot of model coefficients versus p-values. Sequences were designed to combine ten mutations with the largest coefficient values, smallest p-values, or largest enrichment scores. (b) The positions chosen by the three design methods are mapped onto the Bgl3 protein structure. The structure is based on the Bgl3 crystal structure (PDB ID: 1GNX) and missing termini/loops were built in using MODELLER (Sali & Blundell 1993). (c) Thermostability curves for wild-type Bgl3 and the three designed proteins. T50 values were estimated by fitting a sigmoid function to the fraction of active enzyme. Note the curve for Bgl.en is shown in yellow and falls directly behind the orange Bgl.cf curve.
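A T50 estimate of the kind mentioned in panel (c) can be sketched as follows. The caption does not give the sigmoid's functional form or the fitting procedure, so this example assumes a two-parameter logistic curve, fraction = 1 / (1 + exp((T − T50) / k)), and swaps in a linearized fit (ordinary least squares on the logit of the fraction) in place of a nonlinear curve fit. The function name and the data are hypothetical.

```python
import math


def estimate_t50(temperatures, fraction_active):
    """Estimate T50 by least squares on logit(fraction) = intercept + slope * T.

    Under the assumed logistic model the logit is linear in temperature,
    and T50 is the temperature where the logit crosses zero.
    """
    # The logit is undefined at exactly 0 or 1, so those points are dropped.
    points = [(t, math.log(f / (1.0 - f)))
              for t, f in zip(temperatures, fraction_active) if 0.0 < f < 1.0]
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    intercept = mean_y - slope * mean_t
    return -intercept / slope


# Synthetic, noise-free data generated from the assumed model with T50 = 55.
temps = [40, 45, 50, 55, 60, 65, 70]
fractions = [1.0 / (1.0 + math.exp((t - 55.0) / 2.5)) for t in temps]

t50 = estimate_t50(temps, fractions)
print(round(t50, 3))
```

With noisy measurements, a nonlinear least-squares fit on the untransformed fractions is usually preferable, because the logit transform amplifies errors near 0 and 1.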

References

    1. Abriata LA, Bovigny C & Dal Peraro M (2016), ‘Detection and sequence/structure mapping of biophysical constraints to protein variation in saturated mutational libraries and protein sequence alignments with a dedicated server’, BMC Bioinformatics 17(1).
    2. Alford RF, Leaver-Fay A, Jeliazkov JR, O’Meara MJ, DiMaio FP, Park H, Shapovalov MV, Renfrew PD, Mulligan VK, Kappel K, Labonte JW, Pacella MS, Bonneau R, Bradley P, Dunbrack RL, Das R, Baker D, Kuhlman B, Kortemme T & Gray JJ (2017), ‘The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design’, Journal of Chemical Theory and Computation 13(6), 3031–3048.
    3. Alvizo O, Nguyen LJ, Savile CK, Bresson JA, Lakhapatri SL, Solis EOP, Fox RJ, Broering JM, Benoit MR, Zimmerman SA, Novick SJ, Liang J & Lalonde JJ (2014), ‘Directed evolution of an ultrastable carbonic anhydrase for highly efficient carbon capture from flue gas’, Proceedings of the National Academy of Sciences 111(46), 16436–16441.
    4. Bedbrook CN, Yang KK, Robinson JE, Mackey ED, Gradinaru V & Arnold FH (2019), ‘Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics’, Nature Methods 16(11), 1176–1184.
    5. Benjamini Y & Hochberg Y (1995), ‘Controlling the false discovery rate: A practical and powerful approach to multiple testing’, J. R. Stat. Soc. Series B Stat. Methodol 57(1), 289–300.