Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Mehrsa Mardikoraem^{1,2}, Daniel Woldring^{1,2}

Affiliations

¹ Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA.
² Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.

PMID: 37242577
PMCID: PMC10224321
DOI: 10.3390/pharmaceutics15051337

Mehrsa Mardikoraem et al. Pharmaceutics. 2023 Apr 25;15(5):1337. doi: 10.3390/pharmaceutics15051337.

Abstract

Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). We examine how performance varies with protein fitness, protein size, and sampling technique. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling when sequences were encoded with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate, reaching an F1-score of 97%, while ESM alone was sufficient for stability prediction (F1-score = 92%).

Keywords: MCDA; TOPSIS; embeddings; ensemble learning; imbalanced assay-labeled datasets; machine learning; protein fitness prediction; sampling methods; sequence representation.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
Overview of the implemented techniques, data attributes, and evaluation metrics. (A) Illustrates the use of sequence–function mapping to identify protein sequence functionality (e.g., therapeutics, diagnostics, enzymatic function). (B) Data attributes for the two datasets used in this study. The first dataset includes high-fitness protein binders among a pool of non-binder affibody sequences with up to 17 mutation sites. The other dataset includes a wide array of proteins with their associated melting point. (C) One-Hot encoding, physicochemical encoding, and pre-trained models were used to encode the protein sequences present in our datasets. All of these methods present the amino acid information in a machine-readable format, but in different ways. One-Hot encoding converts each amino acid to a binary vector of all 0s except for a single 1 at the index assigned to that amino acid. In physicochemical encoding, each amino acid is represented by its physicochemical characteristics, such as polarity, charge, and size. Pre-trained models are trained over a large corpus of unlabeled data, capturing the syntax and semantics of protein language via NLP-driven models, such as next-token prediction (e.g., UniRep) and masked-token prediction (e.g., ESM). (D) The sampling methods used in this study are undersampling, oversampling, and the synthetic minority oversampling technique (SMOTE). (E) The main metrics used for evaluating the performance of prediction tasks (classification and regression) are defined (a complete list of performance metrics is given in Figure S7).
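
To make the One-Hot description in panel (C) concrete, the following is a minimal sketch in plain Python. The alphabet ordering, the zero-padding to a fixed length, and the function name `one_hot_encode` are illustrative assumptions, not details taken from the study.

```python
# Assumption: the 20 canonical residues, ordered alphabetically by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_length=None):
    """Return one 20-element binary row per position, zero-padded to max_length."""
    max_length = len(sequence) if max_length is None else max_length
    if len(sequence) > max_length:
        raise ValueError("sequence is longer than max_length")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for pos, aa in enumerate(sequence.upper()):
        # Raises KeyError for non-canonical residues (e.g. 'X').
        matrix[pos][AA_INDEX[aa]] = 1
    return matrix


# Hypothetical three-residue peptide, padded to length 4.
encoded = one_hot_encode("ACD", max_length=4)
# Flattening gives the feature vector a classifier would consume.
flat = [bit for row in encoded for bit in row]
```

The flattened vector grows as 20 × length and holds a single 1 per residue, which is one way to see why this encoding becomes very sparse for long proteins.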

**Figure 2**
The leading physical features for discriminating the naïve and enriched classes in the affinity-based data were H_Eisenberg, Boman Index, and H_Gravy. Gravy and Eisenberg capture hydrophobicity scales. The Boman Index is a measure of the protein’s ability to interact with its environment based on the solubility of individual residues. The enriched proteins in our library have gone through negative screening and are specific to their target. Therefore, there is a shift to a lower Boman index for this population. Note that the plot was obtained with SMOTE oversampling in the logistic regression task.
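
As an illustration of a hydrophobicity feature of this kind, the sketch below computes the grand average of hydropathy (GRAVY) as the mean Kyte-Doolittle hydropathy value over a sequence. Whether the H_Gravy feature in the study was computed in exactly this way is an assumption; the function name `gravy` is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean hydropathy of a sequence; positive values indicate a more hydrophobic protein."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return sum(values) / len(values)


# Hypothetical dipeptides: one hydrophobic, one charged.
hydrophobic_score = gravy("AI")
charged_score = gravy("KR")
```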

**Figure 3**
When physical features were used to encode the affibody sequences, the mean F1-score was 75.5% with SMOTE. SMOTE and undersampling were similarly effective, with no statistically significant difference in performance (i.e., the null hypothesis was not rejected). The violin plots are created over 20 random seeds for each sampling method.
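
For readers unfamiliar with SMOTE, the core idea can be sketched as follows: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority-class neighbours. This is a simplified, standard-library sketch rather than the implementation used in the study; the function name `smote`, its parameters, and the toy data are assumptions.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (squared Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank all other minority samples by distance to the chosen base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class: four points at the corners of the unit square.
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_samples, n_synthetic=10, k=2)
```

Unlike random oversampling, which duplicates existing minority samples, the interpolated points are new feature vectors. Undersampling instead discards majority-class samples until the classes are balanced.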

**Figure 4**
Performance analysis of encoding methods highlights the shortcomings of physical features and the strength of the SMOTE sampling method. Protein sequences encoded using physical features, One-Hot, UniRep, and ESM were used to perform classification tasks on the affibody dataset. Within each encoding method, undersampling, random oversampling, and SMOTE sampling methods were evaluated. The resulting F1 scores over 20 random seeds are shown here as violin plots. The p-value obtained from ANOVA was 9.52E−190, indicating a significant difference among the comparisons. Post-hoc results for ranking the methods are shown in Table S1, which supports the conclusions stated in this caption.

**Figure 5**
Voting substantially improved the predictive performance in all random initializations over different encoding methods. The plot has three regions, from left to right: single encoding methods, concatenation of encodings, and voting of predictions. For voting, each encoding was passed through a predictive model trained over the same dataset, and the final prediction was obtained by majority voting. Voting increases the models’ robustness and generalizability. Concatenation performed similarly to or worse than the best single-encoding model. The best model among all predictions was Upvote with oversampling methods, with Mean-F1-score = 97% and Mean-F1-score = 96.80% (no statistically significant difference among oversampling performances in upvoting). Refer to the supplementary material for a summary of the statistical analysis and confusion matrix plots (Table S2, Figure S2).
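
The voting step described in this caption can be sketched as a hard majority vote over per-encoding predictions. The labels below are made up for illustration, and the tie-breaking rule (fall back to the earliest-listed model among the tied labels) is an assumption, since the caption does not state how ties were handled.

```python
from collections import Counter


def majority_vote(predictions_by_encoding):
    """Combine per-encoding label lists into one list of majority-vote labels."""
    prediction_lists = list(predictions_by_encoding.values())
    if len({len(p) for p in prediction_lists}) != 1:
        raise ValueError("all models must predict the same number of samples")
    voted = []
    for votes in zip(*prediction_lists):
        counts = Counter(votes)
        best = max(counts.values())
        tied = {label for label, count in counts.items() if count == best}
        # Assumption: on a tie, take the vote of the earliest-listed model.
        voted.append(next(v for v in votes if v in tied))
    return voted


# Hypothetical binary predictions (1 = enriched, 0 = naive) from three models,
# each trained on a different encoding of the same sequences.
predictions = {
    "one_hot": [1, 0, 1, 0],
    "unirep": [1, 1, 0, 0],
    "esm": [0, 1, 1, 0],
}
final_labels = majority_vote(predictions)
```

Concatenation, by contrast, would join the three encodings into one long feature vector and train a single model on it.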

**Figure 6**
Upvoting achieved the best ranking under both subjective and objective weighting in the MCDA design. (A) The main steps for performing MCDA are outlined, and our selected methods for implementing MCDA are highlighted (e.g., classification criteria, model selection, and statistical analysis). (B) TOPSIS scores (i.e., closeness coefficients) and their associated rankings are shown for subjective and objective weighting.
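
The closeness coefficients in panel (B) can be illustrated with a small sketch of TOPSIS under entropy (objective) weighting. This sketch assumes that every criterion is a benefit criterion (higher is better) with strictly positive values. The method names and metric values are invented for illustration and are not results from the study, and the function names `entropy_weights` and `topsis` are hypothetical.

```python
import math


def entropy_weights(matrix):
    """Objective weights: criteria whose values vary more across methods weigh more."""
    m, n = len(matrix), len(matrix[0])
    diversities = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        diversities.append(max(0.0, 1.0 - entropy))
    spread = sum(diversities)
    if spread <= 1e-12:  # all criteria constant: fall back to equal weights
        return [1.0 / n] * n
    return [d / spread for d in diversities]


def topsis(matrix, weights):
    """Closeness coefficient per row: 1 = ideal solution, 0 = anti-ideal solution."""
    n = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_best, d_worst = math.dist(row, best), math.dist(row, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total > 0 else 0.5)
    return scores


# Made-up decision matrix: rows are methods, columns are three benefit metrics.
methods = ["method_a", "method_b", "method_c"]
decision_matrix = [
    [0.90, 0.80, 0.85],
    [0.80, 0.70, 0.80],
    [0.60, 0.40, 0.60],
]
weights = entropy_weights(decision_matrix)
scores = topsis(decision_matrix, weights)
ranking = [name for _, name in sorted(zip(scores, methods), reverse=True)]
```

Subjective weighting would replace `entropy_weights` with weights chosen by the analyst; the `topsis` step is unchanged.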

**Figure 7**
Different protein encodings potentially capture distinct functional aspects of the proteins. A 2D UMAP visualization of the encoding techniques that resulted in improved prediction in the voting method. UMAP is a dimensionality reduction technique, like principal component analysis (PCA) [65], with unique advantages such as preserving the local structure of the data and capturing non-linear relationships between data points. In observing the sequence–function relationship in proteins, one can conclude that each protein sequence representation/encoding has the potential to capture different aspects of fitness.

**Figure 8**
The effect of protein size on the performance of encoding methods in stability prediction while data sizes vary. The results differ markedly with protein size: small proteins (aa length ≤ 120) vs. large proteins (400 ≤ aa length ≤ 1500). Highlights: For small proteins, the violin plots and statistical test results show that the protein sequence encoding methods performed differently than they did on the initial dataset (protein max length = 500). One-Hot encoding contributed more to boosting the classification metrics for small proteins. As an example, when n = 400, both One-Hot and All-Encoding concatenation, with a mean F1-score of 94%, outperformed the other encoding methods. One-Hot tends to be problematic for large proteins because it results in a highly sparse encoding vector. This is reflected in the plot, where One-Hot encoding performed poorly in comparison with ESM and UniRep. For large proteins with n = 400, based on both the violin plots and the post-hoc analysis after ANOVA (both Bonferroni and Tukey), either ESM or ESM_UniRep, with a 92% mean F1-score, achieved the highest performance. One-Hot, with a 73% mean F1-score, had the lowest score among all the encodings. Refer to the supplementary information for all pairwise comparisons of the statistics and classification.


References

1. Liebermeister W., Noor E., Flamholz A., Davidi D., Bernhardt J., Milo R. Visual Account of Protein Investment in Cellular Functions. Proc. Natl. Acad. Sci. USA. 2014;111:8488–8493. doi: 10.1073/pnas.1314810111.
2. Schlessinger J. Cell Signaling by Receptor Tyrosine Kinases. Cell. 2000;103:211–225. doi: 10.1016/S0092-8674(00)00114-8.
3. Hogan B.L. Bone Morphogenetic Proteins: Multifunctional Regulators of Vertebrate Development. Genes Dev. 1996;10:1580–1594. doi: 10.1101/gad.10.13.1580.
4. Andrianantoandro E., Basu S., Karig D.K., Weiss R. Synthetic Biology: New Engineering Rules for an Emerging Discipline. Mol. Syst. Biol. 2006;2:2006.0028. doi: 10.1038/msb4100073.
5. Heim M., Römer L., Scheibel T. Hierarchical Structures Made of Proteins. The Complex Architecture of Spider Webs and Their Constituent Silk Proteins. Chem. Soc. Rev. 2010;39:156–164. doi: 10.1039/B813273A.
