Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods
- PMID: 37242577
- PMCID: PMC10224321
- DOI: 10.3390/pharmaceutics15051337
Abstract
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges in training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Performance is analyzed with respect to protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling when sequences were encoded with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was sufficient for stability prediction (F1-score = 92%).
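The ranking step named in the abstract (TOPSIS with entropy weighting) can be illustrated with a minimal sketch. The code below is not the authors' implementation: it is the textbook form of the technique in plain Python, and the function names, the candidate labels A, B, C, and the metric values are all hypothetical, chosen only to make the arithmetic easy to follow. It assumes strictly positive scores, at least two candidates, and at least one criterion on which the candidates differ.

```python
import math


def entropy_weights(matrix):
    """Entropy weights for a decision matrix (rows = candidates, cols = criteria).

    Criteria whose values vary more across candidates receive larger weights.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    k = 1.0 / math.log(n_rows)
    diversities = []
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        total = sum(column)
        shares = [value / total for value in column]
        # By convention, 0 * log(0) is treated as 0.
        entropy = -k * sum(p * math.log(p) for p in shares if p > 0)
        diversities.append(1.0 - entropy)
    total_diversity = sum(diversities)
    return [d / total_diversity for d in diversities]


def topsis(matrix, weights, benefit):
    """Closeness of each candidate to the ideal solution (1 = best, 0 = worst).

    benefit[j] is True when larger values of criterion j are better.
    """
    n_cols = len(matrix[0])
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)]
                for row in matrix]
    columns = [[row[j] for row in weighted] for j in range(n_cols)]
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    closeness = []
    for row in weighted:
        d_pos = math.sqrt(sum((v - i) ** 2 for v, i in zip(row, ideal)))
        d_neg = math.sqrt(sum((v - a) ** 2 for v, a in zip(row, anti)))
        closeness.append(d_neg / (d_pos + d_neg))
    return closeness


# Hypothetical toy scores (NOT results from the paper): three candidate
# methods evaluated on three metrics where higher is better.
names = ["A", "B", "C"]
scores = [
    [0.90, 0.80, 0.85],
    [0.70, 0.60, 0.65],
    [0.80, 0.75, 0.70],
]
weights = entropy_weights(scores)
closeness = topsis(scores, weights, benefit=[True, True, True])
ranking = [name for _, name in sorted(zip(closeness, names), reverse=True)]
print(ranking)
```

In this toy case candidate A is best on every metric and B is worst on every metric, so A coincides with the ideal solution and B with the anti-ideal one. The entropy weights matter only for candidates that trade off between criteria.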
Keywords: MCDA; TOPSIS; embeddings; ensemble learning; imbalanced assay-labeled datasets; machine learning; protein fitness prediction; sampling methods; sequence representation.
Conflict of interest statement
The authors declare no conflict of interest.
Similar articles
- Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J Cheminform. 2020 Oct 27;12(1):66. doi: 10.1186/s13321-020-00468-x. PMID: 33372637. Free PMC article.
- Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction. Int J Mol Sci. 2023 Nov 18;24(22):16496. doi: 10.3390/ijms242216496. PMID: 38003686. Free PMC article.
- Effect of machine learning re-sampling techniques for imbalanced datasets in 18F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients. Eur J Nucl Med Mol Imaging. 2020 Nov;47(12):2826-2835. doi: 10.1007/s00259-020-04756-4. Epub 2020 Apr 6. PMID: 32253486.
- Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage. BMC Med Inform Decis Mak. 2022 Oct 25;22(1):278. doi: 10.1186/s12911-022-02018-x. PMID: 36284327. Free PMC article.
- Comparison of machine learning techniques to predict all-cause mortality using fitness data: the Henry Ford exercIse testing (FIT) project. BMC Med Inform Decis Mak. 2017 Dec 19;17(1):174. doi: 10.1186/s12911-017-0566-6. PMID: 29258510. Free PMC article.
Cited by
- Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Cent Sci. 2024 Feb 5;10(2):226-241. doi: 10.1021/acscentsci.3c01275. eCollection 2024 Feb 28. PMID: 38435522. Free PMC article. Review.
- Prediction of protein biophysical traits from limited data: a case study on nanobody thermostability through NanoMelt. MAbs. 2025 Dec;17(1):2442750. doi: 10.1080/19420862.2024.2442750. Epub 2025 Jan 8. PMID: 39772905. Free PMC article.
- Elucidating the Molecular Mechanisms of Hederagenin-Regulated Mitophagy in Cervical Cancer SiHa Cells through an Integrative Approach Combining Proteomics and Advanced Network Association Algorithm. J Proteome Res. 2025 Apr 4;24(4):2081-2095. doi: 10.1021/acs.jproteome.5c00022. Epub 2025 Mar 26. PMID: 40135937. Free PMC article.
- Determining key residues of engineered scFv antibody variants with improved MMP-9 binding using deep sequencing and machine learning. Comput Struct Biotechnol J. 2024 Oct 10;23:3759-3770. doi: 10.1016/j.csbj.2024.10.005. eCollection 2024 Dec. PMID: 39525083. Free PMC article.
- Protein engineering via sequence-performance mapping. Cell Syst. 2023 Aug 16;14(8):656-666. doi: 10.1016/j.cels.2023.06.009. Epub 2023 Jul 25. PMID: 37494931. Free PMC article. Review.