Bioinformatics. 2025 Oct 2;41(10):btaf544. doi: 10.1093/bioinformatics/btaf544.

Accessible, uniform protein property prediction with a scikit-learn based toolset AIDE


Evan Komp et al. Bioinformatics.

Abstract

Summary: Protein property prediction via machine learning, with and without labeled data, is becoming increasingly powerful, yet methods are disparate and capabilities vary widely across applications. The software presented here, "Artificial Intelligence Driven protein Estimation (AIDE)", enables instantiating, optimizing, and testing many zero-shot and supervised property prediction methods for variants and variable-length homologs in a single, reproducible notebook or script by defining a modular, standardized application programming interface (API) that is drop-in compatible with scikit-learn transformers and pipelines.

Availability and implementation: AIDE is an installable, importable Python package that inherits from scikit-learn classes and their API, and it runs on Windows, Mac, and Linux. Many of the models wrapped by AIDE are effectively unusable without a GPU, and some assume CUDA. The newest stable, tested version can be found at https://github.com/beckham-lab/aide_predict, and a full user guide and API reference are available at https://beckham-lab.github.io/aide_predict/. Static versions of both at the time of writing can be found on Zenodo.
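The scikit-learn drop-in compatibility described above follows the standard transformer contract (fit/transform on a feature extractor, fit/predict on a pipeline). A minimal sketch of that pattern (all class and variable names here are hypothetical illustrations, not the actual AIDE API):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

class OneHotProteinEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical sklearn-style transformer: one-hot encodes
    fixed-length protein variant sequences into a 2D numpy matrix."""

    def fit(self, X, y=None):
        self.length_ = len(X[0])  # assumes equal-length variants
        return self

    def transform(self, X):
        idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
        out = np.zeros((len(X), self.length_ * len(AMINO_ACIDS)))
        for r, seq in enumerate(X):
            for p, aa in enumerate(seq):
                out[r, p * len(AMINO_ACIDS) + idx[aa]] = 1.0
        return out

# Drop-in use with standard scikit-learn pipeline machinery
pipe = Pipeline([("ohe", OneHotProteinEncoder()), ("reg", Ridge())])
seqs = ["ACDE", "ACDF", "AGDE", "ACHE"]
y = [0.1, 0.4, 0.2, 0.9]
pipe.fit(seqs, y)
preds = pipe.predict(seqs)
print(preds.shape)  # (4,)
```

Because the encoder honors the fit/transform contract, it composes with any downstream scikit-learn regressor, cross-validator, or hyperparameter search without special handling.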


Figures

Figure 1.
Overview of AIDE. The package supplies a scikit-learn model subclass that operates on a dataclass representing proteins, as opposed to numpy matrices, and a number of mixin protocols that let protein models adopt common behavior and compatibility checks for user data. These models are drop-in compatible with existing scikit-learn classes and pipelines, allowing multistep processes to be defined within a single script in a reproducible way, as one would for a traditional scikit-learn application.
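The dataclass-plus-mixin design described in the Figure 1 caption can be sketched as a protein container paired with a mixin that validates user data before fitting. All names below are hypothetical stand-ins, not AIDE's actual classes:

```python
from dataclasses import dataclass

@dataclass
class ProteinSequences:
    """Hypothetical container standing in for a protein dataclass."""
    sequences: list

    @property
    def aligned(self):
        # True only when all sequences share one length
        return len({len(s) for s in self.sequences}) == 1

class RequiresFixedLengthMixin:
    """Hypothetical mixin: models needing aligned, fixed-length input
    inherit this to check compatibility with the user's data."""
    def check_compatibility(self, X: ProteinSequences):
        if not X.aligned:
            raise ValueError("model requires fixed-length sequences")

class ToyVariantModel(RequiresFixedLengthMixin):
    def fit(self, X: ProteinSequences, y=None):
        self.check_compatibility(X)  # fail fast on incompatible data
        return self

model = ToyVariantModel().fit(ProteinSequences(["ACDE", "AGDE"]))
```

The mixin pattern lets each wrapped model declare its data requirements once and have them enforced uniformly, rather than failing deep inside a third-party model's code.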
Figure 2.
Performance of various models on the four-site epistatic combinatorial dataset of Johnston et al. as a function of dataset size (Johnston et al. 2024). Embeddings for the supervised model include one-hot encoding (ohe), ESM2 mean pooling over the entire protein sequence (full), and pooling over only the four variable positions (four sites) (Meier et al. 2021). (A) Purely supervised ("Sup.", dashed line with color) versus zero-shot (ZS, grey) and zero-shot augmented ("Aug.", solid line with color) models with EVCouplings scores (Hopf et al. 2019). Augmentation improves the performance of one-hot encoding at low training data but does not affect ESM-based embeddings, and its effect is negligible beyond 500 training points. (B) Top-10 recovery, defined as the fraction of the 10 true best variants recovered in a final set of 96, chosen using linear versus nonlinear purely supervised models; at about 1000 training examples, nonlinear models begin to significantly outperform linear ones.
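The zero-shot augmentation compared in panel (A) amounts to supplying the zero-shot score to the supervised model alongside the sequence features. A minimal sketch of the feature-concatenation form of this idea, on synthetic data with hypothetical variable names:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200
ohe = rng.integers(0, 2, size=(n, 40)).astype(float)  # toy one-hot features
zs = rng.normal(size=(n, 1))                          # toy zero-shot scores
y = ohe @ rng.normal(size=40) + 0.5 * zs.ravel()      # synthetic fitness

X_aug = np.hstack([ohe, zs])   # augmented design matrix
sup = Ridge().fit(ohe, y)      # purely supervised model
aug = Ridge().fit(X_aug, y)    # zero-shot augmented model
print(X_aug.shape)  # (200, 41)
```

The intuition matches panel (A): when the base features are weak (one-hot at low data), the zero-shot column carries extra signal; when the features are already rich (ESM embeddings) or training data is plentiful, the added column contributes little.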
Figure 3.
Performance of various combinations of embedding strategy and predictor head for PETase homolog activity prediction. PET hydrolysis activity was measured at pH = 5.5, 40°C (Norton-Baker et al. 2025). Horizontal lines show the "null" model of using only an HMM score (cyan for Spearman correlation to measured activity, salmon for area under the receiver operating characteristic curve, "roc_auc", of nonzero measured activity). The null model is incapable of distinguishing active from inactive PET homologs. Bars are five-fold CV scores after hyperparameter optimization of linear and nonlinear random forest models against four embedding strategies: ESM mean pooling, SaProt mean pooling, MSATransformer flattened embedding of the query sequence, and one-hot encoding of a held-out alignment (Meier et al. 2021, Rao et al. 2021, Su et al. 2024).
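The bar values in Figure 3 follow the standard scikit-learn pattern of hyperparameter optimization scored by cross-validation. A sketch of that pattern with a Spearman scorer and a random forest head, on synthetic stand-in embeddings (grid values and data are hypothetical, not those used in the paper):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def spearman_score(y_true, y_pred):
    # rank correlation, the metric shown in cyan in Figure 3
    return spearmanr(y_true, y_pred).correlation

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))          # stand-in for sequence embeddings
y = X[:, 0] + 0.1 * rng.normal(size=120)  # synthetic activity signal

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    scoring=make_scorer(spearman_score),
    cv=5,  # five-fold CV, as reported for the bars in Figure 3
)
search.fit(X, y)
print(search.best_params_)
```

Because AIDE's wrapped models follow the scikit-learn API, any of the embedding strategies in Figure 3 can in principle slot into this same search in place of the synthetic `X`.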

