Review
MAbs. 2022 Jan-Dec;14(1):2008790.
doi: 10.1080/19420862.2021.2008790.

Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies

Rahmad Akbar et al. MAbs. 2022 Jan-Dec.

Abstract

Although the therapeutic efficacy and commercial success of monoclonal antibodies (mAbs) are tremendous, the design and discovery of new candidates remain a time- and cost-intensive endeavor. In this regard, progress in the generation of data describing antigen binding and developability, computational methodology, and artificial intelligence may pave the way for a new era of in silico on-demand immunotherapeutics design and discovery. Here, we argue that the main necessary machine learning (ML) components for an in silico mAb sequence generator are: understanding of the rules of mAb-antigen binding, capacity to modularly combine mAb design parameters, and algorithms for unconstrained parameter-driven in silico mAb sequence synthesis. We review the current progress toward the realization of these necessary components and discuss the challenges that must be overcome to allow the on-demand ML-based discovery and design of fit-for-purpose mAb therapeutic candidates.

Keywords: Machine learning; antibody; antigen; artificial intelligence; developability; drug design.

Conflict of interest statement

V.G. declares advisory board positions in aiNET GmbH and Enpicom B.V. V.G. is a consultant for Roche/Genentech.

Figures

Figure 1.
Overview of progress and challenges within the three technological pillars for ML-based on-demand generation of mAb therapeutic candidates, namely learnability, modularity, and unconstrained generation. We highlight three key optimizable design parameters for in silico on-demand mAb design: (i) the amino acid (AA) residues at the surface of the antigen (epitope) that engage the antibody residues (paratope) at the interaction interface, (ii) the strength of an antibody–antigen interaction (affinity), and (iii) the extent to which the mAb successfully progresses from the discovery to the development phase (developability). We discuss these design parameters from the perspective of three technological pillars: (i) learnability indicates the presence of rules underlying antibody–antigen interactions as well as antibody developability, (ii) modularity signifies that antibody design parameters could be impacted by multiple regions on the antibody and the extent to which they can be recombined interdependently, and (iii) unconstrained generation signifies the capacity of high-throughput in silico synthesis of fit-for-purpose mAb candidates.
Figure 2.
Overview of public datasets on antibody developability and on experimental or synthetic sequence and structural antibody(-antigen) data. The available sequence and structural datasets were queried from Europe PubMed Central (europepmc.org) using the keywords “antibody” and “database”, filtered for publications that contain these keywords in the title, and supplemented by manual literature curation (code and data are available as mentioned in the Code availability section of this manuscript). The datasets are visualized with respect to sequence or structure, and the availability of binding affinity, antigen annotation, developability parameters, or paratope and epitope information. Sequence (red), structure (blue), synthetic structural data (purple) and developability (grey) are color-coded. Each circle corresponds to a specific type of data. The outer circles correspond to the global data (sequences, structures, synthetic structures, and developability), and the inner ones to the subdata (antibody-antigen complexes, Ig repertoire, mAbs, and paratope and epitope). A separate outer circle is used for developability because its data types differ from the others. Since there is not a single database containing quantitative information about the available developability parameters, we used the data from as an example for visualizing the scarcity of available experimental developability information. The outer red ring represents the number of antibody sequences in the iReceptor database (the largest publicly available sequence dataset), the outer purple ring the number of synthetic antibody–antigen binding structures from Absolut! (the largest publicly available synthetic antibody–antigen structural dataset), the outer blue ring the number of structures from AbDb (curated antibody–antigen structural data obtained from the Protein Data Bank), and the outer grey ring developability information. The inner rings illustrate information about antibody-antigen complexes, Ig repertoire, therapeutic antibodies, and paratope and epitope data. For a curated overview of available databases, see Focus Box 1.
Figure 3.
Major ML components that could enable the identification of the rules that govern antibody design parameters (binding, paratope-epitope, and developability). These components relate to the five ML challenges, namely (1) predictability, (2) generalization, (3) interpretability, (4) model uncertainty, and (5) data completeness. Multiplexing (integration and augmentation) of data with varying degrees of information may improve the completeness of the training data, which would consequently produce an informed representation (learned or otherwise) and allow for data-driven mAb design. As synthetic data tend to be superior (crisp icons) to experimental data (fuzzy icons) with respect to quantity and extent of completeness (the parameters and rules underlying the data are known), the augmentation of sparse experimental data with synthetic data may yield a dataset that is more complete than either subset alone. The training of advanced deep learning architectures on an informed representation (containing sequence, developability, affinity, linguistic [Focus Box 2], and paratope-epitope features), either via online (continuous) or batch (one-off bulk data) learning, would result in high-accuracy models that may well be capable of generalization. Importantly, the mapping of features that are critical for the predictive performance of the model (interpretability) must be undertaken to allow for rule inference and, consequently, rule-driven design.
Figure 4.
Mapping of developability parameters to the antibody regions. The high-level developability parameters are shown in bold font and placed within black boxes, with the respective mapped antibody regions listed in brackets below each box and referred to with dashed black arrows. The widely used low-level physicochemical developability parameters are shown in grey text and connected to the respective high-level developability parameters with solid grey arrows (detailed further in Table 1). Antibody regions are color-coded as follows: Fc: grey, VH: red, VL: purple, CDRs: blue. High-level developability parameters: Viscosity, solubility, and aggregation propensity of mAbs are mainly linked to the surface-exposed regions of mAb molecules. Antigen specificity, binding affinity, and thermal stability, on the other hand, are mainly associated with the CDRs. All regions of the antibody can impact half-life and immunogenicity. Low-level developability parameters: Viscosity has been reported to be influenced by charge, hydrophobicity, atomic/diffusion interaction, and the isoelectric point (pI) of the mAb molecule. Solvent exposure area and AA composition are frequently reported to impact the solubility of the antibody. Charge and hydrophobicity were also found to affect the aggregation likelihood of antibody preparations, together with stability and spatial aggregation propensity (SAP) measures. The binding affinity of the Fc region to FcRn significantly impacts mAb pharmacokinetics (PK), in addition to the reported role of poly-specificity, charge, and pI in mAb half-life. The likelihood of a mAb eliciting an immune response (immunogenicity) is linked to the non-human AA sequence content of the mAb, in addition to the way it is processed (digested) into smaller peptides by antigen-presenting cells (APCs), bound to the human leukocyte antigen II (HLA II), and presented to T-helper cells. The hydrophobicity and AA composition of mAb CDRs were often reported to affect its thermal stability.
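As a rough illustration of the low-level physicochemical parameters named in the Figure 4 caption, the sketch below computes two simple sequence-level descriptors: mean Kyte-Doolittle hydropathy and an approximate net charge. The function names and the example CDR-like sequence are hypothetical, and the charge rule (K/R count as +1, D/E as -1, histidine and termini ignored) is a simplifying assumption rather than a method taken from the review.

```python
# Kyte-Doolittle hydropathy scale (higher value = more hydrophobic residue).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq: str) -> float:
    """Average Kyte-Doolittle hydropathy over all residues of a sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def approx_net_charge(seq: str) -> int:
    """Crude net charge: +1 per K/R, -1 per D/E (assumption: His ignored)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


if __name__ == "__main__":
    cdr_like = "ARDYYGSSYFDY"  # hypothetical CDR-like sequence, for illustration
    print("mean hydropathy:", round(mean_hydropathy(cdr_like), 3))
    print("approx. net charge:", approx_net_charge(cdr_like))
```

Descriptors of this kind are only proxies; measured developability properties such as viscosity or aggregation depend on structure and formulation, which a sequence-only score does not capture.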
Figure 5.
Generative models can be trained on generic or custom-designed datasets to obtain sequence space representation and to generate new sequences for a variety of use cases in antibody design. Autoregressive (AR) models enable the generation of highly diverse proteins and can be used to obtain meaningful sequence embeddings, circumventing the need for hand-crafted features. Variational autoencoders (VAEs) and generative adversarial networks (GANs) have been employed in protein generation in a similar manner to generate functionally relevant leads, obtain biologically meaningful latent representations, and condition them on additional features (e.g., solubility). As such, these models can be employed in de novo generation of sequences, conditional or out-of-distribution generation, and optimization of multiple parameters. Evaluating the specificity (or any other design parameter of interest) of the in silico designed antibody sequences requires either computational or experimental oracles. As deep generative models output a large number of sequences, experimental prospective evaluation methods may not possess the time- and cost-efficiency to evaluate these sequences at scale, thus creating considerable demand for in silico oracles (Figure 5). Transfer learning may be leveraged to infer higher-order, functionally specific interactions from a small number of available sequences (low N). Integrating computational and experimental oracles or directly conditioning the generative models on additional features would enable high-yield multiparameter optimization of machine-learning engineered antibody sequences.
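The generate-then-evaluate loop described in the Figure 5 caption can be sketched in a few lines. The example below is a toy stand-in, not the deep generative models discussed in the review: a first-order Markov chain plays the role of an autoregressive generator, and a hydrophobic-fraction threshold plays the role of an in silico oracle. All function names, the training sequences, and the threshold are hypothetical.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # sentinel tokens marking sequence boundaries
HYDROPHOBIC = set("AILMFVW")


def train_markov(seqs):
    """Count first-order residue transitions, including start/end tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        prev = START
        for aa in seq + END:
            counts[prev][aa] += 1
            prev = aa
    return counts


def sample(counts, rng, max_len=30):
    """Sample one sequence residue by residue, each conditioned on the last."""
    out, prev = [], START
    while len(out) < max_len:
        residues = list(counts[prev])
        weights = [counts[prev][r] for r in residues]
        aa = rng.choices(residues, weights=weights)[0]
        if aa == END:
            break
        out.append(aa)
        prev = aa
    return "".join(out)


def toy_oracle(seq):
    """Toy in silico oracle (assumption): fraction of hydrophobic residues."""
    return sum(aa in HYDROPHOBIC for aa in seq) / len(seq)


def generate_and_filter(train_seqs, n, max_fraction, seed=0):
    """Generate n candidates and keep those the oracle scores acceptably."""
    rng = random.Random(seed)
    counts = train_markov(train_seqs)
    candidates = {sample(counts, rng) for _ in range(n)}
    return sorted(s for s in candidates if s and toy_oracle(s) <= max_fraction)


if __name__ == "__main__":
    # Hypothetical CDR-like training sequences, for illustration only.
    train = ["ARDYYGSSYFDY", "ARGGYSSGWYFDV", "AKDLGYCSGGSCYFDY"]
    for seq in generate_and_filter(train, n=50, max_fraction=0.4):
        print(seq, round(toy_oracle(seq), 2))
```

In practice the generator would be a trained AR model, VAE, or GAN and the oracle a validated predictor or an experimental assay; the point here is only the division of labor between the two.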
