Activity Cliff-Informed Contrastive Learning for Molecular Property Prediction

Wan Xiang Shen et al. Res Sq [Preprint]. 2024 Dec 4: rs.3.rs-2988283. doi: 10.21203/rs.3.rs-2988283/v2.

Abstract

Modeling molecular activity and quantitative structure-activity relationships of chemical compounds is critical in drug design. Graph neural networks, which take molecular structures as input, have shown success in assessing the biological activity of chemical compounds, guiding the selection and optimization of candidates for further development. However, current models often overlook activity cliffs (ACs), cases where structurally similar molecules exhibit markedly different bioactivities, because their latent spaces are optimized primarily for structural features. Here, we introduce AC-awareness (ACA), an inductive bias designed to enhance molecular representation learning for activity modeling. ACA jointly optimizes metric learning in the latent space and task performance in the target space, making models more sensitive to ACs. We develop ACANet, an AC-informed contrastive learning approach that can be integrated with any graph neural network. Experiments on 39 benchmark datasets demonstrate that models trained with AC-informed representations of chemical compounds consistently outperform standard models in bioactivity prediction across both regression and classification tasks. AC-informed models also perform strongly when predicting pharmacokinetic and safety-relevant molecular properties. ACA paves the way toward activity-informed molecular representations, providing a valuable tool for the early stages of lead compound identification, refinement, and virtual screening.
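To make the joint optimization concrete, below is a minimal PyTorch-style sketch (not the authors' implementation) of an objective that combines a regression loss in the target space with a triplet metric loss in the latent space; the function and argument names, the distance choice, and the weighting by an awareness factor alpha (cf. Figures 3-5) are illustrative assumptions.

```python
import torch.nn.functional as F

def aca_objective(anchor_z, pos_z, neg_z, preds, labels, margins, alpha=1.0):
    """Hypothetical combined loss: L_mae + alpha * L_tsm."""
    # Task performance in the target space: mean absolute error (L_mae).
    l_mae = F.l1_loss(preds, labels)
    # Metric learning in the latent space (L_tsm): pull each positive
    # toward its anchor and push the negative away by a per-triplet margin.
    d_pos = F.pairwise_distance(anchor_z, pos_z)
    d_neg = F.pairwise_distance(anchor_z, neg_z)
    l_tsm = F.relu(d_pos - d_neg + margins).mean()
    # alpha = 0 recovers a standard model with no AC-awareness.
    return l_mae + alpha * l_tsm
```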

Keywords: Activity Cliff; Activity Cliff Awareness; Contrastive Learning; Graph neural networks.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Figure 1: Overview of the proposed ACA loss and the GNN-based ACANET for molecular activity prediction.
(a) ACANET training and ACA loss. During the training of ACANET, the ACA loss is calculated for each batch. The model mines activity cliff triplets (ACTs) on the fly and distinguishes them in the latent feature space. (b) Cliff cut-off scenarios for on-the-fly ACT mining. Two cliff cut-offs, cliff lower (cl) and cliff upper (cu), are used to identify the positives (P) and negatives (N) for each given anchor (A) compound to obtain the conditional ACTs. A special case arises when cl equals cu. High-value (HV) ACTs are conditional ACTs with triplet loss greater than zero. (c) Soft margin triplets in ACA loss. Each triplet has a unique margin calculated from the ground-truth labels rather than using a fixed margin for all triplets. These variable margins help the model adapt to the continuous activity labels in the regression task, improving its ability to accurately differentiate between compounds with similar and different activities.
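As a rough illustration of panels (b) and (c), the following sketch mines conditional ACTs from one batch of activity labels, reading "positive" as a label gap of at most cl from the anchor and "negative" as a gap of at least cu, and deriving each soft margin from the label gap itself; the cut-off defaults and the margin formula are assumptions, not the paper's exact rule.

```python
import itertools
import torch

def mine_acts(labels, cl=0.5, cu=1.0):
    """Return (anchor, positive, negative) index triplets and their
    label-derived soft margins for one batch of activity labels."""
    triplets, margins = [], []
    for a, p, q in itertools.permutations(range(len(labels)), 3):
        d_ap = abs(float(labels[a]) - float(labels[p]))  # anchor-positive gap
        d_an = abs(float(labels[a]) - float(labels[q]))  # anchor-negative gap
        if d_ap <= cl and d_an >= cu:    # conditional ACT per the cut-offs
            triplets.append((a, p, q))
            margins.append(d_an - d_ap)  # soft margin grows with the gap
    return triplets, torch.tensor(margins)
```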
Figure 2: Schematic illustration of the datasets and tasks involved in this study.
We evaluate the proposed method on four different cliff scenarios across 52 datasets. The first three scenarios involve protein target activity data, while the last pertains to ADMET data. The first two scenarios aim to predict the absolute activity values for a specific target, while the last two focus on predicting the relative differences of paired compounds. The third scenario involves the classification of cliff and non-cliff compounds, and the last scenario is a delta regression task. (a) The nine low-sample size and narrow scaffold (LSSNS) datasets. These datasets consist of compounds with limited and narrow scaffolds, typically developed by certain institutions or labs (Supplementary Table S1). These compounds are highly similar in structure, often being derivatives or analogs, but they exhibit significantly different activities to a specific target. (b) The 30 high-sample size and mixed scaffold (HSSMS) datasets. These 30 benchmark datasets were cleaned and compiled specifically to evaluate the activity prediction performance of ML models in cases with activity cliffs. Originally integrated from multiple studies in the ChEMBL database, they comprise compounds with diverse scaffolds (Supplementary Table S2). (c) The 3 MMP datasets of AC classification. These three benchmark datasets (Supplementary Table S3) were generated by defining an MMP as an MMP-cliff if ΔpKi ≥ 2 and as an MMP-nonCliff if ΔpKi ≤ 1. (d) The 10 ADMET datasets of delta prediction. These benchmark datasets consist of three absorption datasets, two distribution datasets, an excretion dataset, three metabolism datasets, and a toxicity dataset (Supplementary Table S4). The tasks predict the ADMET delta values of paired compounds.
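The labeling rule in panel (c) is simple enough to state as code; a minimal sketch, assuming pKi values are available for both compounds of a matched molecular pair:

```python
def label_mmp(pki_a, pki_b):
    """Classify a matched molecular pair by its potency difference."""
    delta = abs(pki_a - pki_b)
    if delta >= 2.0:
        return "MMP-cliff"
    if delta <= 1.0:
        return "MMP-nonCliff"
    return None  # pairs between the two cut-offs belong to neither class
```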
Figure 3: Comparison of model behavior and performance with or without AC-awareness on molecular activity prediction.
(a) Model training without AC-awareness. Illustration of model training where the latent space does not account for activity cliffs (ACs). The learning process only optimizes for regression loss, leading to suboptimal differentiation between anchor (A), positive (P), and negative (N) samples. (b) Model training with AC-awareness. Illustration of model training equipped with AC-awareness. The learning process includes the triplet soft margin (TSM) loss L_tsm, which helps better differentiate between A, P, and N samples in the latent space in addition to the regression loss L_mae. (c) Training history and performance of the PNA-based ACANET model with/without AC-awareness. Comparison of models with AC-awareness (α=1) and without AC-awareness (α=0, no TSM loss) over training epochs. The plots show training MAE loss, number of mined HV-ACTs (M), validation RMSE, and test RMSE on the PPARδ dataset (n = 1125) from HSSMS. Shaded areas represent the standard deviation from 10 repetitions with different random seeds in the dataset split.
Figure 4: Impact of AC-awareness across different GNN backbones.
The plots show the number of mined HV-ACTs (M) during model training and the final test RMSE on the PPARδ dataset. The medium-sized PPARδ dataset (n = 1125) from the HSSMS collection was split into training, validation, and test sets in a 6:2:2 ratio, repeated ten times with different random seeds. The ACANET models (with four different backbones) were trained with and without AC-awareness, using identical hyperparameters except for the loss function: L_mae for models without AC-awareness and L_mae + L_tsm for models with AC-awareness. (a, b, and c) Number of mined high-value activity cliff triplets (HV-ACTs) in the training, validation, and test sets, respectively, across four GNN backbones: Graph Convolutional Network (GCN), Graph Isomorphism Network (GIN), Graph Attention Network (GAT), and Principal Neighborhood Aggregation (PNA). Dashed lines represent the training process without AC-awareness, while solid lines represent the training process with AC-awareness. The significant decrease in the number of HV-ACTs in models with AC-awareness indicates improved learning and handling of activity cliffs as the latent feature space is optimized. (d) Final test set RMSE performance across the same GNN backbones, comparing models trained with and without AC-awareness. The test RMSE performance is based on the model selected by the validation set RMSE. The numerical RMSE values for the test and validation sets with and without AC-awareness are provided in Supplementary Table S6. Equipping AC-awareness consistently reduces RMSE across all backbones, with statistical significance indicated by asterisks (* = p<0.05, ** = p<0.01, *** = p<0.001, **** = p<0.0001).
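A sketch of how the count M could be computed, reusing the hypothetical mine_acts helper from the Figure 1 sketch: a conditional ACT counts as high value while its soft-margin triplet loss is still positive under the current embeddings.

```python
import torch

def count_hv_acts(embeddings, labels, cl=0.5, cu=1.0):
    """Count mined triplets whose soft-margin triplet loss is positive."""
    triplets, margins = mine_acts(labels, cl, cu)  # sketched under Figure 1
    m = 0
    for (a, p, q), margin in zip(triplets, margins):
        d_ap = torch.dist(embeddings[a], embeddings[p])  # anchor-positive
        d_an = torch.dist(embeddings[a], embeddings[q])  # anchor-negative
        if d_ap - d_an + margin > 0:  # triplet still violated => high value
            m += 1
    return m
```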
Figure 5: Ablation studies on the impact of cliff cut-offs and the awareness factor on model performance in the BRAF dataset.
(a) Effects of varying the cliff cut-off parameters cl and cu while keeping the awareness factor α fixed. The number of mined conditional ACTs (M) and validation RMSE are displayed, illustrating how different cl and cu values affect both quantities. The pink dots indicate where M is zero, essentially representing cases without AC-awareness. There is a clear negative correlation between the number of conditional ACTs in the training set and validation RMSE. The corresponding correlations for the other four LSSNS datasets (PLK1, IDO1, USP7, and RIP2) are shown in Supplementary Figure S2. (b) Impact of varying the awareness factor α while keeping cl and cu constant. The results show the number of HV-ACTs and validation RMSE over training epochs for different α values, highlighting improved performance with higher α values.
Figure 6: Comparison of the original chemical space and the latent space learned by models with/without AC-awareness in the BRAF dataset.
(a) Original chemical space of the BRAF dataset. Two example triplets (Triplet-1 and Triplet-2) are shown with their anchor, positive, and negative compounds. The compounds are color-coded based on their pIC50 values. Label incoherence index (LII) = 0.490. (b) Latent embedding without AC-awareness (α=0). Principal Component Analysis (PCA) of the latent space without AC-awareness, showing poor separation between anchor/positive and negative samples. LII = 0.290. (c) Latent embedding with AC-awareness (α=1). PCA of the latent space with AC-awareness, showing better separation between anchor/positive and negative samples. LII = 0.199.
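Views like panels (b) and (c) can be reproduced generically with scikit-learn; this is not the authors' plotting code, and the array shapes are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_pca(embeddings, pic50):
    """embeddings: (n_compounds, latent_dim); pic50: (n_compounds,)."""
    coords = PCA(n_components=2).fit_transform(embeddings)
    sc = plt.scatter(coords[:, 0], coords[:, 1], c=pic50, s=12)
    plt.colorbar(sc, label="pIC50")  # color-code by activity, as in panel (a)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()
```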
Figure 7: Performance comparison of ACA loss-based ACANET against conventional ML approaches on the 30 HSSMS benchmark datasets.
(a) Boxplot of RMSE performance on the entire test set of the 30 HSSMS datasets, where lower RMSE values indicate better performance. (b) Boxplot of RMSE performance on the cliff-specific test set of the 30 HSSMS datasets, where lower RMSE values indicate better performance. The significance of the RMSE improvements by ACANET compared to conventional ML models (SVM, RF, GBM, MLP, and KNN) using ECFP as input was evaluated using paired t-tests. Statistical significance is indicated by asterisks (* = p<0.05, ** = p<0.01, *** = p<0.001, **** = p<0.0001) or "n.s." if not significant.
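The paired t-tests can be reproduced in principle with SciPy by pairing per-dataset RMSEs across the 30 benchmarks; the arrays below are random placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rmse_acanet = rng.uniform(0.6, 0.9, size=30)                  # placeholder values
rmse_baseline = rmse_acanet + rng.uniform(0.0, 0.1, size=30)  # placeholder values
t_stat, p_value = stats.ttest_rel(rmse_acanet, rmse_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```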

