Nat Commun. 2025 May 3;16(1):4121.
doi: 10.1038/s41467-025-59418-6.

Labels as a feature: Network homophily for systematically annotating human GPCR drug-target interactions

Frederik G Hansson et al. Nat Commun. 2025.

Abstract

Machine learning has revolutionized drug discovery by enabling the exploration of vast, uncharted chemical spaces essential for discovering novel patentable drugs. Despite the critical role of human G protein-coupled receptors in FDA-approved drugs, exhaustive in-distribution drug-target interaction testing across all pairs of human G protein-coupled receptors and known drugs is rare due to significant economic and technical challenges. This often leaves off-target effects unexplored, which poses a considerable risk to drug safety. In contrast to the traditional focus on out-of-distribution exploration (drug discovery), we introduce a neighborhood-to-prediction model termed Chemical Space Neural Networks that leverages network homophily and training-free graph neural networks with labels as features. We show that Chemical Space Neural Networks' ability to make accurate predictions strongly correlates with network homophily. Thus, labels as features strongly increase a machine learning model's capacity to enhance in-distribution prediction accuracy, which we show by integrating labeled data during inference. We validate these advancements in a high-throughput yeast biosensing system (3773 drug-target interactions, 539 compounds, 7 human G protein-coupled receptors) to discover novel drug-target interactions for FDA-approved drugs and to expand the general understanding of how to build reliable predictors to guide experimental verification.

Conflict of interest statement

Competing interests: J.D.K., L.G.H. and M.K.J. are inventors on pending patent applications (patent applicant: Technical University of Denmark; application number: PCT/EP2023/063481). L.G.H., J.D.K. and M.K.J. have financial interests in Biomia. J.D.K. also has financial interests in Amyris, Lygos, Demetrix, Napigen, Apertor Pharmaceuticals, Maple Bio, Ansa Biotechnologies, Berkeley Yeast and Zero Acre Farms. All other authors have no competing interests.

Figures

Fig. 1
Fig. 1. Introducing Network Homophily and Transductive Node Classification.
a Collected data on bioactivity classes for 186 K unique hGPCR-targeting compounds across 128 hGPCRs cover only 369 K of 23.9 M possible activities, which indicates how sparse the annotation in public databases is. b A visual example to introduce network homophily: “similarity breeds connection”. c Inference for CSNN, illustrating how labels as features (LaFs) are used. First, the query compound is encoded as a bit-vector; a one-vs-all database search then returns chemically similar compounds together with their labels. A neighbourhood graph Gi is constructed and fed to the ML network (fθ). d Instead of transductive node classification, we simplify the task to transductive graph classification by introducing a one-hop directed graph (incident on the query) with neighbourhood LaFs as edge attributes. e The CSNN framework can be viewed as a composed message-passing neural network (MPNN): first at the molecule level (MPNN on atoms) and then again at the neighbourhood level, including LaFs.
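The lookup in panel c can be sketched as follows. This is a minimal illustration, not the authors' implementation: the caption says only that the query is a bit-vector and that the search returns chemically similar compounds, so the Tanimoto coefficient, the function names `tanimoto` and `neighbourhood`, the cutoff `k` and the toy fingerprints are all assumptions made for the example.

```python
import numpy as np

def tanimoto(query, database):
    """Tanimoto similarity between one bit-vector and every row of a bit-matrix.

    Assumption: the source does not name its similarity measure; Tanimoto is
    used here only because it is a common choice for binary fingerprints.
    """
    query = np.asarray(query, dtype=bool)
    database = np.asarray(database, dtype=bool)
    intersection = (database & query).sum(axis=1)
    union = (database | query).sum(axis=1)
    # All-zero fingerprints have an empty union; define their similarity as 0.
    return np.where(union > 0, intersection / np.maximum(union, 1), 0.0)

def neighbourhood(query, database, labels, k=3):
    """One-vs-all search: return (index, similarity, label) for the k nearest rows."""
    sims = tanimoto(query, database)
    order = np.argsort(-sims, kind="stable")[:k]
    return [(int(i), float(sims[i]), labels[i]) for i in order]

# Hypothetical 4-bit fingerprints and labels, for illustration only.
database = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
])
labels = ["agonist", "agonist", "antagonist", "no effect"]
query = np.array([1, 1, 0, 0])

for index, similarity, label in neighbourhood(query, database, labels, k=3):
    print(index, round(similarity, 3), label)
```

The returned triples are the raw material for a one-hop neighbourhood graph: each neighbour becomes an edge incident on the query, and its label is carried as an edge attribute.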
Fig. 2
Fig. 2. Benchmarking Bioactivity Label Prediction with LaFs.
a The adjacency matrix (a subset of the 186 K × 186 K possible connections) parameterises the all-vs-all chemical space neighbourhood. b Querying the CSN can return data on related compounds, already illustrating graph homophily (agonists are overrepresented and the true label for OPRK1 is agonist). c Training-free prediction metrics using the most frequent LaF in the neighbourhood demonstrate the strong network homophily. d Structuring the dataset into an accessible format and CSNN architecture. e Prediction metrics on the test set for ML methods: without LaFs (MLP, RF), with LaFs (MLP + N, CSNN), and a full MPNN on chemical neighbourhoods (CSNN), compared to the training-free prediction (Argmax). Error bars represent ± one standard deviation. f When the class label is confident (probability > 0.8), the CSNN method (referred to as NNθ6) produces high-quality class discrimination. g As the NNθ6 class label is filtered by logits probability, the performance metrics tend toward perfect predictions. h The NNθ128 CSNN model (one forward pass for class labels across all 128 hGPCRs using LaFs) shows strong performance metrics for most hGPCRs. Error bars represent ± one standard deviation. The two models are contrasted explicitly in terms of their input-output parameters in Supplementary Fig. S2. Source data are provided in the Source Data file for panels (c, e, g and h).
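The training-free Argmax baseline of panel c amounts to a majority vote over the neighbourhood labels. The sketch below assumes that reading; the names `argmax_label` and `argmax_accuracy`, the tie-breaking rule and the toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter

def argmax_label(neighbour_labels):
    """Predict the most frequent label among the neighbours (no training).

    Ties are broken by first occurrence, an arbitrary choice made for this
    sketch. Returns None when the neighbourhood is empty.
    """
    if not neighbour_labels:
        return None
    return Counter(neighbour_labels).most_common(1)[0][0]

def argmax_accuracy(examples):
    """Fraction of (neighbour_labels, true_label) pairs predicted correctly."""
    if not examples:
        return 0.0
    correct = sum(argmax_label(nbrs) == true for nbrs, true in examples)
    return correct / len(examples)

# Hypothetical neighbourhoods, for illustration only.
examples = [
    (["agonist", "agonist", "antagonist"], "agonist"),
    (["antagonist", "antagonist"], "antagonist"),
    (["agonist", "no effect", "no effect"], "agonist"),
]
print(argmax_label(examples[0][0]))
print(argmax_accuracy(examples))
```

Under strong homophily this vote is already accurate, which is why it serves as the reference point for the trained models in panel e.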
Fig. 3
Fig. 3. Benchmarking LaFs in the Regression Setting.
a An illustrated training example: node 0 is the query, nodes 1–3 are neighbouring compounds and their bioactivity label (LaF). The mean over the neighbourhood is 6.20, which is close to the true label of 6.14 (units: log10(Ki)). b Training-free mean-value (〈Yneigh〉) predictions on the test set using LaFs across all hGPCRs in the pdCSM dataset show strong performance metrics. c Comparing the published pdCSM method (RF on RDKit compound representations) with LaFs (CSNN) on the same compound representations. In most cases, LaFs improve the Pearson correlation coefficient. The 〈Yneigh〉 prediction closely tracks the top-performing method. d Illustrating the effect of a low-capacity Ridge regression model without (column: Ridge) and with LaFs (column: Ridge + N). The input dimension differs only by one column (the mean value over the neighbourhood). e Interestingly, the test metrics on a given hGPCR correlate strongly with the Pearson correlation under the homophily assumption (mean value is a good prediction). The LaF methods outperform non-LaF methods (compound-to-prediction architectures). The p-value of the Pearson correlation coefficient is calculated using the two-sided t-statistic test. Source data are provided in the Source Data file for panels (b, c, and e).
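Panels a and d describe two simple operations: predicting the query's value as the neighbourhood mean, and appending that mean as one extra input column for a downstream regressor. A minimal sketch follows, with loud caveats: the three neighbour values are invented so that their mean matches the 6.20 quoted in the caption (the individual values are not given in the source), and the helper names and feature matrix are hypothetical.

```python
import numpy as np

def neighbourhood_mean(neighbour_values):
    """Training-free regression: predict the mean label of the neighbours."""
    return float(np.mean(neighbour_values))

def add_neighbourhood_column(features, neighbour_means):
    """Append the neighbourhood mean as one extra feature column ("+ N")."""
    features = np.asarray(features, dtype=float)
    neighbour_means = np.asarray(neighbour_means, dtype=float)
    return np.column_stack([features, neighbour_means])

# Hypothetical neighbour labels chosen to reproduce the caption's mean of 6.20.
neighbour_values = [6.0, 6.2, 6.4]
true_label = 6.14  # true label quoted in the caption

prediction = neighbourhood_mean(neighbour_values)
print(round(prediction, 2), round(abs(prediction - true_label), 2))

# Hypothetical compound descriptors: two compounds, three features each.
features = np.array([[0.1, 0.5, 1.0],
                     [0.3, 0.2, 0.0]])
augmented = add_neighbourhood_column(features, [prediction, 5.8])
print(augmented.shape)
```

The augmented matrix differs from the original by exactly one column, matching the caption's description of how the Ridge and Ridge + N inputs differ.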
Fig. 4
Fig. 4. Experimental Validation using a Developed HT-yeast Biosensing Platform.
a Overview of the experimental HT-yeast platform based on designs by Shaw et al. 2019. The drug may induce the hGPCR, and the signal is measured by a reporter in relative luminescence units (RLU). b Dose-response curves (DRCs) reveal large differences in dynamic range and sensitivity (see all hGPCR DRCs in Supplementary Fig. S21). c Overview of all-vs-all 3773 DTIs measured (539 compounds and 7 hGPCRs, left) with top hits and their associated Z-score (right), which recovers known agonists. d Using the CSN as a knowledge graph, we compare the signal response in HT-yeast of DTI with that found in mammalian systems and find a high correspondence. e Heatmap of predicted label and the experimental Z-score for DTIs. Argmax predictions and neural network predictions can correctly filter out no-effect DTIs. Naturally, the assay cannot capture some of the predicted labels in the Z-score and thus does not correlate with the Z-score bin. f CSNN can effectively be used to enrich the positive hits (Z-score > 3), with high specificity (0.948) and an enriched hit rate of 18%. A predicted hit was defined as a CSNN prediction of Ki < 100 nM with a label other than 'No Effect'. The precision is markedly low (0.182), showing the difficulty of predicting positive hits. g Novel hits without prior art in literature. We use the CSN as a knowledge graph, sort for significant hits (∣Z-score∣ > 3), without any database knowledge in the chemical neighbourhood. We further validated these by a manual literature search. See additionally Supplementary Tables S4 and S12. Source data are provided in the Source Data file for panels (bd).
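The hit definitions in panel f translate directly into a small scoring routine. The sketch below uses the thresholds stated in the caption (predicted Ki < 100 nM with a label other than 'No Effect'; experimental Z-score > 3) and the standard definitions of precision and specificity. The function names and the five toy records are hypothetical and are not data from the study.

```python
def is_predicted_hit(predicted_ki_nm, predicted_label):
    """Predicted hit: Ki below 100 nM and a label other than 'No Effect'."""
    return predicted_ki_nm < 100.0 and predicted_label != "No Effect"

def is_experimental_hit(z_score):
    """Experimental positive hit: Z-score above 3."""
    return z_score > 3.0

def precision_and_specificity(predicted, observed):
    """Return (precision, specificity); an entry is None if its denominator is 0."""
    tp = sum(p and o for p, o in zip(predicted, observed))
    fp = sum(p and not o for p, o in zip(predicted, observed))
    tn = sum(not p and not o for p, o in zip(predicted, observed))
    precision = tp / (tp + fp) if (tp + fp) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return precision, specificity

# Hypothetical records: (predicted Ki in nM, predicted label, measured Z-score).
records = [
    (12.0, "Agonist", 5.1),
    (80.0, "Antagonist", 0.4),
    (450.0, "Agonist", 0.2),
    (30.0, "No Effect", 1.0),
    (900.0, "Agonist", 4.2),
]
predicted = [is_predicted_hit(ki, label) for ki, label, _ in records]
observed = [is_experimental_hit(z) for _, _, z in records]
print(precision_and_specificity(predicted, observed))
```

Precision here is the fraction of predicted hits that are confirmed experimentally, which is why the caption's precision of 0.182 corresponds to its enriched hit rate of 18%.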
