MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information

Wouter Heyndrickx¹, Lewis Mervin², Tobias Morawietz³, Noé Sturm⁴, Lukas Friedrich⁵, Adam Zalewski⁶, Anastasia Pentina⁷, Lina Humbeck⁸, Martijn Oldenhof⁹, Ritsuya Niwayama¹⁰, Peter Schmidtke¹¹, Nikolas Fechner⁴, Jaak Simm⁹, Adam Arany⁹, Nicolas Drizard¹², Rama Jabal¹², Arina Afanasyeva¹³, Regis Loeb⁹, Shlok Verma¹⁴, Simon Harnqvist¹⁴, Matthew Holmes¹⁴, Balazs Pejo¹⁵, Maria Telenczuk¹⁶, Nicholas Holway⁴, Arne Dieckmann¹⁷, Nicola Rieke¹⁸, Friederike Zumsande⁶, Djork-Arné Clevert⁷, Michael Krug⁵, Christopher Luscombe¹⁴, Darren Green¹⁴, Peter Ertl⁴, Peter Antal¹⁹, David Marcus¹⁴, Nicolas Do Huu¹², Hideyoshi Fuji¹³, Stephen Pickett¹⁴, Gergely Acs¹⁵, Eric Boniface²⁰, Bernd Beck⁸, Yax Sun²¹, Arnaud Gohier¹⁰, Friedrich Rippmann⁵, Ola Engkvist²², Andreas H Göller³, Yves Moreau⁹, Mathieu N Galtier²³, Ansgar Schuffenhauer⁴, Hugo Ceulemans¹

Affiliations

¹ Janssen Pharmaceutica NV, Turnhoutseweg 30, Beerse 2340, Belgium.
² AstraZeneca R&D, Biomedical Campus, 1 Francis Crick Ave, Cambridge CB2 0SL, U.K.
³ Bayer Pharma AG, Global Drug Discovery, Chemical Research, Computational Chemistry, Aprather Weg 18 a, Wuppertal 42096, Germany.
⁴ Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland.
⁵ Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany.
⁶ Amgen Research (Munich) GmbH, Staffelseestraße 2, Munich 81477, Germany.
⁷ Bayer AG, Machine Learning Research, Research & Development, Pharmaceuticals, Berlin 10117, Germany.
⁸ BI Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, Biberach an der Riss 88397, Germany.
⁹ KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium.
¹⁰ Institut de recherches Servier, 125 chemin de ronde Croissy-sur-Seine, Île-de-France 78290, France.
¹¹ Discngine, Avenue Ledru Rollin 79, Paris 75012, France.
¹² Iktos, 65 rue de Prony, Paris 75017, France.
¹³ Modality Informatics Group, Digital Research Solutions, Advanced Informatics & Analytics, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan.
¹⁴ GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
¹⁵ Budapest University of Technology and Economics, Department of Networked Systems and Services, Műegyetem rkp. 3, Budapest 1111, Hungary.
¹⁶ Owkin, 12 Rue Martel, Paris 75010, France.
¹⁷ Bayer AG, API Production, Product Supply, Pharmaceuticals, Ernst-Schering-Straße 14, Bergkamen 59192, Germany.
¹⁸ NVIDIA GmbH, Floessergasse 2, Munich 81369, Germany.
¹⁹ Budapest University of Technology and Economics, Department of Measurement and Information Systems, Műegyetem rkp. 3, Budapest 1111, Hungary.
²⁰ Substra Foundation - Labelia Labs, 4 rue Voltaire, Nantes 44000, France.
²¹ Amgen Research, 1 Amgen Center Drive, Thousand Oaks, California 92130, United States.
²² AstraZeneca, Molecular AI, Discovery Sciences, R&D, Pepparedsleden 1, Mölndal 431 50, Sweden.
²³ Owkin, 4 Rue Voltaire, Nantes 44000, France.

PMID: 37642660
PMCID: PMC11005050
DOI: 10.1021/acs.jcim.3c00799

Wouter Heyndrickx et al. J Chem Inf Model. 2024 Apr 8;64(7):2331-2344. doi: 10.1021/acs.jcim.3c00799. Epub 2023 Aug 29.

Abstract

Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.

Conflict of interest statement

The authors declare the following competing financial interest(s): M.N.G. and M.T. are employed by and own stock in Owkin, the company commercializing the underlying federated learning platform, which is based on the open-source Substra software. The remaining authors have no conflicts of interest to declare.

Figures

**Figure 1**
Conceptual representation of the federated setup with two partners of different size, illustrating cross-end point and cross-compound federation. In practice, the number of end points amenable to cross-compound federation is far lower than to cross-end point federation, due to challenges in reconciliation across partners. Identical structures at different partners get identically represented, allowing implicit mapping through the machine learning algorithm without exchanging any sensitive information. Permission has been obtained to use the MELLODDY logo.
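
The idea of identical structures being identically represented at every partner can be illustrated with a toy sketch. It assumes that all partners agree beforehand on one deterministic featurization function and on one structure standardization, so the same input string is produced locally at each site. The `featurize` function, the 3-character fragment hashing, and `N_BITS` are hypothetical stand-ins chosen for brevity; they are not the descriptors used in MELLODDY.

```python
import hashlib

N_BITS = 64  # hypothetical fingerprint length agreed on by all partners


def featurize(structure: str, n_bits: int = N_BITS) -> frozenset:
    """Toy deterministic featurizer (an assumption, not the MELLODDY method).

    Every 3-character fragment of the standardized structure string is hashed
    to a bit index. No randomness or partner-specific state is involved, so
    the same string yields the same bit set wherever the function runs.
    """
    bits = set()
    for i in range(max(len(structure) - 2, 1)):
        fragment = structure[i:i + 3]
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return frozenset(bits)


# Two partners featurize the same compound independently and locally.
partner_a = featurize("CC(=O)Oc1ccccc1C(=O)O")
partner_b = featurize("CC(=O)Oc1ccccc1C(=O)O")
other = featurize("CCO")

# The representations coincide without any structure being exchanged.
print(partner_a == partner_b)
```

Because the inputs to a shared model are aligned in this way, the model can relate the same compound across partners implicitly, which is the point the caption makes.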

**Figure 2**
Overview of the different training modalities with layer sizes commonly optimal for partners for the federated setting (see SI for extensive optimal hyperparameters).

**Figure 3**
Performance deltas (between multi- and single-partner runs) based on median (top) and 90th percentile (bottom) across companies for their respective optimal model (either with or without auxiliary data).

**Figure 4**
Classification performance results from the federated run. (A) Effect of multipartner (MP) and auxiliary data (*) on the median AUC-PR task performance for 5 smaller (dashed lines) and 5 larger (solid lines) partners. (B) Distribution of median AUC-PR task performance (RIPtoP(AUC-PR)) over partners. (C–F) Difference between the empirical cumulative distribution functions (CDFs) from single- and multipartner models for different assay types based on AUC-ROC. The difference between the cumulative proportion of tasks in the multi- versus single-partner models (y-axis) is shown for the binned performance (x-axis). The line plots indicate the median probability difference for that bin over all partners. The interquartile ranges are indicated by the shaded envelope. Note that AUC-ROC is shown here due to its stable baseline of 0.5 for a random classifier.
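
The quantity plotted in panels C–F can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the sign convention (multi-partner minus single-partner), the bin grid, the function names, and the per-task scores are all made up for the example and are not taken from the paper.

```python
import numpy as np


def ecdf_at(values, grid):
    """Empirical CDF: fraction of values less than or equal to each grid point."""
    values = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(values, grid, side="right") / values.size


def cdf_difference(multi, single, grid):
    """ECDF of multi-partner task scores minus ECDF of single-partner scores."""
    return ecdf_at(multi, grid) - ecdf_at(single, grid)


def summarize_over_partners(per_partner_diffs):
    """Median and interquartile range of the difference, per bin, over partners."""
    stacked = np.vstack(per_partner_diffs)
    median = np.median(stacked, axis=0)
    q25, q75 = np.percentile(stacked, [25, 75], axis=0)
    return median, q25, q75


# Hypothetical AUC-ROC bins starting at the random-classifier baseline of 0.5.
grid = np.linspace(0.5, 1.0, 6)

# Hypothetical per-task AUC-ROC values for one partner.
single = [0.55, 0.62, 0.71, 0.81]
multi = [0.61, 0.68, 0.75, 0.86]

diff = cdf_difference(multi, single, grid)
median, q25, q75 = summarize_over_partners([diff, diff])
print(diff)
```

With this sign convention, a negative value at a bin means that fewer multi-partner tasks fall at or below that performance level, that is, the task distribution has shifted toward higher scores.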

**Figure 5**
Classification applicability domain results from the federated run. (A) Effect of multipartner (MP) and auxiliary data (*) on the median task performance for 5 smaller (dashed lines) and 5 larger (solid lines) partners. (B) Distribution of median task performance (RIPtoP(CE)) over partners. (C–F) Difference between the empirical cumulative distribution functions (CDFs) from single- and multipartner models for different assay types based on CE. The difference between the cumulative proportion of tasks in the multi- versus single-partner models (y-axis) is shown for the binned performance (x-axis). The line plots indicate the median probability difference for a bin over partners. The interquartile ranges are indicated with the shaded envelope.

**Figure 6**
Regression performance results from the federated run. (A) Effect of multipartner (MP) and auxiliary data (*) on the median task performance for 5 smaller (dashed lines) and 5 larger (solid lines) partners. (B) Distribution of median task performance (RIPtoP(R²)) over partners. (C–F) Difference between the empirical cumulative distribution functions (CDFs) from single- and multipartner models for different assay types based on R². The difference between the cumulative proportion of tasks in the multi- versus single-partner models (y-axis) is shown for the binned performance (x-axis). The line plots indicate the median probability difference for a bin over partners. The interquartile ranges are indicated with the shaded envelope.
