Review
Nat Commun. 2024 Jul 5;15(1):5640.
doi: 10.1038/s41467-024-49777-x.

A data science roadmap for open science organizations engaged in early-stage drug discovery

Kristina Edfeldt et al. Nat Commun. 2024.

Abstract

The Structural Genomics Consortium is an international open science research organization focused on accelerating early-stage drug discovery, namely hit discovery and optimization. We, like many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how best to benefit from recent advances in AI, and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.
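Two of the modeling considerations named in the abstract, defining training and test sets and estimating prediction uncertainty, can be illustrated with a minimal sketch. It is not the authors' method. It assumes that each record carries a grouping label (the hypothetical `series` field, standing in for a chemical series or scaffold) and it uses a deliberately trivial "model" (the mean of a bootstrap resample) so that the ensemble spread can serve as a rough uncertainty estimate. All function and field names are illustrative.

```python
import random
import statistics


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group (e.g. a chemical series) spans both sets."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


def bootstrap_ensemble(train_values, n_models=50, seed=0):
    """Toy ensemble: each 'model' predicts the mean of one bootstrap resample.

    Returns (ensemble mean, standard deviation across ensemble members).
    """
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(train_values) for _ in train_values]
        preds.append(statistics.fmean(sample))
    return statistics.fmean(preds), statistics.stdev(preds)


# Hypothetical records: 20 compounds spread over 5 series.
records = [
    {"id": i, "series": f"S{i % 5}", "pIC50": 5.0 + 0.1 * i} for i in range(20)
]
train, test = group_split(records, "series")
mean_pred, spread = bootstrap_ensemble([r["pIC50"] for r in train])
```

Splitting by group rather than by individual compound keeps close analogues from leaking between the two sets, which would otherwise tend to inflate the measured performance.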

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Data management workflows for protein production and chemical library screening.
Controlled vocabulary and descriptors are used in data management workflows for protein production (top) and DEL screening (bottom). Materials & methods (M&M), protocols and results are recorded in an electronic lab notebook (ELN). DEL: DNA encoded library; ID: identifier; LIMS: laboratory information management system; Tm: melting temperature.
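As a rough illustration of how a controlled vocabulary can be enforced when a record enters a data management workflow, the sketch below checks a record against a small set of allowed terms. The field names and the vocabulary entries are hypothetical placeholders, not taken from the paper. In practice the terms would come from a shared ontology.

```python
# Hypothetical controlled vocabulary; real terms would come from a shared ontology.
CONTROLLED_VOCABULARY = {
    "expression_host": {"E. coli", "Sf9", "HEK293"},
    "assay_type": {"DSF", "SPR", "DEL screen"},
}
# Hypothetical set of fields every record must carry.
REQUIRED_FIELDS = ("protein_id", "expression_host", "assay_type")


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}={value!r} is not a controlled term")
    return problems


good = {"protein_id": "P001", "expression_host": "E. coli", "assay_type": "DSF"}
bad = {"protein_id": "P002", "expression_host": "e.coli"}
good_problems = validate_record(good)
bad_problems = validate_record(bad)
```

Rejecting free-text variants such as `e.coli` at entry time is what lets records from different laboratories be merged later without manual cleanup.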
Fig. 2. Workflow for data archiving and dissemination.
The data archiving and dissemination workflow is a multistep process that includes data ingestion and the creation of well-documented datasets, which are made accessible and interoperable so that the scientific community can use them efficiently.
Fig. 3. Workflow for computational molecular property prediction.
Computational workflow for predicting molecular properties, starting with molecular structure encoding, followed by model selection and assessment, and concluding with the application of models to virtually screen libraries and rank these molecules for potential experimental validation. The process can be cyclical, allowing iterative refinement of models based on empirical data. ADMET: absorption, distribution, metabolism, and excretion–toxicity. ECFP: Extended Connectivity Fingerprints. CDDD: Continuous Data-Driven Descriptor, a type of molecular representation derived from SMILES strings. Entropy: Shannon entropy descriptors.
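The encode, score and rank steps of this workflow can be sketched in a few lines. The example below is an assumption-laden toy, not the descriptors used in the paper: it encodes each SMILES string by the Shannon entropy of its character distribution, H = -Σ p_i log2 p_i, and then ranks a small library with a scoring function supplied by the caller. In a real workflow the scoring function would be a trained model, and the encoding would typically be a fingerprint such as ECFP.

```python
import math
from collections import Counter


def shannon_entropy(smiles: str) -> float:
    """Shannon entropy (in bits) of the character distribution of a SMILES string."""
    counts = Counter(smiles)
    total = len(smiles)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rank_library(library, score_fn):
    """Return the library sorted from highest to lowest score."""
    return sorted(library, key=score_fn, reverse=True)


# Hypothetical three-compound library: butane, methanol, phenol.
library = ["CCCC", "CO", "c1ccccc1O"]

# Placeholder scorer: the entropy itself stands in for a model prediction.
ranked = rank_library(library, shannon_entropy)
```

Passing the scorer as an argument keeps the ranking step independent of the model, so a refined model from a later cycle can be swapped in without changing the rest of the pipeline.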
