Review

J Med Chem. 2025 Oct 9;68(19):19800-19827.

doi: 10.1021/acs.jmedchem.5c00920. Epub 2025 Sep 19.

Drug and Clinical Candidate Drug Data in ChEMBL

PMID: 40971497
PMCID: PMC12516679
DOI: 10.1021/acs.jmedchem.5c00920

Fiona M I Hunter et al. J Med Chem. 2025.

Affiliation

¹ European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Abstract

ChEMBL is a large-scale, open-access, FAIR database of bioactive molecules with drug-like properties. ChEMBL 35 contains 17,500 approved drugs and drugs that are progressing through the clinical development pipeline. Drug curation has been an integral part of the core offering of the ChEMBL database since its inception. This paper is a reference guide to the principles behind how the ChEMBL drug data have been curated, so that data users can better understand the nature of the data. The drug data include information on names, synonyms and trade names; chemical structure or biological sequence; data sources; indications; mechanisms; warnings; and drug properties such as maximum phase of development, type of molecule, prodrug status and first approval. The integrated nature of the drug data within the context of a bioactivity resource enables wide use of the data set in drug discovery, AI and machine learning.

Figures

Figure 1
(a) Data for drugs and clinical candidate drugs in ChEMBL are curated from multiple sources of information. (b) Example of typical drug data for pioglitazone hydrochloride (CHEMBL1715).

Figure 2
Examples and explanation of the fields in the *molecule_hierarchy* table. (a) The compound (pentazocine hydrochloride, an approved salt drug form) is the dosed ingredient and differs from the parent compound (pentazocine). The parent compound is also the *active_molregno* (i.e., not a prodrug). (b) The compound (valacyclovir hydrochloride, a prodrug and an approved salt drug form) is the dosed ingredient that differs from the parent compound (valacyclovir). The parent compound is metabolized within the human body to give the pharmacologically active ingredient (acyclovir triphosphate). (c) The compound (cantuzumab mertansine, a prodrug and an antibody drug conjugate in Phase II clinical trials) is the dosed ingredient and is also the parent compound. The compound is metabolized within the human body to give the pharmacologically active ingredient (mertansine) and the antibody component (cantuzumab) is the key to targeting the drug within a specific cell.
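
The three cases in this caption can be sketched as rows of a toy hierarchy table. This is a minimal illustration only: the integer identifiers are hypothetical, the column names follow the fields named in the caption plus an assumed `parent_molregno`, and the prodrug test (active ingredient differs from parent) is inferred from the caption wording rather than taken from the ChEMBL schema.

```python
from typing import NamedTuple


class HierarchyRow(NamedTuple):
    molregno: int         # the compound itself (here, the dosed ingredient)
    parent_molregno: int  # the parent compound (e.g. the salt-stripped form)
    active_molregno: int  # the pharmacologically active ingredient


# Hypothetical identifiers, for illustration only (not real ChEMBL molregnos).
NAMES = {
    1: "pentazocine hydrochloride",
    2: "pentazocine",
    3: "valacyclovir hydrochloride",
    4: "valacyclovir",
    5: "acyclovir triphosphate",
    6: "cantuzumab mertansine",
    7: "mertansine",
}

ROWS = [
    HierarchyRow(1, 2, 2),  # (a) salt form; parent is also the active ingredient
    HierarchyRow(3, 4, 5),  # (b) salt form of a prodrug; active differs from parent
    HierarchyRow(6, 6, 7),  # (c) dosed ingredient is its own parent; prodrug
]


def describe(row: HierarchyRow) -> dict:
    """Summarize one hierarchy row using the reading given in the caption."""
    return {
        "compound": NAMES[row.molregno],
        "parent": NAMES[row.parent_molregno],
        "active": NAMES[row.active_molregno],
        "is_parent": row.molregno == row.parent_molregno,
        # Assumption: a row is treated as a prodrug when active != parent.
        "is_prodrug": row.active_molregno != row.parent_molregno,
    }


for row in ROWS:
    print(describe(row))
```

In this sketch, only case (a) is reported as not a prodrug, and only case (c) has the dosed ingredient as its own parent.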

Figure 3
Clinical trials pipeline. (a) Relevant clinical trials are extracted from ClinicalTrials.gov and stored in our internal database staging tables. (b) For each clinical trial, the intervention(s) and condition(s) are mapped to a compound identifier (*molregno*) and a disease identifier (EFO id). (c) The mapped data for clinical trials are migrated to the ChEMBL database (via our internal “Drugbase” database) for public release and are also delivered to the Open Targets Platform.

Figure 4
An overview of indications and therapeutic targets. (a) Counts of indications per maximum phase for drugs in ChEMBL 35. The plot shows the number of distinct MeSH identifiers for all drug forms in each compound family per maximum phase category. The inner labels are the categories of maximum phase. The outer labels are the MeSH headings that correspond to each MeSH identifier and the legend shows the number of distinct indications. The top 20 indications for each maximum phase category are shown. (b) Counts of therapeutic targets per action type for drugs in ChEMBL 35. The plot shows the number of distinct target identifiers (*tid*) for all drug forms in each compound family per action type. The inner labels are the categories of mechanism of action. The outer labels are the preferred target names that correspond to each target identifier and the legend shows the number of distinct targets. The top 20 targets per action type category are shown.

Figure 5
Drug warning data for ChEMBL 35. (a) The number of withdrawn drugs for each toxicity category. The labels show the count of distinct parent drugs. (b) The number of approved drugs that carry a black box warning for a severe or life-threatening adverse effect for each toxicity category. The labels show the count of distinct parent drugs.

Figure 6
Number of distinct parent drugs assigned to each maximum phase category and the source of these data.

Figure 7
Drug properties and other molecule features. (a) The *molecule_type* categories. (b) An example of the molecule features shown on the compound report card for the approved drug lenacapavir sodium (CHEMBL4802249). Each molecule icon is depicted as a yes (colored background) or no (uncolored background), except for *molecule_type*, *availability_type* and *chirality*, which are shown in (a), (c) and (d). Note that all compounds in ChEMBL, whether or not they are drugs or clinical candidate drugs, are assigned the natural product flag, the chemical probe flag and the flag showing rule-of-five compliance for a drug-like molecule. (c) The *availability_type* categories. (d) The *chirality* categories.

Figure 8
(a) Example of a prodrug, nabumetone (CHEMBL1070), and its pharmacologically active ingredient (6-methoxy-2-naphthylacetic acid, CHEMBL1105). (b) Example of the approved drug omeprazole (CHEMBL1503) and its curated metabolic pathway showing its metabolites (including intermediate metabolites).

Figure 9
(a) Earliest year of approval or USAN application year for ChEMBL 35. The plot shows the cumulative count of distinct parent drugs between the years 1939 and 2023. (b) The distribution of molecule type as a function of the year of approval or the USAN application year. The plot shows the molecule type for the earlier of the USAN application year or the first approval year for each parent drug. The data are shown for selected time periods (top row and bottom left pie chart), and for drugs where neither a USAN application year nor a first approval year is assigned (“unknown” year in bottom right pie chart). Note that a parent drug is assigned the same molecule type as other drug forms within each compound family.
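
The "earlier of USAN application year or first approval year" rule and the cumulative count described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: the drug labels and years are invented, the field names `usan_year` and `first_approval` are assumed for the two year values, and drugs with neither year are placed in a separate "unknown" tally as the caption describes.

```python
from collections import Counter
from itertools import accumulate
from typing import Optional


def earliest_year(usan_year: Optional[int], first_approval: Optional[int]) -> Optional[int]:
    """Return the earlier of the two years, or None when neither is assigned."""
    years = [y for y in (usan_year, first_approval) if y is not None]
    return min(years) if years else None


# Invented parent drugs: (label, usan_year, first_approval). Not real data.
DRUGS = [
    ("drug_a", 1995, 1999),
    ("drug_b", None, 1985),
    ("drug_c", 2001, None),
    ("drug_d", None, None),
    ("drug_e", 1999, 1995),
]


def cumulative_counts(drugs):
    """Return ({year: cumulative count of parent drugs}, count with unknown year)."""
    per_year = Counter()
    unknown = 0
    for _label, usan_year, first_approval in drugs:
        year = earliest_year(usan_year, first_approval)
        if year is None:
            unknown += 1
        else:
            per_year[year] += 1
    years = sorted(per_year)
    running = accumulate(per_year[y] for y in years)
    return dict(zip(years, running)), unknown


counts, unknown = cumulative_counts(DRUGS)
print(counts, unknown)
```

Each parent drug contributes exactly once, at its earliest known year, so the final cumulative value plus the unknown tally equals the number of parent drugs.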

Grants and funding

WT_/Wellcome Trust/United Kingdom
