Review
J Med Chem. 2025 Oct 9;68(19):19800-19827.
doi: 10.1021/acs.jmedchem.5c00920. Epub 2025 Sep 19.

Drug and Clinical Candidate Drug Data in ChEMBL

Fiona M I Hunter et al. J Med Chem.

Abstract

ChEMBL is a large-scale, open-access, FAIR database of bioactive molecules with drug-like properties. ChEMBL 35 contains 17,500 approved drugs and drugs that are progressing through the clinical development pipeline. Drug curation has been an integral part of the core offering of the ChEMBL database since its inception. This paper is a reference guide that presents the principles behind the way the ChEMBL drug data have been curated, so that data users can better understand the nature of the data. The drug data include information on names, synonyms and trade names; chemical structure or biological sequence; data sources; indications; mechanisms; warnings; and drug properties such as maximum phase of development, type of molecule, prodrug status and first approval. Because the drug data are integrated within a bioactivity resource, the data set can be used widely in drug discovery, AI and machine learning.

Figures

1
(a) Data for drugs and clinical candidate drugs in ChEMBL are curated from multiple sources of information. (b) Example of typical drug data for pioglitazone hydrochloride (CHEMBL1715).
2
Examples and explanation of the fields in the molecule_hierarchy table. (a) The compound (pentazocine hydrochloride, an approved salt drug form) is the dosed ingredient and differs from the parent compound (pentazocine). The parent compound is also the active_molregno (i.e., not a prodrug). (b) The compound (valacyclovir hydrochloride, a prodrug and an approved salt drug form) is the dosed ingredient and differs from the parent compound (valacyclovir). The parent compound is metabolized within the human body to give the pharmacologically active ingredient (acyclovir triphosphate). (c) The compound (cantuzumab mertansine, a prodrug and an antibody drug conjugate in Phase II clinical trials) is the dosed ingredient and is also the parent compound. The compound is metabolized within the human body to give the pharmacologically active ingredient (mertansine); the antibody component (cantuzumab) is the key to targeting the drug within a specific cell.
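The three cases in this caption can be sketched as rows of a small in-memory table. This is a minimal illustration, not the ChEMBL schema itself: the integer identifiers are made up, the field names `molregno` and `parent_molregno` are assumed to mirror the table's columns alongside the `active_molregno` named above, and the two derived properties are a reading of the caption rather than ChEMBL's own prodrug flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierarchyRow:
    """One row of a molecule_hierarchy-like table (all IDs are hypothetical)."""
    name: str
    molregno: int         # the compound itself (the dosed ingredient here)
    parent_molregno: int  # the parent form, e.g. with the salt removed
    active_molregno: int  # the pharmacologically active ingredient

    @property
    def is_parent(self) -> bool:
        # The compound is its own parent when both identifiers match.
        return self.molregno == self.parent_molregno

    @property
    def looks_like_prodrug(self) -> bool:
        # Per case (a): if the parent is also the active form, it is not a prodrug.
        return self.active_molregno != self.parent_molregno


rows = [
    # (a) salt form; parent pentazocine is also the active ingredient
    HierarchyRow("pentazocine hydrochloride", 1, 2, 2),
    # (b) salt form of a prodrug; active ingredient differs from the parent
    HierarchyRow("valacyclovir hydrochloride", 3, 4, 5),
    # (c) antibody drug conjugate; its own parent, active ingredient differs
    HierarchyRow("cantuzumab mertansine", 6, 6, 7),
]

# Map each compound name to (is_parent, looks_like_prodrug).
summary = {r.name: (r.is_parent, r.looks_like_prodrug) for r in rows}

for name, (is_parent, prodrug) in summary.items():
    print(f"{name}: is_parent={is_parent}, looks_like_prodrug={prodrug}")
```

Comparing the three identifiers separates salt forms from parents and flags rows where the active ingredient is a different molecule. Real queries would read these values from the database rather than hard-code them.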
3
Clinical trials pipeline. (a) Relevant clinical trials are extracted from ClinicalTrials.gov and stored in our internal database staging tables. (b) For each clinical trial, the intervention(s) and condition(s) are mapped to a compound identifier (molregno) and a disease identifier (EFO id). (c) The mapped data for clinical trials are migrated to the ChEMBL database (via our internal “Drugbase” database) for public release and are also delivered to the Open Targets Platform.
4
An overview of indications and therapeutic targets. (a) Counts of indications per maximum phase for drugs in ChEMBL 35. The plot shows the number of distinct MeSH identifiers for all drug forms in each compound family per maximum phase category. The inner labels are the categories of maximum phase. The outer labels are the MeSH headings that correspond to each MeSH identifier and the legend shows the number of distinct indications. The top 20 indications for each maximum phase category are shown. (b) Counts of therapeutic targets per action type for drugs in ChEMBL 35. The plot shows the number of distinct target identifiers (tid) for all drug forms in each compound family per action type. The inner labels are the categories of mechanism of action. The outer labels are the preferred target names that correspond to each target identifier and the legend shows the number of distinct targets. The top 20 targets per action type category are shown.
5
Drug warning data for ChEMBL 35. (a) The number of withdrawn drugs for each toxicity category. The labels show the count of distinct parent drugs. (b) The number of approved drugs that carry a black box warning for a severe or life-threatening adverse effect for each toxicity category. The labels show the count of distinct parent drugs.
6
Number of distinct parent drugs assigned to each maximum phase category and the source of this data.
7
Drug properties and other molecule features. (a) The molecule_type categories. (b) An example of the molecule features shown on the compound report card for the approved drug lenacapavir sodium (CHEMBL4802249). Each molecule icon is depicted as a yes (colored background) or no (uncolored background) unless it is for molecule_type, availability_type or chirality, as shown in (a), (c) and (d). *Note that all compounds in ChEMBL, regardless of whether they are drugs or clinical candidate drugs, are assigned the natural product flag, the chemical probe flag and the flag that shows rule-of-five compliance for a drug-like molecule. (c) The availability_type categories. (d) The chirality categories.
8
(a) Example of a prodrug nabumetone (CHEMBL1070) and its pharmacologically active ingredient (6-methoxy-2-naphthylacetic acid, CHEMBL1105). (b) Example of the approved drug omeprazole (CHEMBL1503) and its curated metabolic pathway showing its metabolites (including intermediate metabolites).
9
(a) Earliest year of approval or USAN application year for ChEMBL 35. The plot shows the cumulative count of distinct parent drugs between the years 1939 and 2023. (b) The distribution of molecule type as a function of the year of approval or the USAN application year. The plot shows the molecule type for the earlier of USAN application year or first approval year for each parent drug. The data are shown for selected time periods (top row and bottom left pie chart), and for drugs where neither a USAN application year nor a first approval year is assigned (“unknown” year in bottom right pie chart). Note that a parent drug is assigned the same molecule type as other drug forms within each compound family.
