Molecules. 2022 Apr 13;27(8):2513.
doi: 10.3390/molecules27082513.

A Consensus Compound/Bioactivity Dataset for Data-Driven Drug Design and Chemogenomics

Laura Isigkeit et al. Molecules.

Abstract

Publicly available compound and bioactivity databases provide an essential basis for data-driven applications in life-science research and drug design. By analyzing several bioactivity repositories, we discovered differences in compound and target coverage that argue for the combined use of data from multiple sources. Using data from ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs, we assembled a consensus dataset focusing on small molecules with bioactivity on human macromolecular targets. This improved the coverage of compound space and targets, and enabled an automated comparison and curation of structural and bioactivity data to reveal potentially erroneous entries and increase confidence. The consensus dataset comprises more than 1.1 million compounds and over 10.9 million bioactivity data points, with annotations on assay type and bioactivity confidence, providing a useful ensemble for computational applications in drug design and chemogenomics.
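The idea of an automated cross-source comparison can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout, the function name `compare_sources`, and the 1.0 log-unit disagreement threshold are all assumptions chosen for illustration. Values reported for the same compound/target pair in different databases are grouped, and pairs whose values disagree by more than the threshold are flagged for review.

```python
from collections import defaultdict
from statistics import mean


def compare_sources(records, max_spread=1.0):
    """Group activity log-values by (compound, target) and flag disagreements.

    `records` is an iterable of (compound_id, target_id, source, log_value)
    tuples. Both this layout and `max_spread` are illustrative assumptions,
    not taken from the paper.
    """
    grouped = defaultdict(dict)
    for compound, target, source, value in records:
        # A later record from the same source overwrites an earlier one.
        grouped[(compound, target)][source] = value

    summary = {}
    for key, by_source in grouped.items():
        values = list(by_source.values())
        spread = max(values) - min(values)
        summary[key] = {
            "mean": mean(values),
            "n_sources": len(values),
            "flag": spread > max_spread,  # True means the sources disagree
        }
    return summary


# Hypothetical example records (identifiers and values are made up).
example = [
    ("CPD1", "T1", "ChEMBL", 6.5),
    ("CPD1", "T1", "BindingDB", 6.7),
    ("CPD2", "T1", "ChEMBL", 5.0),
    ("CPD2", "T1", "PubChem", 7.5),
]
for key, info in compare_sources(example).items():
    print(key, info)
```

In this sketch, agreement across several sources raises `n_sources`, which could serve as a rough confidence indicator, while a flagged pair marks a potentially erroneous entry.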

Keywords: big data; data curation; de novo design; machine learning; medicinal chemistry.

Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Figures

Figure 1
Analysis of individual compound/bioactivity databases to assemble a consensus dataset (C = ChEMBL, PC = PubChem, B = BindingDB, I = IUPHAR/BPS, and PD = Probes & Drugs). From PubChem, only compounds that were also included in at least one of the other databases were considered, and the data in PubChem referring to these compounds were used to validate and curate chemical structures and biological activities. (a,b) Venn diagrams of the data collected from the five different databases, created by compound ID (a) or SMILES (b). The numbers of shared and not-shared molecules are reported, and the numbers of unique scaffolds are in round brackets. (c) The five most frequently occurring Murcko scaffolds present in the databases. (d) Distribution of molecular weight (MW), number of aromatic rings, rotatable bonds, and octanol–water partition coefficients (cLogP) per database. (e) UMAP of 2000 randomly selected molecules from each database. (f) Percentage pie chart of records labeled as active (activity log-value higher than 6), weakly active (activity log-value between 5 and 6), inactive (activity log-value lower than 5 or labeled inactive), not specified (no activity log-value), and no data point. Each ring represents a database in the order shown in the legend. Percentages below 5% are not displayed.
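The activity labels used in panel (f) can be written as a short function. This is a sketch under stated assumptions: the function name `activity_label` and the `flagged_inactive` parameter are hypothetical, and because the caption does not say how values of exactly 5 or 6 are handled, both boundaries are assigned to "weakly active" here. The "no data point" category describes a compound without any record in a database, so it is outside the scope of a per-record function.

```python
from typing import Optional


def activity_label(log_value: Optional[float], flagged_inactive: bool = False) -> str:
    """Map an activity log-value to the labels described for panel (f).

    Assumption: values of exactly 5 or 6 count as "weakly active";
    the caption leaves the boundary handling open.
    """
    if flagged_inactive:
        # Records explicitly labeled inactive stay inactive regardless of value.
        return "inactive"
    if log_value is None:
        return "not specified"
    if log_value > 6:
        return "active"
    if log_value >= 5:
        return "weakly active"
    return "inactive"


# Hypothetical example values.
for value in (7.2, 5.5, 4.1, None):
    print(value, "->", activity_label(value))
```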
Figure 2
Percentage pie chart of the bioactivity labels for seven important target families and detailed numbers of associated targets, compounds, and bioactivities (C = ChEMBL, PC = PubChem, B = BindingDB, I = IUPHAR/BPS, and PD = Probes & Drugs). Each ring represents a database in the order shown in the legend. Percentages below 5% are not displayed.
Figure 3
Structure of the consensus dataset.
Figure 4
Percentage pie chart of the bioactivity labels of the complete consensus dataset for seven important target families with detailed numbers of associated targets, compounds, and bioactivities.
Figure 5
Differences in structural data from different source databases (examples).
