J Cheminform. 2020 Sep 1;12(1):51.
doi: 10.1186/s13321-020-00456-1.

An open source chemical structure curation pipeline using RDKit

A Patrícia Bento et al. J Cheminform. 2020.

Abstract

Background: The ChEMBL database is one of several public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. To maintain the quality of the final database, and to make it easy to compare and integrate data on the same compound from different sources, the chemical structures in the database must be appropriately standardised.

Results: A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker, which tests the validity of chemical structures and flags any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component, which removes any salts and solvents from the compound to create its parent. The pipeline has been applied to the latest version of the ChEMBL database, as well as to uncurated datasets from other sources, to test the robustness of the process and to identify common issues in database molecular structures.
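The three stages described above can be pictured with a minimal sketch. This is not the authors' code and it does not use RDKit: it is a toy built on plain string handling of SMILES, intended only to show how a check, a standardisation rule and a parent-extraction step might be chained. The function names (`check`, `standardize`, `get_parent`, `curate`), the penalty values and the `SALTS_AND_SOLVENTS` lookup are all assumptions made for illustration.

```python
# Toy illustration only: plain string handling on SMILES, no RDKit, no real
# chemical perception. All names and values below are hypothetical.

# Hypothetical lookup of counter-ions and solvents, written as SMILES fragments.
SALTS_AND_SOLVENTS = {"Cl", "[Cl-]", "[Na+]", "O", "OS(=O)(=O)O"}


def check(smiles):
    """Checker stage: return (penalty, message) pairs, most serious first.

    The penalty values are invented for this sketch.
    """
    issues = []
    if not smiles.strip():
        issues.append((3, "empty structure"))
    if smiles.count("(") != smiles.count(")"):
        issues.append((2, "unbalanced branch parentheses"))
    if "." in smiles:
        issues.append((1, "more than one component"))
    return sorted(issues, reverse=True)


def standardize(smiles):
    """Standardizer stage: a single toy rule that rewrites a hypervalent
    nitro group into its charge-separated form."""
    return smiles.replace("N(=O)=O", "[N+](=O)[O-]")


def get_parent(smiles):
    """GetParent stage: drop fragments found in the salt/solvent lookup.

    If every fragment is in the lookup, the input is returned unchanged so
    that the record is never emptied.
    """
    fragments = smiles.split(".")
    kept = [f for f in fragments if f not in SALTS_AND_SOLVENTS]
    return ".".join(kept) if kept else smiles


def curate(smiles):
    """Run the three stages in order and collect their results."""
    standard = standardize(smiles)
    return {
        "issues": check(smiles),
        "standard": standard,
        "parent": get_parent(standard),
    }


if __name__ == "__main__":
    # Two salt forms of amphetamine and a nitrobenzene drawn with N(=O)=O.
    for record in ("CC(N)Cc1ccccc1.Cl",
                   "CC(N)Cc1ccccc1.OS(=O)(=O)O",
                   "c1ccccc1N(=O)=O"):
        print(record, curate(record))
```

In this sketch the two amphetamine salt forms reduce to the same parent string, which is the kind of aggregation the GetParent component is described as providing. A real implementation would operate on parsed molecules rather than on raw strings, since string matching cannot recognise equivalent structures written differently.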

Conclusion: All the components of the structure pipeline have been made freely available for other researchers to use and adapt. The code is available in a GitHub repository and can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify the compounds with the most serious issues so that they can be prioritised for manual curation.

Keywords: ChEMBL; Chemistry; Curation; Open source; RDKit; Standardisation.

Conflict of interest statement

The authors have no competing interests.

Figures

Fig. 1
Examples of the multicomponent forms of paroxetine and amphetamine and how they have been aggregated by use of the GetParent component
Fig. 2
Examples of standardisations that have been applied to a set of compounds. The compound structure before and after standardisation is shown: (a) fix hypervalent nitro groups; (b) remove explicit H atoms; (c) fix covalently drawn alkaline metals connected to O or N to ionic forms; (d) standardise sulphoxides to charge separated form; (e) normalise (straighten) allene bonds
Fig. 3
Examples of compounds from the ChEMBL literature set where the InChIKey changed on standardisation due to the rebalancing of the charge on the compound
Fig. 4
Examples of approved drugs standardised by the ChEMBL RDKit Standardizer and the PubChem standardiser
Fig. 5
The composition and number of the compounds containing more than one component in ChEMBL 26 as identified by the GetParent module. The numbers in brackets refer to the number of compounds in each grouping that contain isotopes
Fig. 6
Examples of applying the GetParent module to some representative ChEMBL compounds containing varying combinations of salts, isotopes and solvents. The “Child” is the compound before and “Parent” the compound after the process has been applied
