BMC Bioinformatics. 2010 Mar 29;11:159.
doi: 10.1186/1471-2105-11-159

CDK-Taverna: an open workflow environment for cheminformatics

Thomas Kuhn et al. BMC Bioinformatics.

Abstract

Background: Small molecules are of increasing interest for bioinformatics in areas such as metabolomics and drug discovery. The recent release of large open access chemistry databases generates a demand for flexible tools to process them and discover new knowledge. To freely support open science based on these data resources, it is desirable for the processing tools to be open source and available for everyone.

Results: Here we describe a novel combination of the workflow engine Taverna and the cheminformatics library Chemistry Development Kit (CDK), resulting in an open source workflow solution for cheminformatics. We have implemented more than 160 different workers to handle specific cheminformatics tasks. We describe the applications of CDK-Taverna in various usage scenarios.

Conclusions: The combination of the workflow engine Taverna and the Chemistry Development Kit provides the first open source cheminformatics workflow solution for the biosciences. With the Taverna community working towards a more powerful workflow engine and a more user-friendly interface, CDK-Taverna has the potential to become a free alternative to existing proprietary workflow tools.

Figures

Figure 1
Workflow performing a topological substructure search (Scenario 1) on molecules from an MDL SDfile [35]. The input of this workflow is a SMILES string that represents the substructure.
Figure 2
Workflow performing a substructure search (Scenario 1) in a database [36]. The substructure is defined with a SMILES string. The output is a PDF file with a tabular view of the molecules from the database containing the substructure.
Figure 3
Workflow calculating various QSAR descriptors (Scenario 2) for molecules from a PostgreSQL database. The results of the calculation are stored in a CSV file.
Figure 4
User interface to select QSAR descriptors to be calculated for each molecule during the execution of the descriptor calculation workflow shown in Figure 3.
Figure 5
Overview of the time needed to calculate different molecular descriptors for 1000 molecules [37].
Figure 6
Workflow iteratively calculating different QSAR descriptors (Scenario 3) for molecules loaded from a PostgreSQL database [38]. The results are stored in a CSV file.
Figure 7
Workflow that iteratively loads molecules from a database and searches for molecules with atom types unknown to the Chemistry Development Kit (Scenario 4).
Figure 8
Allocation of the unknown atom types detected during the analysis of the ChEBI database (12367 molecules). A total of 2414 atoms in 1035 molecules (8.36%) did not have a recognized atom type. X1 summarizes unrecognized atom types for the elements Am, Cf, Cm, Dy, Es, Fm, Ga, Lr, Md, Na, Nb, No, Np, Pm, Pu, Sm, Tb, Tc, Th, and Ti.
Figure 9
Reaction enumeration (Scenario 5) loading a generic reaction from an MDL RXNfile and two reactant lists from MDL SDfiles. The products of the enumeration are stored as MDL Molfiles. In addition to these files, a PDF document showing the 2D structures of the products is created. At the end, Bioclipse starts up to allow visualization and analysis of the results [39].
Figure 10
Reaction enumeration example with two building blocks. For each building block, a list of three reactants is defined. This enumeration results in nine different products.
Figure 11
Workflow for loading molecular descriptor data vectors from a database, followed by an ART 2-A clustering (Scenario 6).
Figure 12
Occupancies of six different detected clusters for two proprietary natural product databases (yellow and red) with the ChEBI database (blue), highlighting the unique character of the ChEBI database.
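
The product count in Figure 10 follows from pairing every reactant of the first building block with every reactant of the second. The sketch below illustrates only that combinatorial step. It is not the CDK-Taverna implementation: the reactant labels, the `enumerate_products` helper, and the string-joining "reaction" are hypothetical placeholders, and a real workflow would read structures from MDL SDfiles and apply the generic reaction from the RXNfile.

```python
from itertools import product


def enumerate_products(block_a, block_b, combine):
    """Apply `combine` to every pairing of reactants from two building-block lists."""
    return [combine(a, b) for a, b in product(block_a, block_b)]


# Hypothetical reactant labels standing in for structures read from SDfiles.
block_a = ["A1", "A2", "A3"]
block_b = ["B1", "B2", "B3"]

# Placeholder "reaction": joins the labels instead of transforming structures.
products = enumerate_products(block_a, block_b, lambda a, b: f"{a}+{b}")

print(len(products))  # 3 reactants x 3 reactants = 9 products
```

With three reactants per building block the list has 3 × 3 = 9 entries, matching the nine products in the figure. In general the number of products grows as the product of the reactant list lengths.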

References

    1. The PubChem Project. http://pubchem.ncbi.nlm.nih.gov/
    2. Irwin J, Shoichet B. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. Journal of Chemical Information and Modeling. 2005;45:177–182. doi: 10.1021/ci049714+.
    3. The ChEMBL Group. http://www.ebi.ac.uk/chembl
    4. Williams AJ. Public chemical compound databases. Current Opinion in Drug Discovery & Development. 2008;11(3):393–404.
    5. Hassan M, Brown RD, Varma-O'Brien S, Rogers D. Cheminformatics analysis and learning in a data pipelining environment. Molecular Diversity. 2006;10(3):283–299. doi: 10.1007/s11030-006-9041-5.