BMC Bioinformatics. 2010 Mar 29:11:159.

CDK-Taverna: an open workflow environment for cheminformatics

Thomas Kuhn¹, Egon L Willighagen, Achim Zielesny, Christoph Steinbeck

Affiliation

¹ Chemoinformatics and Metabolism, European Bioinformatics Institute, Cambridge, UK.

PMID: 20346188
PMCID: PMC2862046
DOI: 10.1186/1471-2105-11-159

Abstract

Background: Small molecules are of increasing interest for bioinformatics in areas such as metabolomics and drug discovery. The recent release of large open access chemistry databases generates a demand for flexible tools to process them and discover new knowledge. To freely support open science based on these data resources, it is desirable for the processing tools to be open source and available for everyone.

Results: Here we describe a novel combination of the workflow engine Taverna and the cheminformatics library Chemistry Development Kit (CDK), resulting in an open source workflow solution for cheminformatics. We have implemented more than 160 different workers to handle specific cheminformatics tasks. We describe the applications of CDK-Taverna in various usage scenarios.

Conclusions: The combination of the workflow engine Taverna and the Chemistry Development Kit provides the first open source cheminformatics workflow solution for the biosciences. With the Taverna community working towards a more powerful workflow engine and a more user-friendly interface, CDK-Taverna has the potential to become a free alternative to existing proprietary workflow tools.

Figures

**Figure 1**
**Workflow performing a topological substructure search (Scenario 1) on molecules from an MDL SDfile** [35]. The input of this workflow is a SMILES string that represents the substructure.

**Figure 2**
**Workflow performing a substructure search (Scenario 1) in a database** [36]. The substructure is defined with a SMILES string. The output is a PDF file with a tabular view of the molecules from the database containing the substructure.

**Figure 3**
**Workflow calculating various QSAR descriptors (Scenario 2) for molecules from a PostgreSQL database**. The results of the calculation are stored in a CSV file.

**Figure 4**
**User interface to select QSAR descriptors to be calculated for each molecule during the execution of the descriptor calculation workflow shown in Figure 3**.

**Figure 5**
**Overview of the time needed to calculate different molecular descriptors for 1000 molecules** [37].

**Figure 6**
**Workflow iteratively calculating different QSAR descriptors (Scenario 3) for molecules loaded from a PostgreSQL database** [38]. The results are stored in a CSV file.

**Figure 7**
**Workflow that iteratively loads molecules from a database and searches for molecules with atom types unknown to the Chemistry Development Kit (Scenario 4)**.

**Figure 8**
**Allocation of the unknown atom types detected during the analysis of the ChEBI database (12367 molecules)**. A total of 2414 atoms in 1035 molecules (8.36%) did not have a recognized atom type. X1 summarizes unrecognized atom types for the elements Am, Cf, Cm, Dy, Es, Fm, Ga, Lr, Md, Na, Nb, No, Np, Pm, Pu, Sm, Tb, Tc, Th, and Ti.
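The counts in the Figure 8 caption can be checked with a few lines of arithmetic. The sketch below only reuses the three numbers quoted in the caption; the class and method names are hypothetical. The quotient comes out at roughly 8.369%, which suggests the reported 8.36% was truncated rather than rounded, though that reading is an assumption.

```java
/** Arithmetic check of the counts quoted in the Figure 8 caption. */
final class AtomTypeCoverageSketch {

    /** Percentage that part represents of total. */
    static double percent(long part, long total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        long molecules = 12367;        // ChEBI molecules analysed (from the caption)
        long affectedMolecules = 1035; // molecules with at least one unrecognized atom type
        long unknownAtoms = 2414;      // atoms without a recognized atom type

        System.out.printf("affected molecules: %.3f%%%n",
                percent(affectedMolecules, molecules));
        System.out.printf("unknown atoms per affected molecule: %.2f%n",
                (double) unknownAtoms / affectedMolecules);
    }
}
```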

**Figure 9**
**Reaction enumeration (Scenario 5) loading a generic reaction from an MDL RXNfile and two reactant lists from MDL SDfiles**. The products from the enumeration are stored as MDL Molfiles. In addition to these files, a PDF document showing the 2D structures of the products is created. Finally, Bioclipse starts up to allow visualization and analysis of the results [39].

**Figure 10**
**Reaction enumeration example with two building blocks**. For each building block, a list of three reactants is defined. This enumeration results in nine different products.
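The count of nine products in Figure 10 follows from pairing every reactant of the first building block with every reactant of the second (3 × 3). The following is a minimal sketch of that combinatorial step only, using the Java standard library. The class name, method name, and reactant labels are hypothetical placeholders; it does not use the CDK or the CDK-Taverna worker, and no reaction chemistry is applied.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy enumeration: each reactant of block A is paired with each reactant of block B. */
final class ReactionEnumerationSketch {

    /** Returns one placeholder product label per (a, b) reactant pair. */
    static List<String> enumerate(List<String> blockA, List<String> blockB) {
        List<String> products = new ArrayList<>();
        for (String a : blockA) {
            for (String b : blockB) {
                // A real workflow would apply the generic reaction here;
                // this sketch only records which pair would be combined.
                products.add(a + "+" + b);
            }
        }
        return products;
    }

    public static void main(String[] args) {
        // Hypothetical reactant labels, not structures from the paper.
        List<String> blockA = List.of("A1", "A2", "A3");
        List<String> blockB = List.of("B1", "B2", "B3");
        List<String> products = enumerate(blockA, blockB);
        System.out.println(products.size() + " products: " + products);
    }
}
```

With three reactants per building block the list holds nine entries; adding a third building block would multiply the count again by the length of its reactant list.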

**Figure 11**
**Workflow for loading molecular descriptor data vectors from a database, followed by an ART 2-A clustering (Scenario 6)**.

**Figure 12**
**Occupancies of six different detected clusters for two proprietary natural product databases (yellow and red) with the ChEBI database (blue), highlighting the unique character of the ChEBI database**.
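For readers unfamiliar with the clustering named in Figures 11 and 12, the following is a simplified sketch in the spirit of ART 2-A, written against the Java standard library only. It is not the CDK-Taverna worker. The class name, parameter names, parameter values, and toy vectors are all assumptions, and the choice parameter for uncommitted nodes is omitted, so a new cluster is opened only when the best-matching prototype fails the vigilance test.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified ART 2-A-style clustering of non-negative descriptor vectors. */
final class Art2aSketch {
    private final double vigilance;    // rho: minimum similarity to join a cluster
    private final double learningRate; // eta: how far a prototype moves per update
    private final double threshold;    // theta: components at or below this are zeroed
    private final List<double[]> prototypes = new ArrayList<>();

    Art2aSketch(double vigilance, double learningRate, double threshold) {
        this.vigilance = vigilance;
        this.learningRate = learningRate;
        this.threshold = threshold;
    }

    /** Scales a vector to unit Euclidean length (zero vectors stay zero). */
    private static double[] normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) sumSq += x * x;
        double norm = Math.sqrt(sumSq);
        double[] out = new double[v.length];
        if (norm == 0.0) return out;
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    private static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Presents one vector and returns the index of the cluster it was assigned to. */
    int present(double[] input) {
        // Preprocessing: normalize, suppress small components, normalize again.
        double[] a = normalize(input);
        for (int i = 0; i < a.length; i++) if (a[i] <= threshold) a[i] = 0.0;
        a = normalize(a);

        // Choose the prototype with the largest dot product.
        int winner = -1;
        double best = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < prototypes.size(); j++) {
            double t = dot(a, prototypes.get(j));
            if (t > best) { best = t; winner = j; }
        }

        // Vigilance test failed (or no prototypes yet): open a new cluster.
        if (winner < 0 || best < vigilance) {
            prototypes.add(a);
            return prototypes.size() - 1;
        }

        // Otherwise move the winning prototype towards the input and renormalize.
        double[] w = prototypes.get(winner);
        double[] psi = new double[a.length];
        for (int i = 0; i < a.length; i++) psi[i] = w[i] > threshold ? a[i] : 0.0;
        psi = normalize(psi);
        double[] blended = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            blended[i] = learningRate * psi[i] + (1.0 - learningRate) * w[i];
        }
        prototypes.set(winner, normalize(blended));
        return winner;
    }

    int clusterCount() {
        return prototypes.size();
    }

    public static void main(String[] args) {
        // Hypothetical toy vectors, not data from the paper.
        Art2aSketch net = new Art2aSketch(0.9, 0.1, 0.05);
        double[][] data = {{1.0, 0.2, 0.0}, {0.9, 0.3, 0.0}, {0.0, 0.1, 1.0}, {0.0, 0.2, 0.9}};
        for (double[] row : data) System.out.println("cluster " + net.present(row));
        System.out.println(net.clusterCount() + " clusters");
    }
}
```

Raising the vigilance value makes the network stricter about admitting a vector to an existing cluster, so it tends to produce more, smaller clusters; cluster occupancies like those compared in Figure 12 would then be counts of assignments per cluster index.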


References

1. The PubChem Project. http://pubchem.ncbi.nlm.nih.gov/
2. Irwin J, Shoichet B. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. Journal of Chemical Information and Modeling. 2005;45:177–182. doi: 10.1021/ci049714+.
3. The ChEMBL Group. http://www.ebi.ac.uk/chembl
4. Williams AJ. Public chemical compound databases. Current Opinion in Drug Discovery & Development. 2008;11(3):393–404.
5. Hassan M, Brown RD, Varma-O'Brien S, Rogers D. Cheminformatics analysis and learning in a data pipelining environment. Molecular Diversity. 2006;10(3):283–299. doi: 10.1007/s11030-006-9041-5.
