J Cheminform. 2011 Dec 13;3:54.

doi: 10.1186/1758-2946-3-54.

New developments on the cheminformatics open workflow environment CDK-Taverna

Andreas Truszkowski¹, Kalai Vanii Jayaseelan, Stefan Neumann, Egon L Willighagen, Achim Zielesny, Christoph Steinbeck

PMID: 22166170
PMCID: PMC3292505
DOI: 10.1186/1758-2946-3-54

Andreas Truszkowski et al. J Cheminform. 2011.

Affiliation

¹ Chemoinformatics and Metabolism, European Bioinformatics Institute (EBI), Cambridge, UK. steinbeck@ebi.ac.uk.

Abstract

Background: The computational processing and analysis of small molecules is at the heart of cheminformatics and structural bioinformatics and of their application in, e.g., metabolomics or drug discovery. Pipelining or workflow tools allow for the Lego™-like, graphical assembly of I/O modules and algorithms into a complex workflow which can be easily deployed, modified and tested without the hassle of implementing it as a monolithic application. The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution by combining different open-source projects such as Taverna, the Chemistry Development Kit (CDK) and the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public.

Results: The CDK-Taverna project was migrated to the most up-to-date versions of its foundational software libraries with a complete re-engineering of its worker architecture (version 2.0). 64-bit computing and multi-core usage by parallel threads are now supported to allow for fast in-memory processing and analysis of large sets of molecules. Earlier deficiencies, such as workarounds for iterative data reading, have been removed. The combinatorial chemistry related reaction enumeration features are considerably enhanced. Additional functionality for calculating a natural product likeness score for small molecules is implemented to identify possible drug candidates. Finally, the data analysis capabilities are extended with new workers that provide access to the open-source WEKA library for clustering and machine learning as well as training and test set partitioning. The new features are outlined with usage scenarios.

Conclusions: CDK-Taverna 2.0, as an open-source cheminformatics workflow solution, has matured into a freely available and increasingly powerful tool for the biosciences. The combination of the new CDK-Taverna worker family with the already available workflows developed by a lively Taverna community and published on myexperiment.org enables molecular scientists to quickly calculate, process and analyse molecular data as typically found in, e.g., today's systems biology scenarios.

Figures

**Figure 1**
**Advanced reaction enumeration features:** (left) The *Variable RGroup* feature allows the definition of chemical groups which can be flexibly attached to predefined atoms. (middle) The *Atom Alias* feature offers the possibility to define a wild card for preconfigured elements. (right) The *Expandable Atom* feature enables the definition of freely sizeable rings or aliphatic chains.

**Figure 2**
**Workflow for reaction enumeration:** After loading a generic reaction (IN REACTION, from an MDL RXN file) and two educt lists (IN REACTANTS 1, IN REACTANTS 2, from MDL SD files), the Reaction Enumerator worker performs the enumeration, with the results stored as MDL RXN files. An additional PDF file is created which shows all enumerated reactions in tabular form. The results are stored in the output folder determined by the OUT input port.

**Figure 3**
**Capabilities of the advanced reaction enumerator:** The sketched generic reaction contains three different generic groups labelled X, Y and Z. Group X defines a *Variable RGroup* which can freely attach to all atoms of the ring. The *Atom Alias* group labelled Y is a wild card for the elements carbon, oxygen and nitrogen. The *Expandable Atom* group Z defines a variable ring size: the ring can be expanded by up to two additional carbon atoms. The enumerated products marked with the small letters a and b originate from *multi-match detection*.

**Figure 4**
**Molecule curation and atom signature descriptor generation workflow:** The Iterative SDfile Reader takes the Structure-Data File (SDF) of compounds (Input SDF) as input and passes the structures down the workflow for molecule curation and atom signature generation. The number of structures to be read and passed down the workflow can be configured (Iterations). As soon as a molecule is read, the Tag Molecules with UUID worker tags it with a Universally Unique IDentifier (UUID) to keep track of it during the process. The Molecule connectivity checker worker checks the connectedness of the structure and removes counter ions and disconnected fragments. The Remove sugar groups worker removes linear and ring sugars from the structures. The Curate Strange Elements worker removes structures containing elements other than non-metals. Finally, the Generate Atom Signatures worker generates an atom signature for each atom in a curated compound, tagged with the respective UUID of the compound. The generated atom signatures are written out to a text file (signatures file) using the Text File Writer worker. The SDF of compound structures can be written out to a file after tagging with the UUID (Tagged SDFile), and also after any curation step (Curated SDF), using the SDFile Writer worker. This workflow can be freely downloaded at http://www.myexperiment.org/workflows/2120.html.
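The first curation steps of this workflow can be illustrated with a minimal Python sketch. It is not the CDK-Taverna implementation: the molecule representation (a list of element symbols plus a list of bonded index pairs), the function names, the rule of keeping the largest connected fragment, and the `ALLOWED_ELEMENTS` set are all assumptions made for illustration.

```python
import uuid

# Illustrative set of non-metals; the set used by the actual worker is not
# stated in the source and may differ.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}


def tag_with_uuid(molecule):
    """Return a copy of a molecule record (a dict) carrying a fresh UUID."""
    tagged = dict(molecule)
    tagged["uuid"] = str(uuid.uuid4())
    return tagged


def largest_fragment(atoms, bonds):
    """Return the sorted atom indices of the largest connected fragment.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Assumption: counter ions and stray fragments are the smaller components.
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    seen = set()
    best = []
    for start in range(len(atoms)):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if len(component) > len(best):
            best = component
    return sorted(best)


def has_only_allowed_elements(atoms, allowed=ALLOWED_ELEMENTS):
    """True if every atom is in the allowed (non-metal) element set."""
    return all(symbol in allowed for symbol in atoms)
```

For a heavy-atom skeleton of sodium acetate, `largest_fragment(["C", "C", "O", "O", "Na"], [(0, 1), (1, 2), (1, 3)])` keeps the four acetate atoms and drops the isolated sodium.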

**Figure 5**
**NP-likeness scoring workflow:** This workflow takes as input atom signature files generated from a user-defined natural products library (NP file), from synthetics (SM file) and from compound libraries (Query file), and scores the compound libraries (Query file) for NP-likeness. The higher the score, the greater the NP-likeness of a molecule. The Query fragments scorer worker generates a score for each compound in the Query file, tagged with the corresponding UUID of the compound. Pairs of compound UUID and score are written out to a text file (Score file), which can also be passed to the Plot Distribution As PDF worker to see the distribution of the score density of the complete query dataset. The Query fragments scorer worker also regenerates the structure for every atom signature and tags it with its corresponding fragment score and the UUID of the compound to which it belongs. These fragment structures with scores are written out to an SDF file (Fragments SDF), as they are helpful in identifying fragments with high NP-likeness. This workflow can be freely downloaded at http://www.myexperiment.org/workflows/2121.html.
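The caption does not give the scoring formula. As a rough sketch only, the code below assumes a log-ratio fragment score in the spirit of published NP-likeness scores: each atom signature is scored by the logarithm of its relative frequency in the natural product library over its relative frequency in the synthetics library, and a compound score is the mean fragment score over its atom signatures. The function names, the handling of signatures missing from one library, and the normalisation are assumptions, not the behaviour of the Query fragments scorer worker.

```python
import math
from collections import Counter


def fragment_scores(np_signatures, sm_signatures):
    """Score each atom signature seen in both libraries.

    Assumed formula: log of (relative frequency in NP library) divided by
    (relative frequency in synthetics library).
    """
    np_counts = Counter(np_signatures)
    sm_counts = Counter(sm_signatures)
    np_total = sum(np_counts.values())
    sm_total = sum(sm_counts.values())

    scores = {}
    # Assumption: signatures found in only one library are skipped,
    # which avoids division by zero and log(0).
    for signature in np_counts.keys() & sm_counts.keys():
        np_freq = np_counts[signature] / np_total
        sm_freq = sm_counts[signature] / sm_total
        scores[signature] = math.log(np_freq / sm_freq)
    return scores


def np_likeness(query_signatures, scores):
    """Mean fragment score over a compound's atom signatures (one per atom).

    Unknown signatures contribute zero; an empty compound scores zero.
    """
    if not query_signatures:
        return 0.0
    total = sum(scores.get(signature, 0.0) for signature in query_signatures)
    return total / len(query_signatures)
```

With this sketch, a compound built mostly from signatures that are enriched in the natural product library receives a positive score, and one built from signatures typical of synthetics receives a negative score.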

**Figure 6**
**Genetic algorithm for selection of an optimum reduced set of input vector components:** The algorithm starts with a random population in which each chromosome consists of a random distribution of enabled/disabled (on/off) input vector components denoted A₁ to Aₙ (where the number of components with "on" status remains fixed during evolution). This distribution is changed by mutation and cross-over. The fitness of each chromosome is evaluated by the inverse square RMSE. The selection process for each generation is performed by roulette wheel selection, where chromosomes are inherited with probabilities that correspond to their particular fitness.
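The building blocks named in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GA Attribute Selection worker: chromosomes are lists of booleans, the mutation is a swap that keeps the number of enabled components fixed, and all function names are hypothetical. A cross-over operator that preserves the "on" count is omitted.

```python
import random


def fitness(rmse):
    """Inverse square RMSE: a smaller error gives a larger fitness."""
    return 1.0 / (rmse * rmse)


def random_chromosome(n_components, n_on, rng):
    """Random on/off pattern with exactly n_on enabled components."""
    enabled = set(rng.sample(range(n_components), n_on))
    return [i in enabled for i in range(n_components)]


def swap_mutation(chromosome, rng):
    """Turn one enabled component off and one disabled component on,
    so the number of enabled components stays fixed."""
    on = [i for i, bit in enumerate(chromosome) if bit]
    off = [i for i, bit in enumerate(chromosome) if not bit]
    mutated = list(chromosome)
    if on and off:
        mutated[rng.choice(on)] = False
        mutated[rng.choice(off)] = True
    return mutated


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return chromosome
    return population[-1]  # guard against floating-point round-off
```

In a full loop, each chromosome would be scored by training a model on its enabled components and passing the resulting RMSE to `fitness`; that model-fitting step is left out here.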

**Figure 7**
**"Leave-One-Out" analysis to estimate the significance of input vector components:** The root mean square error (RMSE) rises with an increasing number of discarded components (i.e. a decreasing number of input vector components used for the machine learning procedure). The relative RMSE shift from step to step may be correlated with the significance of the discarded component. In this case the first fifty components have only a negligible influence on the machine learning result and thus may be excluded from further analysis.
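One way to read this procedure is as a greedy backward elimination: at each step, the component whose removal hurts the RMSE least is discarded, and the RMSE after each step is recorded. The sketch below follows that reading, which is an assumption; the source does not spell out the selection rule. The `rmse_of` callback, which would train and evaluate a model on a given subset of components, is hypothetical.

```python
def leave_one_out_ranking(components, rmse_of):
    """Greedy backward elimination of input vector components.

    components: list of component names.
    rmse_of: callable taking a tuple of component names and returning
             the RMSE of a model trained on exactly those components.
    Returns a list of (discarded_component, rmse_after_discard) in
    discard order; the last remaining component is not discarded.
    """
    remaining = list(components)
    history = []
    while len(remaining) > 1:
        best_name, best_rmse = None, float("inf")
        for name in remaining:
            # Evaluate the model with this one component left out.
            trial = tuple(c for c in remaining if c != name)
            rmse = rmse_of(trial)
            if rmse < best_rmse:
                best_name, best_rmse = name, rmse
        remaining.remove(best_name)
        history.append((best_name, best_rmse))
    return history
```

Components discarded early, with little change in RMSE, are the candidates for exclusion; a sharp RMSE jump marks the point where significant components start to be removed.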

**Figure 8**
**Workflow for "Leave-One-Out" analysis:** First a regression dataset is generated from a CSV file with UUID and molecular descriptor input data for each molecule (IN QSAR) and a CSV file containing the UUID of the molecule and the corresponding output (regression) value (IN RTID). Then the Leave-One-Out Attribute Selection worker evaluates the significance of the input components and generates a dataset for each evaluation step. Afterwards the composed datasets are coded as XRFF files. A CSV file with the sequence of discarded input vector components is generated. In addition the results are visualised with a PDF output file. Instead of the Leave-One-Out Attribute Selection worker, a GA Attribute Selection worker may be used to determine a minimum molecular descriptor subset with maximum predictability. The results are stored in the output folder determined by the OUT input port.

**Figure 9**
**Partitioning into training and test set:** A regression dataset is split into a training and a test set by the Split Dataset Into Train-/Testset worker. Then a regression model is created by the Weka Regression worker and evaluated by the Evaluate Regression Results as PDF worker, which stores the results in a PDF file. The dataset is read from an XRFF file (IN XRFF). The generated test and training sets are coded as XRFF files and stored on hard disk. The OUT input port determines the result output folder.
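The partitioning step can be sketched as a seeded random split. The caption does not say how the worker chooses test set members, so the random shuffle, the `train_fraction` parameter and the function name below are assumptions for illustration only.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Randomly partition rows into (training set, test set).

    A fixed seed makes the split reproducible between runs.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Every row ends up in exactly one of the two sets, so a model fitted on the training set is evaluated only on data it has not seen.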

**Figure 10**
**Configuration panel for the Weka Regression worker:** The configuration for a three-layer perceptron neural network is selected. Each machine learning method has its own parameter panel for individual configuration.

**Figure 11**
**Diagrams for machine learning results:** (upper left) Scatter plot with experimental versus predicted output values. (upper right) Residuals plot with differences between the predicted and experimental output values. (lower left) Experimental output data are plotted over corresponding sorted predicted output data. (lower right) Characteristic quantities of the predicted model.

Cited by

A survey of quantitative descriptions of molecular structure.
Guha R, Willighagen E. Curr Top Med Chem. 2012;12(18):1946-56. doi: 10.2174/156802612804910278. PMID: 23110530.
The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching.
Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G, Evelo CT, Guha R, Steinbeck C. J Cheminform. 2017 Jun 6;9(1):33. doi: 10.1186/s13321-017-0220-4. PMID: 29086040.
Applications of the InChI in cheminformatics with the CDK and Bioclipse.
Spjuth O, Berg A, Adams S, Willighagen EL. J Cheminform. 2013 Mar 13;5(1):14. doi: 10.1186/1758-2946-5-14. PMID: 23497723.
The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud.
Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P, Bhagat J, Belhajjame K, Bacall F, Hardisty A, Nieva de la Hidalga A, Balcazar Vargas MP, Sufi S, Goble C. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W557-61. doi: 10.1093/nar/gkt328. Epub 2013 May 2. PMID: 23640334.
Scientific workflow systems: Pipeline Pilot and KNIME.
Warr WA. J Comput Aided Mol Des. 2012 Jul;26(7):801-4. doi: 10.1007/s10822-012-9577-7. Epub 2012 May 27. PMID: 22644661.
