A graph-based approach for designing extensible pipelines

Maíra R Rodrigues¹, Wagner C S Magalhães, Moara Machado, Eduardo Tarazona-Santos

PMID: 22788675
PMCID: PMC3496580
DOI: 10.1186/1471-2105-13-163

Maíra R Rodrigues et al. BMC Bioinformatics. 2012 Jul 12;13:163. doi: 10.1186/1471-2105-13-163.

Affiliation

¹ Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil. maira.r.rodrigues@gmail.com

Abstract

Background: In bioinformatics, it is important to build extensible and low-maintenance systems that can cope with the new tools and data formats that are constantly being developed. The traditional and simplest way to implement a pipeline is to hardcode the execution steps into programs or scripts. This approach causes problems as a pipeline expands, because incorporating new tools is often error-prone and time-consuming. Current approaches to pipeline development, such as workflow management systems, focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamic pipeline composition is necessary when each execution requires a different combination of steps.

Results: We propose a graph-based approach to implementing extensible and low-maintenance pipelines. It is suitable for pipeline applications with multiple functionalities that require a different combination of steps in each execution. Pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but the approach is also helpful in other fields where multiple software tools are needed to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms.
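
The graph representation described above can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the registry entries, tool codes (`T1` to `T5`), format names and function names are hypothetical, and breadth-first search is used only as one simple way to find a path with the fewest tools.

```python
from collections import deque

# Hypothetical tool registry: (tool code, input format, output format).
# These are placeholders, not entries from the authors' registry.
REGISTRY = [
    ("T1", "A", "B"),
    ("T2", "B", "C"),
    ("T3", "C", "E"),
    ("T4", "E", "F"),
    ("T5", "A", "D"),
]


def build_graph(registry):
    """Map each format (node) to its outgoing tools (edges)."""
    graph = {}
    for code, src, dst in registry:
        graph.setdefault(src, []).append((code, dst))
        graph.setdefault(dst, [])  # make sure sink formats are nodes too
    return graph


def compose_pipeline(graph, start, end):
    """Breadth-first search from start to end.

    Returns the tool codes along a path with the fewest tools,
    or None when no chain of tools connects the two formats.
    """
    if start not in graph or end not in graph:
        return None
    came_from = {start: None}  # node -> (previous node, tool code)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for code, nxt in graph[node]:
            if nxt not in came_from:
                came_from[nxt] = (node, code)
                queue.append(nxt)
    if end not in came_from:
        return None
    # Walk back from the end point to recover the tool sequence.
    tools = []
    node = end
    while came_from[node] is not None:
        node, code = came_from[node]
        tools.append(code)
    return tools[::-1]


graph = build_graph(REGISTRY)
pipeline = compose_pipeline(graph, "A", "F")
print(pipeline)
```

In this sketch, registering a new tool means adding one row to the registry; the paths that use it are found at request time rather than written out in advance.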

Conclusions: Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.
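
The effect of combining tools on the number of available conversions can be seen in a small Python sketch. The four direct converters below are hypothetical format pairs, not the tools from the authors' application, and the function name is illustrative; the sketch only counts which format pairs become reachable once converters may be chained.

```python
# Hypothetical direct converters as (input format, output format) pairs.
DIRECT = [("A", "B"), ("B", "C"), ("C", "E"), ("E", "F")]


def reachable_pairs(edges):
    """Return every (source, target) pair linked by one or more converters."""
    outgoing = {}
    for src, dst in edges:
        outgoing.setdefault(src, set()).add(dst)
    pairs = set()
    for start in outgoing:
        # Depth-first walk over everything reachable from this format.
        stack = list(outgoing[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(outgoing.get(node, ()))
        pairs.update((start, target) for target in seen if target != start)
    return pairs


pairs = reachable_pairs(DIRECT)
print(len(DIRECT), "direct converters,", len(pairs), "conversions when chained")
```

With these four placeholder converters arranged in a chain, ten source and target pairs are reachable, compared with the four that the tools provide individually.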

Figures

**Figure 1**
**Graphic representation of the pipeline system algorithm.** (1) Algorithm inputs: the start and end points, A and F (which are data formats), for a specific processing task, and the tool registry file; (2) directed graph built from the information in the tool registry, where regular nodes represent inputs and outputs, edges represent tools (denoted by their Code) and have a specific weight (w_j), and double-circled nodes represent input dependencies (XI) or secondary outputs (XO); (3) path through the graph connecting the start and end points, P_A,F((A,B),(B,C),(C,E),(E,F)), generated by a graph-traversing procedure; (4) executable task-specific pipeline, which specifies the required inputs for the pipeline (file .inputs), the sequence of tools to be run (file .exec) and the output file (file .outputs).
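
The four stages in this caption can be sketched in Python under several stated assumptions. The registry fields, tool codes and weights are hypothetical. Weights are treated as non-negative costs to be minimised, which the caption does not specify. Dijkstra's algorithm stands in for the unspecified graph-traversing procedure. Secondary outputs (XO) are left out, and the plan is returned as a dictionary rather than written to the .inputs, .exec and .outputs files.

```python
import heapq

# Hypothetical registry rows; "xi" lists extra input dependencies of a tool.
REGISTRY = [
    {"code": "t_ab", "src": "A", "dst": "B", "weight": 1.0, "xi": []},
    {"code": "t_bc", "src": "B", "dst": "C", "weight": 1.0, "xi": ["X1"]},
    {"code": "t_ce", "src": "C", "dst": "E", "weight": 1.0, "xi": []},
    {"code": "t_ef", "src": "E", "dst": "F", "weight": 1.0, "xi": []},
    {"code": "t_af", "src": "A", "dst": "F", "weight": 10.0, "xi": []},
]


def cheapest_path(registry, start, end):
    """Dijkstra over formats; returns the tools on the lowest-weight path or None."""
    outgoing = {}
    for tool in registry:
        outgoing.setdefault(tool["src"], []).append(tool)
    best = {start: 0.0}  # lowest known total weight per format
    via = {}             # format -> tool used to reach it
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == end:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for tool in outgoing.get(node, []):
            new_cost = cost + tool["weight"]
            if new_cost < best.get(tool["dst"], float("inf")):
                best[tool["dst"]] = new_cost
                via[tool["dst"]] = tool
                heapq.heappush(heap, (new_cost, tool["dst"]))
    if end not in best:
        return None
    path = []
    node = end
    while node != start:
        tool = via[node]
        path.append(tool)
        node = tool["src"]
    return path[::-1]


def make_plan(path, start, end):
    """Summarise a path as required inputs, tool sequence and final output."""
    return {
        "inputs": [start] + [x for tool in path for x in tool["xi"]],
        "exec": [tool["code"] for tool in path],
        "outputs": [end],
    }


plan = make_plan(cheapest_path(REGISTRY, "A", "F"), "A", "F")
print(plan)
```

Here the four-step route is chosen over the hypothetical direct tool `t_af` because its total weight is lower, and the extra input `X1` needed by `t_bc` is carried into the list of required inputs.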

**Figure 2**
**Web interface for our format conversion pipeline.** Three scenarios are depicted: a conversion from SDAT format to R HierFstat format (denoted in green); a conversion from the PolyPhred output format to the Structure input format (denoted in purple); and a conversion from the PHASE output format to DnaSP input format or Fasta format (denoted in blue).

**Figure 3**
**Tool Graph for our format conversion pipeline system.** Nodes are popular data formats from population genetics and genetic epidemiology. Edges are labelled with the conversion tool’s Code and have an associated weight (represented in round brackets) indicating the tools’ performance.

Grants and funding

1R01TW007894/TW/FIC NIH HHS/United States
