Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2012 Jul 12:13:163.
doi: 10.1186/1471-2105-13-163.

A graph-based approach for designing extensible pipelines

Affiliations

A graph-based approach for designing extensible pipelines

Maíra R Rodrigues et al. BMC Bioinformatics. .

Abstract

Background: In bioinformatics, it is important to build extensible and low-maintenance systems that are able to deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead to problems when a pipeline is expanding because the incorporation of new tools is often error prone and time consuming. Current approaches to pipeline development such as workflow management systems focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamism on the pipeline composition is necessary when each execution requires a different combination of steps.

Results: We propose a graph-based approach to implement extensible and low-maintenance pipelines that is suitable for pipeline applications with multiple functionalities that require different combinations of steps in each execution. Here pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields where the use of multiple software is necessary to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms.

Conclusions: Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Graphic representation of the pipeline system algorithm. Graphic representation of the pipeline system algorithm. (1) algorithm inputs: start and end points, A and F (which are data formats), for a specific processing task, and the tool registry file; (2) directed graph built based on information from the tool registry, where regular nodes represent inputs and outputs, edges represent tools (denoted by their Code) and have a specific weight (wj), and double circled nodes represent input dependencies (XI) or secondary outputs (XO); (3) path through the graph connecting the start and end points, PA,F((A,B),(B,C),(C,E),(E,F)), generated by a graph-traversing procedure; (4) executable task-specific pipeline, which specifies the required inputs for the pipeline (file .inputs), the sequence of tools to be run (file .exec) and the output file (file .outputs).
Figure 2
Figure 2
Web interface for our format conversion pipeline. Three scenarios are depicted: a conversion from SDAT format to R HierFstat format (denoted in green); a conversion from the PolyPhred output format to the Structure input format (denoted in purple); and a conversion from the PHASE output format to DnaSP input format or Fasta format (denoted in blue).
Figure 3
Figure 3
Tool Graph for our format conversion pipeline system. Nodes are popular data formats from population genetics and genetic epidemiology. Edges are labelled with the conversion tool’s Code and have an associated weight (represented in round brackets) indicating the tools’ performance.

References

    1. Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. - PubMed
    1. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4680. - PMC - PubMed
    1. Felsenstein J. PHYLIP – Phylogeny Inference Package (Version 3.2) Cladistics. 1989;5:164–166.
    1. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Bio Sci. 1997;13:555–556. - PubMed
    1. Yang Z. PAML 4: a program package for phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591. doi: 10.1093/molbev/msm088. - DOI - PubMed

Publication types