Front Neuroinform. 2012 Apr 3:6:7.
doi: 10.3389/fninf.2012.00007. eCollection 2012.

The pipeline system for Octave and Matlab (PSOM): a lightweight scripting framework and execution engine for scientific workflows

Pierre Bellec et al. Front Neuroinform.

Abstract

The analysis of neuroimaging databases typically involves a large number of inter-connected steps called a pipeline. The pipeline system for Octave and Matlab (PSOM) is a flexible framework for the implementation of pipelines in the form of Octave or Matlab scripts. PSOM does not introduce new language constructs to specify the steps and structure of the workflow. All steps of analysis are instead described by a regular Matlab data structure, documenting their associated command and options, as well as their input, output, and cleaned-up files. The PSOM execution engine provides a number of automated services: (1) it executes jobs in parallel on a local computing facility as long as the dependencies between jobs allow for it and sufficient resources are available; (2) it generates a comprehensive record of the pipeline stages and the history of execution, which is detailed enough to fully reproduce the analysis; (3) if an analysis is started multiple times, it executes only the parts of the pipeline that need to be reprocessed. PSOM is distributed under an open-source MIT license and can be used without restriction for academic or commercial projects. The package has no external dependencies besides Matlab or Octave, is straightforward to install, and supports a variety of operating systems (Linux, Windows, Mac). We ran several benchmark experiments on a public database including 200 subjects, using a pipeline for the preprocessing of functional magnetic resonance images (fMRI). The benchmark results showed that PSOM is a powerful solution for the analysis of large databases using local or distributed computing resources.

Keywords: Matlab; Octave; high-performance computing; neuroimaging; open-source; parallel computing; pipeline; workflow.

Figures

Figure 1
Examples of dependency graphs. In panel (A), the input file of the job quadratic is an output of the job sample; sample thus needs to be completed before starting quadratic. This type of dependency (“file-passing”) can be represented as a directed dependency graph. In panel (B), the job cleanup deletes an input file of quadratic; quadratic thus needs to be completed before starting cleanup. Note that such “cleanup” dependencies may involve more than two jobs: if cleanup deletes some input files used by both quadratic and cubic, cleanup depends on both of them (panel C). The same property holds for “file-passing” dependencies: if sum is using the outputs of both quadratic and cubic, sum depends on both jobs (panel D).
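
The two dependency rules in this caption can be sketched in a few lines of Python. This is only an illustration, not PSOM code: the job names come from the caption, while the file names and the dictionary field names (`files_in`, `files_out`, `files_clean`) are assumptions chosen for the example.

```python
# Hypothetical toy pipeline: job names follow the caption, file names are assumed.
pipeline = {
    "sample":    {"files_in": [], "files_out": ["sample.mat"], "files_clean": []},
    "quadratic": {"files_in": ["sample.mat"], "files_out": ["quadratic.mat"], "files_clean": []},
    "cubic":     {"files_in": ["sample.mat"], "files_out": ["cubic.mat"], "files_clean": []},
    "sum":       {"files_in": ["quadratic.mat", "cubic.mat"], "files_out": ["sum.mat"], "files_clean": []},
    "cleanup":   {"files_in": [], "files_out": [], "files_clean": ["sample.mat"]},
}


def build_dependencies(pipeline):
    """Return {job: set of jobs that must complete before it can start}."""
    # Map every output file to the job that produces it.
    producers = {f: name for name, job in pipeline.items() for f in job["files_out"]}
    deps = {name: set() for name in pipeline}
    for name, job in pipeline.items():
        # "File-passing": a job waits for the producer of each of its inputs.
        for f in job["files_in"]:
            if f in producers and producers[f] != name:
                deps[name].add(producers[f])
        # "Cleanup": a job waits for every job that reads a file it deletes.
        for f in job["files_clean"]:
            for other, other_job in pipeline.items():
                if other != name and f in other_job["files_in"]:
                    deps[name].add(other)
    return deps


deps = build_dependencies(pipeline)
```

Under these assumptions, `deps["cleanup"]` contains both `quadratic` and `cubic` (panel C), and `deps["sum"]` contains the same two jobs (panel D).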
Figure 2
Pipeline execution: a first pass through the toy pipeline. Each panel represents one step in the execution of the toy pipeline presented in Section 2, without the cleanup job. This example assumes that at least two jobs can run in parallel, and that the pipeline was not executed before. All jobs are executed as soon as all of their dependencies are satisfied, possibly with some jobs running in parallel.
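
The execution order described in this caption can be approximated with a small simulation. The sketch below is a simplification and not the PSOM scheduler: it assumes that all jobs started in the same step finish together, and the function name `run_in_waves` and its `max_concurrent` parameter are hypothetical.

```python
def run_in_waves(deps, max_concurrent=2):
    """Group jobs into successive steps, starting each job once its dependencies are done."""
    done, waves = set(), []
    pending = set(deps)
    while pending:
        # A job is ready when every job it depends on has completed.
        ready = sorted(job for job in pending if deps[job] <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        wave = ready[:max_concurrent]  # respect the concurrency limit
        waves.append(wave)
        done.update(wave)
        pending.difference_update(wave)
    return waves


# Toy pipeline without the cleanup job, as in the caption.
deps = {
    "sample": set(),
    "quadratic": {"sample"},
    "cubic": {"sample"},
    "sum": {"quadratic", "cubic"},
}
waves = run_in_waves(deps, max_concurrent=2)
```

With two concurrent slots, `quadratic` and `cubic` share a step after `sample`, and `sum` runs last. A real engine would more likely start a job as soon as any individual dependency set is satisfied rather than wait for a whole step to finish.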
Figure 3
Pipeline management, example 2: updating a pipeline (with one bug). Each panel represents one step in the execution of the toy pipeline presented in Section 2, without the cleanup job. This example assumes that at least two jobs can run in parallel, and that the pipeline has already been executed once as outlined in Figure 2. The pipeline is first started after changing the job quadratic to introduce a bug (panels A–B). When the execution of the pipeline fails, the job quadratic is modified to fix the bug. The pipeline is then restarted and completes successfully (panels C–E).
Figure 4
Pipeline management, example 3: adding a (cleanup) job. This example assumes that the toy pipeline (without the cleanup job) had already been successfully completed. The full toy pipeline (with the cleanup job) is then submitted for execution. The only job that is not yet processed is cleanup, and the pipeline execution ends after cleanup successfully completes.
Figure 5
Pipeline management, example 4: restarting a job after its inputs have been cleaned up. This example assumes that the full toy pipeline (including the cleanup job) has already been successfully completed. The same pipeline is then submitted for a new run and the job quadratic is forced to be restarted. Because the inputs of quadratic (generated by sample) have been deleted by cleanup, the pipeline manager also restarts the job sample (panel A). Because all jobs depend indirectly on sample, all jobs in the pipeline have to be reprocessed (panels B–D).
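
One way to reason about this scenario is as a fixed-point computation over the job descriptions. The sketch below is an assumption about how such a rule could be written, not the PSOM implementation; the file names, the field names, and the function `jobs_to_restart` are hypothetical.

```python
def jobs_to_restart(pipeline, forced, existing_files):
    """Return the set of jobs to rerun, given forced jobs and the files still on disk."""
    producers = {f: name for name, job in pipeline.items() for f in job["files_out"]}
    restart = set(forced)
    changed = True
    while changed:
        changed = False
        # Files that will be rewritten by the jobs already marked for restart.
        regenerated = {f for name in restart for f in pipeline[name]["files_out"]}
        for name, job in pipeline.items():
            if name in restart:
                # Upstream rule: a missing input forces its producer to rerun.
                for f in job["files_in"]:
                    producer = producers.get(f)
                    if producer and f not in existing_files and producer not in restart:
                        restart.add(producer)
                        changed = True
            elif regenerated & (set(job["files_in"]) | set(job["files_clean"])):
                # Downstream rule: rerun jobs that read or delete a regenerated file.
                restart.add(name)
                changed = True
    return restart


# Hypothetical toy pipeline; sample.mat has been deleted by a previous cleanup.
pipeline = {
    "sample":    {"files_in": [], "files_out": ["sample.mat"], "files_clean": []},
    "quadratic": {"files_in": ["sample.mat"], "files_out": ["quadratic.mat"], "files_clean": []},
    "cubic":     {"files_in": ["sample.mat"], "files_out": ["cubic.mat"], "files_clean": []},
    "sum":       {"files_in": ["quadratic.mat", "cubic.mat"], "files_out": ["sum.mat"], "files_clean": []},
    "cleanup":   {"files_in": [], "files_out": [], "files_clean": ["sample.mat"]},
}
on_disk = {"quadratic.mat", "cubic.mat", "sum.mat"}
restart = jobs_to_restart(pipeline, forced={"quadratic"}, existing_files=on_disk)
```

Forcing `quadratic` pulls in `sample` because `sample.mat` is missing, and every other job then follows because it reads or deletes a regenerated file, which matches the outcome described in the caption.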
Figure 6
Overview of the PSOM implementation. On the user's side (left panel), a structure pipeline is built to describe the list of jobs, and a structure opt_pipe is used to configure PSOM. The memory of the pipeline is a logs folder located on the disk space (right panel), in which a series of files are stored to provide a comprehensive record of multiple runs of pipeline execution. PSOM proceeds in three stages (center panel). At the initialization stage, the current pipeline is compared with previous executions to set up a “to-do” list of jobs that need to be (re)started. Then, the pipeline manager is started, which constantly submits jobs for execution and monitors the status of on-going jobs. Finally, each job is executed independently by a job manager which reports the completion status upon termination (either “failed” or “finished”).
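
The initialization stage can be pictured as a comparison between the current job descriptions and those recorded during an earlier run. The following sketch is illustrative only: the function `initial_todo`, the command strings, and the in-memory dictionaries standing in for the logs folder are all assumptions, and only the status values "failed" and "finished" come from the caption.

```python
def initial_todo(current, previous, status):
    """Return the jobs that need to be (re)started before dependencies are considered."""
    todo = set()
    for name, job in current.items():
        if name not in previous:
            todo.add(name)  # job added since the last run
        elif previous[name] != job:
            todo.add(name)  # job description has changed
        elif status.get(name) != "finished":
            todo.add(name)  # job failed or never completed
    return todo


# Hypothetical record of an earlier run.
previous = {
    "sample": {"command": "draw_sample", "files_out": ["sample.mat"]},
    "quadratic": {"command": "square_v1", "files_out": ["quadratic.mat"]},
}
status = {"sample": "finished", "quadratic": "finished"}

# Current pipeline: quadratic was edited and a cleanup job was added.
current = {
    "sample": {"command": "draw_sample", "files_out": ["sample.mat"]},
    "quadratic": {"command": "square_v2", "files_out": ["quadratic.mat"]},
    "cleanup": {"command": "delete_sample", "files_out": []},
}
todo = initial_todo(current, previous, status)
```

Here `todo` holds `quadratic` and `cleanup`, while `sample` is left alone. A complete initialization would also propagate restarts to dependent jobs, which this sketch leaves out.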
Figure 7
An example of a dependency graph for the NIAK fMRI preprocessing pipeline. This example includes two subjects with two fMRI datasets each. The pipeline includes close to 100 jobs, and cleanup jobs have been removed to simplify the representation. Colors have been used to code the main stages of the preprocessing.
Figure 8
Benchmark experiments with the NIAK fMRI preprocessing pipeline. The distribution of execution time for all jobs on one server (peuplier) is shown in panel (A). The number of jobs running at any given time across the whole execution of the pipeline (averaged over 5 min time windows) is shown in panels (B–D) for servers peuplier, magma and guillimin, respectively. The user-specified maximum number of concurrent jobs is indicated by a straight line. The serial execution time of the pipeline, i.e., the sum of execution times for all jobs, is shown in panel (E). The parallel execution time, i.e., the time elapsed between the beginning and the end of the pipeline processing, is shown in panel (F). The speed-up factor, i.e., serial time divided by parallel time, is presented in panel (G), along with the ideal speed-up, equal to the user-specified maximum number of concurrent jobs. Finally, the parallelization efficiency (i.e., the ratio between the empirical speed-up and the ideal speed-up) is presented in panel (H).
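
The quantities defined for panels (E–H) reduce to simple arithmetic, sketched below. The function name and the numbers are made up for illustration and are not benchmark results from the paper.

```python
def speedup_and_efficiency(job_times, parallel_time, max_concurrent):
    """Compute serial time, speed-up, and parallelization efficiency as defined in the caption."""
    serial_time = sum(job_times)           # panel (E): sum of all job execution times
    speedup = serial_time / parallel_time  # panel (G): serial time / parallel time
    efficiency = speedup / max_concurrent  # panel (H): empirical / ideal speed-up
    return serial_time, speedup, efficiency


# Illustrative values only (minutes): four jobs, 40 min wall-clock, two concurrent jobs.
serial, speedup, efficiency = speedup_and_efficiency(
    job_times=[10.0, 20.0, 20.0, 10.0], parallel_time=40.0, max_concurrent=2
)
```

For these made-up values the serial time is 60 min, the speed-up is 1.5, and the efficiency is 0.75, meaning the run reached three quarters of the ideal speed-up of 2.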
