Review
Proteomics. 2020 May;20(9):e1900147.
doi: 10.1002/pmic.201900147. Epub 2019 Dec 18.

Scalable Data Analysis in Proteomics and Metabolomics Using BioContainers and Workflows Engines

Yasset Perez-Riverol et al. Proteomics. 2020 May.

Abstract

Recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming increasingly complex and convoluted, involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are designed as single-tiered software applications in which the analytics tasks cannot be distributed, which limits the scalability and reproducibility of the data analysis. This paper summarizes the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis. It discusses the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, it introduces to the proteomics and metabolomics communities a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments, Galaxy and Nextflow.

Keywords: bioconda; biocontainers; bioinformatics; containers; large scale data analysis; workflows.

Figures

Figure 1. Mass spectrometry metabolomics and proteomics bioinformatics workflows.
The proteomics lane (right) represents a database-search, label-free analysis workflow that includes feature detection on MS1 spectra, protein database creation, database search, statistical analysis and a final protein inference step. The metabolomics lane represents a common spectral search workflow.
Figure 2
The proposed roadmap to scale metabolomics and proteomics data analysis includes the packaging and containerization of each tool and software package using BioConda and BioContainers, and the design of bioinformatics workflows that use those containers and abstract the execution from the compute environment (e.g. cloud or HPC). A very important part of this design is the use of standard file formats, which allow the different tools and steps of the workflow to communicate.
Figure 3
BioContainers architecture, from the container request made by the user in GitHub to the final container deposited in DockerHub (https://hub.docker.com/u/biocontainers) and Quay.io (https://quay.io/organization/biocontainers). The BioContainers community, in collaboration with the BioConda community, defines a set of guidelines and protocols for creating a Conda package and a Docker container, including mandatory metadata, tests and trusted images [9]. The proposed architecture uses a continuous integration (CI) system to test and build the final containers and deposit them into public registries. All containers and tools can be searched from the BioContainers registry (http://biocontainers.pro/registry).
Figure 4. Nextflow allows bioinformaticians to perform analysis in different architectures with the same workflow definition.
(A) The workflow step (called a process) describes which task will be performed and its input/output parameters. The container section inside the blastSearch process states which container will be used, including the container name (blast) and the container version (v2.2.31_cv2). The command that will actually be executed in the container (in this case blast) appears between triple quotes. Stating the command explicitly is needed because one container can provide multiple tools. (B) The Nextflow config file (https://www.nextflow.io/docs/latest/config.html) defines how the workflow in (A) will be executed. The example defines two possible scenarios: local and lsf. If the user runs the workflow with the local configuration, it uses Docker containers; if the user runs it with lsf, it uses Singularity and the LSF cluster executor. (C) Directed acyclic graph for a peptide and protein identification workflow in Nextflow (https://github.com/bigbio/nf-workflows/tree/master/xt-msgf-nf).
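The separation between the workflow step (A) and the execution profile (B) can be illustrated with a small Python sketch. This is not Nextflow code and not how Nextflow is implemented; it only mimics the idea that the same tool command resolves to a different launch command depending on the selected profile. The container name and version come from the caption and the `biocontainers/` namespace from the DockerHub URL in Figure 3, while the `PROFILES` table, the `build_command` helper and the `blastp -version` arguments are hypothetical names chosen for illustration.

```python
import shlex

# Container name and version as given in the Figure 4 caption; the
# "biocontainers/" prefix is assumed from the DockerHub organisation URL.
IMAGE = "biocontainers/blast:v2.2.31_cv2"

# Hypothetical profile table mirroring the two scenarios of panel (B).
PROFILES = {
    "local": {"engine": "docker", "executor": "local"},
    "lsf": {"engine": "singularity", "executor": "lsf"},
}


def build_command(profile: str, tool_command: str, image: str = IMAGE) -> list[str]:
    """Return the argument vector that would launch tool_command under a profile."""
    settings = PROFILES[profile]
    tool_argv = shlex.split(tool_command)

    # The container engine decides how the image is invoked.
    if settings["engine"] == "docker":
        argv = ["docker", "run", "--rm", image] + tool_argv
    else:
        # Singularity can run an image pulled from a Docker registry.
        argv = ["singularity", "exec", f"docker://{image}"] + tool_argv

    # On an LSF cluster the whole command is submitted as a batch job.
    if settings["executor"] == "lsf":
        argv = ["bsub"] + argv
    return argv


if __name__ == "__main__":
    # The tool command is a placeholder; any tool shipped in the image would do.
    for name in PROFILES:
        print(name, "->", shlex.join(build_command(name, "blastp -version")))
```

The tool command stays identical across profiles, which is the property the caption highlights: only the execution settings change between a local Docker run and a Singularity run on an LSF cluster.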
Figure 5. A Galaxy workflow from PhenoMeNal H2020, used for processing LC-MS/MS data. It integrates XCMS, CAMERA, msnbase and MetFrag for matching detected fragments to potential small molecules.
Each box represents a tool step and is backed by a container that can execute that step (all CAMERA steps rely on the same CAMERA container). Plumbing such a pipeline through a scripting language would require considerable work, including any additional logic needed to execute on a cluster. In the case shown here, the workflow was created by dragging and dropping tools in Galaxy, and the analyst did not need to be concerned with how the workflow environment distributes the work across a large computational infrastructure: handling the cluster and designing the workflow become independent concerns, so the analyst can focus on the flow of the software tools within the pipeline. The Galaxy workflow can be shared as a single file and imported into other Galaxy instances.

References

    1. Griffiths WJ, Wang Y. Chem Soc Rev. 2009;38:1882.
    2. Perez-Riverol Y, Bai M, da Veiga Leprevost F, Squizzato S, Park YM, Haug K, Carroll AJ, Spalding D, Paschall J, Wang M, Del-Toro N, et al. Nat Biotechnol. 2017;35:406.
    3. Lynch C. Nature. 2008;455:28.
    4. Kanwal S, Khan FZ, Lonie A, Sinnott RO. BMC Bioinformatics. 2017;18:337.
    5. Perez-Riverol Y, Wang R, Hermjakob H, Muller M, Vesada V, Vizcaino JA. Biochim Biophys Acta. 2014;1844:63.
