Interoperable and scalable data analysis with microservices: applications in metabolomics

doi:10.1093/bioinformatics/btz160

Bioinformatics. 2019 Oct 1;35(19):3752-3760.

Payam Emami Khoonsari¹, Pablo Moreno², Sven Bergmann^{3,4}, Joachim Burman⁵, Marco Capuccini^{6,7}, Matteo Carone⁷, Marta Cascante^{8,9}, Pedro de Atauri^{8,9}, Carles Foguet^{8,9}, Alejandra N Gonzalez-Beltran¹⁰, Thomas Hankemeier¹¹, Kenneth Haug², Sijin He², Stephanie Herman^{1,7}, David Johnson¹⁰, Namrata Kale², Anders Larsson^{7,12}, Steffen Neumann^{13,14}, Kristian Peters¹³, Luca Pireddu¹⁵, Philippe Rocca-Serra¹⁰, Pierrick Roger¹⁶, Rico Rueedi^{3,4}, Christoph Ruttkies¹³, Noureddin Sadawi¹⁷, Reza M Salek¹⁸, Susanna-Assunta Sansone¹⁰, Daniel Schober¹³, Vitaly Selivanov^{8,9}, Etienne A Thévenot¹⁶, Michael van Vliet¹¹, Gianluigi Zanetti¹⁵, Christoph Steinbeck^{2,19}, Kim Kultima¹, Ola Spjuth⁷

Affiliations

¹ Department of Medical Sciences, Clinical Chemistry, Uppsala University, Uppsala, Sweden.
² European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK.
³ Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
⁴ Swiss Institute of Bioinformatics, Lausanne, Switzerland.
⁵ Department of Neuroscience, Uppsala University, Uppsala, Sweden.
⁶ Department of Information Technology, Uppsala University, Uppsala, Sweden.
⁷ Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden.
⁸ Department of Biochemistry and Molecular Biomedicine, and Institute of Biomedicine (IBUB), Faculty of Biology, Universitat de Barcelona (IBUB), Barcelona, Spain.
⁹ Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD) and Metabolomics Node at INB-Bioinformatics Platform, Instituto de Salud Carlos III (ISCIII), Madrid, Spain.
¹⁰ Oxford e-Research Centre, Department of Engineering Science, University of Oxford, Oxford, UK.
¹¹ Division of Analytical Biosciences, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands.
¹² National Bioinformatics Infrastructure Sweden, Uppsala University, Uppsala, Sweden.
¹³ Department of Stress and Developmental Biology, Leibniz Institute of Plant Biochemistry, Halle, Germany.
¹⁴ German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany.
¹⁵ CRS4: Center for Advanced Studies, Research and Development in Sardinia, Distributed Computing Group, Pula, Italy.
¹⁶ CEA, LIST, Laboratory for Data Analysis and Systems' Intelligence, MetaboHUB, Gif-sur-Yvette, France.
¹⁷ Faculty of Medicine, Department of Surgery & Cancer, Imperial College London, London, UK.
¹⁸ International Agency for Research on Cancer, 69372 Lyon CEDEX 08, France.
¹⁹ Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University, Jena, Germany.

PMID: 30851093
PMCID: PMC6761976
DOI: 10.1093/bioinformatics/btz160

Payam Emami Khoonsari et al. Bioinformatics. 2019.

Abstract

Motivation: Developing a robust and performant data analysis workflow that integrates all necessary components whilst still being able to scale over multiple compute nodes is a challenging task. We introduce a generic method based on the microservice architecture, where software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed using the Kubernetes container orchestrator.

Results: We developed a Virtual Research Environment (VRE) that facilitates rapid integration of new tools and the development of scalable and interoperable workflows for metabolomics data analysis. The environment can be launched on demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and workflows can be re-used effortlessly by any novice user. We validated our method in the field of metabolomics on two mass spectrometry studies, one nuclear magnetic resonance spectroscopy study and one fluxomics study. We showed that the method scales dynamically with increasing availability of computational resources. We demonstrated that the method facilitates interoperability by integrating the major software suites, resulting in a turn-key workflow encompassing all steps for mass-spectrometry-based metabolomics, including preprocessing, statistics and identification. Microservices are a generic methodology that can serve any scientific discipline and open the way to new types of large-scale integrative science.

Availability and implementation: The PhenoMeNal consortium maintains a web portal (https://portal.phenomenal-h2020.eu) providing a GUI for launching the Virtual Research Environment. The GitHub repository https://github.com/phnmnl/ hosts the source code of all projects.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Overview of the components in a microservices-based framework. Complex applications are divided into smaller, focused and well-defined (micro-)services. These services are independently deployable and can communicate with each other, which allows them to be coupled into complex task pipelines, i.e. data processing workflows. The user can interact with the framework programmatically via an Application Programming Interface (API) or via a graphical user interface (GUI) to construct or run workflows of different services, which are executed independently. Multiple instances of a service can be launched to execute tasks in parallel, which makes it possible to scale an analysis over multiple compute nodes. When run in an elastic cloud environment, virtual resources can be added or removed depending on the computational requirements.
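
The idea in this caption, small services chained into a workflow with several instances of each service running in parallel, can be illustrated with a minimal Python sketch. This is only an analogy under stated assumptions: the two functions `normalize` and `annotate` are hypothetical stand-ins for containerized tools, and a thread pool stands in for the container orchestrator. The framework described in the abstract uses Docker containers and Kubernetes, not threads.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(sample: str) -> str:
    """Hypothetical service 1: trims whitespace and lowercases a sample name."""
    return sample.strip().lower()


def annotate(sample: str) -> str:
    """Hypothetical service 2: tags a normalized sample as annotated."""
    return f"{sample}:annotated"


def run_workflow(samples, max_workers=4):
    """Chain the two services; each stage runs one call per sample in parallel.

    The thread pool is an illustrative substitute for launching several
    instances of a service; map() returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        normalized = list(pool.map(normalize, samples))
        annotated = list(pool.map(annotate, normalized))
    return annotated


if __name__ == "__main__":
    print(run_workflow([" S1 ", "S2", "s3 "]))
```

Because each stage only depends on the output of the previous one, either service could be replaced or scaled independently, which is the property the caption attributes to microservices.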

**Fig. 2.**
Diagram of scalability testing on a metabolomics dataset (MetaboLights ID: MTBLS233) in Demonstrator 1, illustrating the scalability of a microservice approach. A) The preprocessing workflow is composed of 5 OpenMS tasks that were run in parallel over the 12 groups in the dataset using the Luigi workflow system. The first two tasks, peak picking (528 tasks) and feature finding (528 tasks), are trivially parallelizable, hence they were run concurrently for each sample. The subsequent feature linking task needs to process all of the samples in a group at the same time, therefore 12 of these tasks were run in parallel. To maximize parallelism, each feature linker container (microservice) was run on 2 CPUs. Feature linking produces a single file for each group, which can be processed independently by the last two tasks, file filter (12 tasks) and text exporter (12 tasks), for a total of 1092 tasks. The downstream analysis consisted of 6 tasks that were carried out in a Jupyter Notebook. Briefly, the output of the preprocessing steps was imported into R and the unstable signals were filtered out. The missing values were imputed and the resulting number of features was plotted. B) The weak scaling efficiency (WSE) plot for Demonstrator 1. Given the full MTBLS233 dataset, the preprocessing was run on 40 Luigi workers. Then, for 1/4, 2/4 and 3/4 of MTBLS233, the analysis was run again on 10, 20 and 30 workers, respectively. For each run, we measured the processing times T₁₀, T₂₀, T₃₀ and T₄₀, and we computed WSEₙ = T₁₀/Tₙ for n = 10, 20, 30, 40. The WSE plot shows scalability up to 40 CPUs, where we achieved ∼88% scaling efficiency. The running time for the full dataset (a total of 1092 tasks) on 40 workers was ∼4 hours.
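
The arithmetic in this caption can be checked with a short Python sketch. The task counts are taken from the caption. The timings are hypothetical placeholders, not measured values: they were chosen only so that the 40-worker run takes 4 hours and the resulting efficiency lands near the reported ∼88%. The function name `weak_scaling_efficiency` is likewise an assumption for illustration.

```python
def weak_scaling_efficiency(t_base: float, t_n: float) -> float:
    """Return T_base / T_n, the weak scaling efficiency defined in the caption.

    In weak scaling the workload grows with the worker count, so an ideal
    system keeps the running time constant and the ratio stays at 1.0.
    """
    if t_n <= 0:
        raise ValueError("running time must be positive")
    return t_base / t_n


# Task counts as stated in the caption for the preprocessing workflow.
task_counts = {
    "peak_picking": 528,
    "feature_finding": 528,
    "feature_linking": 12,
    "file_filter": 12,
    "text_exporter": 12,
}
total_tasks = sum(task_counts.values())

# Hypothetical running times in hours, keyed by worker count (illustrative only).
timings_hours = {10: 3.5, 20: 3.7, 30: 3.9, 40: 4.0}
wse = {
    n: weak_scaling_efficiency(timings_hours[10], t)
    for n, t in timings_hours.items()
}

if __name__ == "__main__":
    print(f"total tasks: {total_tasks}")
    for n, efficiency in wse.items():
        print(f"WSE_{n} = {efficiency:.3f}")
```

With these placeholder timings the efficiency at 10 workers is 1.0 by definition, and any increase in running time at higher worker counts shows up directly as a value below 1.0.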

**Fig. 3.**
Overview of the workflow used to process multiple sclerosis samples in Demonstrator 2, in which the workflow was composed from microservices using the Galaxy system. The data were centroided and limited to a specific mass-to-charge (m/z) range using OpenMS tools. Mass trace quantification and retention time correction were done via XCMS (Smith *et al.*, 2006). Unstable signals were filtered out based on the blank and dilution series samples using an in-house function (implemented in R). Annotation of the peaks was performed using CAMERA (Kuhl *et al.*, 2012). To perform the metabolite identification, the tandem spectra from the MS/MS samples in mzML format were extracted using MSnbase and passed to MetFrag. The MetFrag scores were converted to q-values using the Passatutto software. The results of identification and quantification were used in the ‘Multivariate’ and ‘Univariate’ containers from Workflow4Metabolomics (Giacomoni *et al.*, 2015) to perform Partial Least Squares Discriminant Analysis (PLS-DA).

**Fig. 4.**
The results from the analysis of multiple sclerosis data in Demonstrator 2, presenting new scientifically useful biomedical knowledge. A) The PLS-DA results suggest that the metabolite distribution in the RRMS and SPMS samples differs from that in controls. B) Three metabolites were identified as differentially regulated between multiple sclerosis subtypes and control samples: Alanyltryptophan and Indoleacetic acid, with higher abundance, and Linoleoyl ethanolamide, with lower abundance, in both RRMS and SPMS compared to controls. Abbreviations: RRMS, relapsing-remitting multiple sclerosis; SPMS, secondary progressive multiple sclerosis.

**Fig. 5.**
Overview of the NMR workflow in Demonstrator 3. The raw NMR data and experimental metadata (ISATab) were automatically imported from the MetaboLights database and converted to the open source nmrML format. The preprocessing was performed using the rnmr1d package, which is part of the nmrprocflow tools. All study factors were imported from MetaboLights and fed to the multivariate node to perform an OPLS-DA.

**Fig. 6.**
Overview of the workflow for fluxomics, with the Ramid, Midcor, Iso2Flux and Escher-fluxomics tools supporting successive steps of the analysis. The example refers to HUVEC cells incubated in the presence of [1,2-¹³C₂]glucose, with label (¹³C) propagation to glycogen, RNA ribose and lactate measured by mass spectrometry. Ramid reads the raw netCDF files, corrects the baseline and extracts the peak intensities. The resulting peak intensities are corrected (natural abundance, overlapping peaks) by Midcor, which provides isotopologue abundances. Isotopologue abundances, together with a model description (SBML model, tracing data, constraints), are used by Iso2Flux to provide flux distributions through the glycolysis and pentose-phosphate pathways, which are shown by the Escher-fluxomics tool as numerical values associated with a metabolic scheme of the model.

Cited by

On-demand virtual research environments using microservices.
Capuccini M, Larsson A, Carone M, Novella JA, Sadawi N, Gao J, Toor S, Spjuth O. PeerJ Comput Sci. 2019 Nov 11;5:e232. doi: 10.7717/peerj-cs.232. eCollection 2019. PMID: 33816885. Free PMC article.

Tackling the Challenges of 21st-Century Open Science and Beyond: A Data Science Lab Approach.
Hollaway MJ, Dean G, Blair GS, Brown M, Henrys PA, Watkins J. Patterns (N Y). 2020 Sep 17;1(7):100103. doi: 10.1016/j.patter.2020.100103. eCollection 2020 Oct 9. PMID: 33205137. Free PMC article.

From biomedical cloud platforms to microservices: next steps in FAIR data and analysis.
Sheffield NC, Bonazzi VR, Bourne PE, Burdett T, Clark T, Grossman RL, Spjuth O, Yates AD. Sci Data. 2022 Sep 8;9(1):553. doi: 10.1038/s41597-022-01619-5. PMID: 36075919. Free PMC article.

Experience in Developing an FHIR Medical Data Management Platform to Provide Clinical Decision Support.
Semenov I, Osenev R, Gerasimov S, Kopanitsa G, Denisov D, Andreychuk Y. Int J Environ Res Public Health. 2019 Dec 20;17(1):73. doi: 10.3390/ijerph17010073. PMID: 31861851. Free PMC article.

Integration of magnetic resonance imaging and protein and metabolite CSF measurements to enable early diagnosis of secondary progressive multiple sclerosis.
Herman S, Khoonsari PE, Tolf A, Steinmetz J, Zetterberg H, Åkerfeldt T, Jakobsson PJ, Larsson A, Spjuth O, Burman J, Kultima K. Theranostics. 2018 Aug 7;8(16):4477-4490. doi: 10.7150/thno.26249. eCollection 2018. PMID: 30214633. Free PMC article.
