RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor
- PMID: 35392801
- PMCID: PMC8991469
- DOI: 10.1186/s12859-022-04648-4
Abstract
Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to important and still unsolved biomedical questions. Integrating and processing these data is crucial, especially for tertiary analysis of Next Generation Sequencing data, yet available big data strategies still mainly address primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
Results: We propose RGMQL, an R/Bioconductor package that provides a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different sources in different locations. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by allowing a procedural approach to handling omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.
Conclusions: RGMQL combines the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, as a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that illustrate its flexibility of use and its interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Keywords: Data scalability; Distribution transparency; Heterogeneous omics big data; Tertiary data analysis.
© 2022. The Author(s).
Conflict of interest statement
The authors declare that they have no competing interests.
Similar articles
- PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets. BMC Bioinformatics. 2019 Nov 8;20(1):560. doi: 10.1186/s12859-019-3159-9. PMID: 31703553. Free PMC article.
- GenoMetric Query Language: a novel approach to large-scale genomic data management. Bioinformatics. 2015 Jun 15;31(12):1881-8. doi: 10.1093/bioinformatics/btv048. Epub 2015 Feb 3. PMID: 25649616.
- Data Management for Heterogeneous Genomic Datasets. IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1251-1264. doi: 10.1109/TCBB.2016.2576447. Epub 2016 Jun 7. PMID: 27295683.
- Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinform Biol Insights. 2021 Jul 28;15:11779322211035921. doi: 10.1177/11779322211035921. eCollection 2021. PMID: 34376975. Free PMC article. Review.
- Using R and Bioconductor in Clinical Genomics and Transcriptomics. J Mol Diagn. 2020 Jan;22(1):3-20. doi: 10.1016/j.jmoldx.2019.08.006. Epub 2019 Oct 9. PMID: 31605800. Review.
Cited by
- Biologically weighted LASSO: enhancing functional interpretability in gene expression data analysis. Bioinformatics. 2024 Oct 1;40(10):btae605. doi: 10.1093/bioinformatics/btae605. PMID: 39412436. Free PMC article.
- Genomic data integration and user-defined sample-set extraction for population variant analysis. BMC Bioinformatics. 2022 Sep 29;23(1):401. doi: 10.1186/s12859-022-04927-0. PMID: 36175857. Free PMC article.
- Processing genome-wide association studies within a repository of heterogeneous genomic datasets. BMC Genom Data. 2023 Mar 3;24(1):13. doi: 10.1186/s12863-023-01111-y. PMID: 36869294. Free PMC article.