BMC Bioinformatics. 2022 Apr 7;23(1):123.
doi: 10.1186/s12859-022-04648-4.

RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor


Simone Pallotta et al. BMC Bioinformatics. 2022.

Abstract

Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.

Results: We propose RGMQL, an R/Bioconductor package that provides a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by offering a procedural approach to dealing with omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.

Conclusions: RGMQL combines the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, as a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that illustrate its flexibility of use and its interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.

Keywords: Data scalability; Distribution transparency; Heterogeneous omics big data; Tertiary data analysis.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Representation of the RGMQL package within the R/Bioconductor environment. REST Web services and Sequential execution modules can handle alternative RGMQL processing environments, together with their dependency links to httr and rJava R packages, respectively
Fig. 2
Representation of RGMQL functions for data import/export, both locally and remotely. A GMQLDataset is created by the read_GMQL() function from a local dataset (in GDM or a different tab-delimited format), or from a remote dataset (specifying is_local = FALSE). Any processing is applied to the GMQLDataset objects involved, and the computation and materialization of any result (remotely or locally) is deferred until the collect() and execute() functions are called. A GMQLDataset can also be created by the read_GRangesList() function from a GRangesList. Similarly, a GRangesList can be obtained from a remote dataset through the download_as_GRangesList() function, from a local dataset through the import_GMQL() function and, in local processing only, directly from a GMQLDataset through the take() function
Fig. 3
Top 20 genes by percentage of the 217 patients under analysis with the gene mutated
Fig. 4
Top 20 genes by number of mutations per gene length across the 217 patients considered
Fig. 5
Clusters from patient-wise hierarchical clustering on the first two dimensions of the data principal component analysis. The fraction of variance explained by each dimension is reported as percentage in the corresponding axis label
Fig. 6
Mosaic plot of the three clusters that emerged from patient-wise hierarchical clustering, compared with the published clustering results obtained in [48] using the K4 gene signature
Fig. 7
Mosaic plot of the three clusters that emerged from patient-wise hierarchical clustering, compared with the patient overall survival status annotations
Fig. 8
Plot of the transcription factor accumulation for chromosome 21 and of the 186 HOT zones (in red) identified according to the found accumulation threshold 5.6 (red line)
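
The Fig. 2 caption states that processing applied to a GMQLDataset is deferred until collect() and execute() are called. The toy sketch below illustrates that general deferred-evaluation pattern in Python. It is not RGMQL code and does not reproduce the package's API: the LazyDataset class, its filter() and collect() methods, and the dictionary-based region records are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical record type: one genomic region as a plain dictionary.
Region = Dict[str, object]
Step = Callable[[List[Region]], List[Region]]


@dataclass(frozen=True)
class LazyDataset:
    """Toy stand-in for a deferred dataset handle (NOT the RGMQL GMQLDataset)."""

    source: List[Region]
    steps: List[Step] = field(default_factory=list)

    def filter(self, predicate: Callable[[Region], bool]) -> "LazyDataset":
        # Only record the operation; no region is inspected at this point.
        def step(rows: List[Region]) -> List[Region]:
            return [row for row in rows if predicate(row)]

        return LazyDataset(self.source, self.steps + [step])

    def collect(self) -> List[Region]:
        # Materialize the result by running every recorded step in order.
        rows = list(self.source)
        for step in self.steps:
            rows = step(rows)
        return rows


regions = [
    {"chrom": "chr21", "start": 100, "end": 200},
    {"chrom": "chr1", "start": 100, "end": 900},
    {"chrom": "chr21", "start": 500, "end": 900},
]

# Building the pipeline does no work on the data.
pending = (
    LazyDataset(regions)
    .filter(lambda r: r["chrom"] == "chr21")
    .filter(lambda r: r["end"] - r["start"] >= 200)
)

# The work happens only here.
result = pending.collect()
print(result)
```

Under this pattern a pipeline can be described first and run later, which is what allows the same description to be executed either locally or on a remote service.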

References

    1. Stark Z, Dolman L, Manolio TA, Ozenberger B, Hill SL, Caulfield MJ, Levy Y, Glazer D, Wilson J, Lawler M, et al. Integrating genomics into healthcare: a global responsibility. Am J Hum Genet. 2019;104(1):13–20.
    2. Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, Staudt LM. Toward a shared vision for cancer genomic data. N Engl J Med. 2016;375(12):1109–1112.
    3. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–1120.
    4. 1000 Genomes Project Consortium, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061.
    5. ENCODE Project Consortium, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57.