PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets

Luca Nanni et al. BMC Bioinformatics. 2019 Nov 8;20(1):560. doi: 10.1186/s12859-019-3159-9

Abstract

Background: With the growth of available sequenced datasets, the analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists face the challenge of building efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing the experimental files of such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.

Results: We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. The package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.

Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the capabilities of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight its reproducibility, expressive power and scalability.

Keywords: Data scalability; Distribution transparency; Genomic data; Python; Tertiary data analysis.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic representation of the software components of PyGMQL. In the front-end, the GMQLDataset is a data structure associated with a query, referring directly to the DAG expressing the query operations. The GDataframe stores the query result and enables in-memory manipulation of the data. The front-end also provides a module for loading and storing data, and a RemoteManager module, used for message interchange between the package and an external GMQL service. The back-end interacts with the front-end through a Manager module, which maps the operations specified in Python to the GMQL operators implemented in Spark.
Fig. 2
Relationships between GMQLDataset and GDataframe. Data can be imported into a GMQLDataset from a local GDM dataset with the load_from_path function. Using load_from_file, it is possible to load generic BED files, while load_from_remote enables the loading of GDM datasets from an external GMQL repository, accessible through a TCP connection. The user applies operations on the GMQLDataset and triggers the computation of the result with the materialize function. At the end of the computation, the result is stored in memory in a GDataframe, which can then be manipulated in Python. It is also possible to import data directly from Pandas with from_pandas. Finally, a GDataframe structure can be transformed back into a GMQLDataset using the to_GMQLDataset function.
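
The following minimal sketch illustrates this round trip under stated assumptions: the package is imported as gmql (a common convention in the PyGMQL documentation), the local path and the column names are hypothetical, and the GDataframe attribute names used for regions and metadata (regs, meta) as well as the exact keyword arguments may differ in the installed version.

    import pandas as pd
    import gmql as gl   # import name assumed

    # Lazily define a GMQLDataset from a GDM dataset on the local file system
    # (the path is hypothetical)
    genes = gl.load_from_path(local_path="./data/genes_gdm/")

    # Trigger the computation: the result is loaded in memory as a GDataframe
    result = genes.materialize()

    # The GDataframe exposes region data and metadata for Pandas-style analysis
    # (attribute names assumed)
    print(result.regs.head())
    print(result.meta.head())

    # A plain Pandas dataframe can be wrapped into a GDataframe with from_pandas
    # (keyword names for the coordinate columns are assumed) ...
    df = pd.DataFrame({"chr": ["chr1"], "start": [100], "stop": [200]})
    gdf = gl.from_pandas(df, chr_name="chr", start_name="start", stop_name="stop")

    # ... and converted back into a GMQLDataset for further querying
    ds = gdf.to_GMQLDataset()
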
Fig. 3
Deployment modes and executor options of the library. When the library is in remote mode, it interfaces with an external GMQL service hosting a GMQL repository (which may be deployed on several file systems) accessible by the Python program. When the mode is set to local, the library can operate on various file systems, depending on the selected master.
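
As a complement to the figure, the sketch below shows how the two modes might be selected in a Python session. The configuration functions (set_master, set_mode, set_remote_address, login), the service URL and the remote dataset name are assumptions for illustration, not guaranteed parts of the published API.

    import gmql as gl   # import name assumed

    # Local mode: queries run on the embedded Spark backend; the Spark master
    # (assumed setter) determines which file system / cluster is addressed
    gl.set_master("local[*]")
    gl.set_mode("local")

    # Remote mode: the query DAG is shipped to an external GMQL service that
    # hosts its own repository (query outsourcing)
    gl.set_remote_address("http://example-gmql-service/gmql-rest/")  # hypothetical URL
    gl.login()              # login call, name assumed
    gl.set_mode("remote")

    # Datasets loaded with load_from_remote now refer to the remote repository
    peaks = gl.load_from_remote("ENCODE_BROADPEAK", owner="public")  # dataset name hypothetical
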
Fig. 4
Schematic representation of the deployment strategies adopted in the three applications. a Local/remote system interaction for the analysis of ENCODE histone mark signals on promoter regions. The gene dataset is stored in the local file system, while the ENCODE BroadPeak database is hosted in the remote GMQL repository, deployed on the Hadoop file system with three slaves. b Configuration for the interactive analysis of the GWAS dataset against the whole set of enhancers from ENCODE. The library interacts directly with the YARN cluster and the data is stored in the Google Cloud File System with a fixed configuration of three slaves, accessed through the Hadoop engine. The gwas.tsv file is downloaded from the web and stored in the file system before executing the query. c Distributed setup for running the TICA query. Three datasets (from ENCODE and GENCODE) are in GDM format and stored in HDFS; the query runs on Amazon Web Services with a variable number of slave nodes, in order to evaluate the scalability of the system.
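
To make the local/remote interaction of panel a more concrete, the hedged sketch below mixes a local gene dataset with a remote ENCODE dataset and materializes the result locally. The import name, paths, remote dataset name and the Python signature of the map operation are assumptions, and the promoter-window construction of the actual use case is omitted for brevity.

    import gmql as gl   # import name assumed

    gl.set_remote_address("http://example-gmql-service/gmql-rest/")  # hypothetical URL
    gl.set_mode("remote")   # outsource the heavy part of the query

    # Local gene annotation (path hypothetical) and remote ENCODE BroadPeak dataset
    genes = gl.load_from_path(local_path="./data/genes/")
    peaks = gl.load_from_remote("ENCODE_BROADPEAK", owner="public")  # name hypothetical

    # Map the remote peaks onto the local gene regions, aggregating the peaks
    # that overlap each region (operation name from the GMQL vocabulary;
    # exact Python signature assumed)
    mapped = genes.map(peaks)

    # materialize() brings the result back as a GDataframe for local analysis
    result = mapped.materialize()
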
Fig. 5
Execution time for the TICA query on three different cell lines, with four different cluster configurations

