PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets

Luca Nanni et al. BMC Bioinformatics. 2019 Nov 8;20(1):560. doi: 10.1186/s12859-019-3159-9

Abstract

Background: With the growth of available sequenced datasets, the analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists face the challenge of building efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing the experimental files of such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.

Results: We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. The package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.

Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the capabilities of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight its reproducibility, expressive power and scalability.

Keywords: Data scalability; Distribution transparency; Genomic data; Python; Tertiary data analysis.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic representation of the software components of PyGMQL. In the front-end, the GMQLDataset is a data structure associated with a query, referring directly to the DAG expressing the query operations. The GDataframe stores the query result and enables in-memory manipulation of the data. The front-end also provides a module for loading and storing data, and a RemoteManager module, used for message interchange between the package and an external GMQL service. The back-end interacts with the front-end through a Manager module, which maps the operations specified in Python to the GMQL operators implemented in Spark.
Fig. 2
Relationships between GMQLDataset and GDataframe. Data can be imported into a GMQLDataset from a local GDM dataset with the load_from_path function. Using load_from_file, it is possible to load generic BED files, while load_from_remote enables the loading of GDM datasets from an external GMQL repository, accessible through a TCP connection. The user applies operations on the GMQLDataset and triggers the computation of the result with the materialize function. At the end of the computation, the result is stored in memory in a GDataframe, which can then be manipulated in Python. It is also possible to import data directly from Pandas with from_pandas. Finally, a GDataframe structure can be transformed back into a GMQLDataset using the to_GMQLDataset function.
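
The following minimal sketch illustrates this round trip under stated assumptions: the package is imported as gmql (a common convention in the PyGMQL documentation), the local path and the column names are hypothetical, and the GDataframe attribute names used for regions and metadata (regs, meta) as well as the exact keyword arguments may differ in the installed version.

    import pandas as pd
    import gmql as gl   # import name assumed

    # Lazily define a GMQLDataset from a GDM dataset on the local file system
    # (the path is hypothetical)
    genes = gl.load_from_path(local_path="./data/genes_gdm/")

    # Trigger the computation: the result is loaded in memory as a GDataframe
    result = genes.materialize()

    # The GDataframe exposes region data and metadata for Pandas-style analysis
    # (attribute names assumed)
    print(result.regs.head())
    print(result.meta.head())

    # A plain Pandas dataframe can be wrapped into a GDataframe with from_pandas
    # (keyword names for the coordinate columns are assumed) ...
    df = pd.DataFrame({"chr": ["chr1"], "start": [100], "stop": [200]})
    gdf = gl.from_pandas(df, chr_name="chr", start_name="start", stop_name="stop")

    # ... and converted back into a GMQLDataset for further querying
    ds = gdf.to_GMQLDataset()
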
Fig. 3
Deployment modes and executor options of the library. When the library is in remote mode, it interfaces with an external GMQL service hosting a GMQL repository (which may be deployed on several file systems) accessible by the Python program. When the mode is set to local, the library can operate on various file systems, depending on the selected master.
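
As a complement to the figure, the sketch below shows how the two modes might be selected in a Python session. The configuration functions (set_master, set_mode, set_remote_address, login), the service URL and the remote dataset name are assumptions for illustration, not guaranteed parts of the published API.

    import gmql as gl   # import name assumed

    # Local mode: queries run on the embedded Spark backend; the Spark master
    # (assumed setter) determines which file system / cluster is addressed
    gl.set_master("local[*]")
    gl.set_mode("local")

    # Remote mode: the query DAG is shipped to an external GMQL service that
    # hosts its own repository (query outsourcing)
    gl.set_remote_address("http://example-gmql-service/gmql-rest/")  # hypothetical URL
    gl.login()              # login call, name assumed
    gl.set_mode("remote")

    # Datasets loaded with load_from_remote now refer to the remote repository
    peaks = gl.load_from_remote("ENCODE_BROADPEAK", owner="public")  # dataset name hypothetical
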
Fig. 4
Schematic representation of the deployment strategies adopted in the three applications. a Local/remote system interaction for the analysis of ENCODE histone mark signals on promoter regions. The gene dataset is stored in the local file system, while the ENCODE BroadPeak database is hosted in the remote GMQL repository, deployed on the Hadoop file system with three slaves. b Configuration for the interactive analysis of the GWAS dataset against the whole set of enhancers from ENCODE. The library interacts directly with the YARN cluster and the data is stored in the Google Cloud File System with a fixed configuration of three slaves, accessed through the Hadoop engine. The gwas.tsv file is downloaded from the web and stored in the file system before executing the query. c Distributed setup for running the TICA query. Three datasets (from ENCODE and GENCODE) are in GDM format and stored in HDFS; the query runs on Amazon Web Services with a variable number of slave nodes, in order to evaluate the scalability of the system.
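
To make the local/remote interaction of panel a more concrete, the hedged sketch below mixes a local gene dataset with a remote ENCODE dataset and materializes the result locally. The import name, paths, remote dataset name and the Python signature of the map operation are assumptions, and the promoter-window construction of the actual use case is omitted for brevity.

    import gmql as gl   # import name assumed

    gl.set_remote_address("http://example-gmql-service/gmql-rest/")  # hypothetical URL
    gl.set_mode("remote")   # outsource the heavy part of the query

    # Local gene annotation (path hypothetical) and remote ENCODE BroadPeak dataset
    genes = gl.load_from_path(local_path="./data/genes/")
    peaks = gl.load_from_remote("ENCODE_BROADPEAK", owner="public")  # name hypothetical

    # Map the remote peaks onto the local gene regions, aggregating the peaks
    # that overlap each region (operation name from the GMQL vocabulary;
    # exact Python signature assumed)
    mapped = genes.map(peaks)

    # materialize() brings the result back as a GDataframe for local analysis
    result = mapped.materialize()
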
Fig. 5
Execution time for the TICA query on three different cell lines, with four different cluster configurations

