PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets
- PMID: 31703553
- PMCID: PMC6842186
- DOI: 10.1186/s12859-019-3159-9
Abstract
Background: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists face the challenge of building efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.
Results: We present PyGMQL, a novel software package for manipulating region-based genomic files and their associated metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for manipulating region data and their metadata; these functions scale to arbitrary clusters and apply implicitly to thousands of files, producing millions of regions. The package offers data interoperability, distribution transparency and query outsourcing. It integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. This integration supports data interoperability by resolving the impedance mismatch between executing set-oriented queries and programming in Python. Distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) are provided in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.
Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
Keywords: Data scalability; Distribution transparency; Genomic data; Python; Tertiary data analysis.
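The abstract describes functions that act on region data and metadata together and apply implicitly to every file in a dataset. The following is a minimal plain-Python sketch of that idea only; it does not use the PyGMQL API, and the names `Region`, `Sample` and `select`, as well as the toy data, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One genomic region with a single illustrative attribute (score)."""
    chrom: str
    start: int
    stop: int
    score: float


@dataclass
class Sample:
    """One experiment file: its metadata plus the regions it contains."""
    metadata: dict
    regions: list


def select(dataset, meta_predicate, region_predicate):
    """Hypothetical set-oriented selection (not a PyGMQL call).

    Keeps the samples whose metadata satisfy meta_predicate and, within
    each kept sample, only the regions satisfying region_predicate.
    The caller never loops over individual files.
    """
    result = []
    for sample in dataset:
        if meta_predicate(sample.metadata):
            kept = [r for r in sample.regions if region_predicate(r)]
            result.append(Sample(metadata=dict(sample.metadata), regions=kept))
    return result


# Toy dataset: three samples with heterogeneous metadata (made-up values).
dataset = [
    Sample({"assay": "ChIP-seq", "cell": "K562"},
           [Region("chr1", 100, 200, 0.9), Region("chr2", 50, 80, 0.2)]),
    Sample({"assay": "RNA-seq", "cell": "K562"},
           [Region("chr1", 10, 60, 0.7)]),
    Sample({"assay": "ChIP-seq", "cell": "HepG2"},
           [Region("chr1", 300, 450, 0.4), Region("chr3", 5, 25, 0.8)]),
]

# One declarative call filters by metadata and by region attribute at once.
chip_strong = select(
    dataset,
    meta_predicate=lambda m: m["assay"] == "ChIP-seq",
    region_predicate=lambda r: r.score >= 0.5,
)
```

In this sketch a single call expresses both the metadata condition and the region condition over the whole dataset; a system such as the one described would evaluate an equivalent query on a distributed engine rather than in a Python loop.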
Conflict of interest statement
The authors declare that they have no competing interests.
Similar articles
- RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor. BMC Bioinformatics. 2022 Apr 7;23(1):123. doi: 10.1186/s12859-022-04648-4. PMID: 35392801. Free PMC article.
- GenoMetric Query Language: a novel approach to large-scale genomic data management. Bioinformatics. 2015 Jun 15;31(12):1881-8. doi: 10.1093/bioinformatics/btv048. Epub 2015 Feb 3. PMID: 25649616.
- Data Management for Heterogeneous Genomic Datasets. IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1251-1264. doi: 10.1109/TCBB.2016.2576447. Epub 2016 Jun 7. PMID: 27295683.
- Bioinformatics applications on Apache Spark. Gigascience. 2018 Aug 1;7(8):giy098. doi: 10.1093/gigascience/giy098. PMID: 30101283. Free PMC article. Review.
- A proteomics sample metadata representation for multiomics integration and big data analysis. Nat Commun. 2021 Oct 6;12(1):5854. doi: 10.1038/s41467-021-26111-3. PMID: 34615866. Free PMC article. Review.
Cited by
- GenoSurf: metadata driven semantic search system for integrated genomic datasets. Database (Oxford). 2019 Jan 1;2019:baz132. doi: 10.1093/database/baz132. PMID: 31820804. Free PMC article.
- GeMI: interactive interface for transformer-based Genomic Metadata Integration. Database (Oxford). 2022 Jun 3;2022:baac036. doi: 10.1093/database/baac036. PMID: 35657113. Free PMC article.
- RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor. BMC Bioinformatics. 2022 Apr 7;23(1):123. doi: 10.1186/s12859-022-04648-4. PMID: 35392801. Free PMC article.
- Spatial patterns of CTCF sites define the anatomy of TADs and their boundaries. Genome Biol. 2020 Aug 12;21(1):197. doi: 10.1186/s13059-020-02108-x. PMID: 32782014. Free PMC article.
- Genomic data integration and user-defined sample-set extraction for population variant analysis. BMC Bioinformatics. 2022 Sep 29;23(1):401. doi: 10.1186/s12859-022-04927-0. PMID: 36175857. Free PMC article.