GenoMetric Query Language: a novel approach to large-scale genomic data management
- PMID: 25649616
- DOI: 10.1093/bioinformatics/btv048
Abstract
Motivation: Improvements in sequencing technologies and data processing pipelines are rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. These data allow for data-driven genomic, transcriptomic and epigenomic characterizations, but they require state-of-the-art 'big data' computing strategies, with abstraction levels beyond the capabilities of available tools.
Results: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic 'big data' analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data, together with their associated experimental, biological and clinical metadata, and interoperability between many data formats. Based on the Hadoop framework and the Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets.
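The data model described in the Results can be illustrated with a minimal Python sketch. This is not GMQL syntax and not the toolkit's implementation; it only assumes that a dataset is a collection of samples, each pairing a list of genomic regions with attribute-value metadata. All names here (`Region`, `Sample`, `select_samples`, `overlaps`) and the example values are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """A genomic region: chromosome, half-open interval [start, stop), strand."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"


@dataclass
class Sample:
    """One sample: its regions plus free-form attribute-value metadata."""
    sample_id: str
    regions: List[Region] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def select_samples(dataset: List[Sample], **conditions: str) -> List[Sample]:
    """Keep samples whose metadata match every attribute=value condition."""
    return [
        s for s in dataset
        if all(s.metadata.get(key) == value for key, value in conditions.items())
    ]


def overlaps(a: Region, b: Region) -> bool:
    """True when two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


# Illustrative data only; identifiers and coordinates are made up.
dataset = [
    Sample("s1", [Region("chr1", 100, 200)], {"dataType": "ChipSeq", "cell": "K562"}),
    Sample("s2", [Region("chr1", 150, 250)], {"dataType": "RnaSeq", "cell": "K562"}),
    Sample("s3", [Region("chr2", 100, 200)], {"dataType": "ChipSeq", "cell": "HeLa"}),
]

chipseq = select_samples(dataset, dataType="ChipSeq")
```

Keeping regions and metadata together in one sample object is what lets a single query filter on experimental or clinical attributes and then operate on the matching regions, whatever file format the data originally came from.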
Availability and implementation: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.