BMC Bioinformatics. 2023 Sep 21;24(1):354.

doi: 10.1186/s12859-023-05470-2.

Critical assessment of on-premise approaches to scalable genome analysis

Amira Al-Aamri^#¹, Syafiq Kamarul Azman^#¹, Gihan Daw Elbait^{2,3}, Habiba Alsafar^{3,4}, Andreas Henschel^{5,6}

Affiliations

¹ Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
² Department of Biology, College of Arts and Sciences, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
³ Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
⁴ Department of Biomedical Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
⁵ Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates. Andreas.henschel@ku.ac.ae.
⁶ Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates. Andreas.henschel@ku.ac.ae.

^# Contributed equally.

PMID: 37735350
PMCID: PMC10512525
DOI: 10.1186/s12859-023-05470-2

Amira Al-Aamri et al. BMC Bioinformatics. 2023.

Abstract

Background: Plummeting DNA sequencing costs in recent years have enabled genome sequencing projects to scale up by several orders of magnitude, transforming genomics into a highly data-intensive field of research. This development provides the much-needed statistical power required for genotype-phenotype predictions in complex diseases.

Methods: To leverage this wealth of information efficiently, we assessed several genomic data science tools. We focus on on-premise installations to cover situations where data confidentiality, compliance regulations, and similar constraints rule out cloud-based solutions. We established a comprehensive qualitative and quantitative comparison of BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability.

Results: Tools that leverage sophisticated data structures proved the most suitable for large-scale projects, with varying degrees of scalability, compared with flat-file manipulation tools (e.g., BCFtools and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database.

Conclusion: The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.

Keywords: Big data; Genomic data science; Genomic databases; Horizontal scaling; NoSQL; SQL; VCF.
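To illustrate the contrast the abstract draws between flat-file scans and database-backed storage, the following is a minimal sketch using Python's built-in `sqlite3` module (GEMINI stores variants in SQLite). The table layout, the index name `idx_chrom_pos`, the helper names, and the toy variants are all hypothetical and are not taken from any of the assessed tools. It shows a lookup by variant identifier with and without a chromosome constraint, where only the constrained query can use the position index.

```python
import sqlite3

# Hypothetical toy variants: (chrom, pos, variant_id, ref, alt).
VARIANTS = [
    ("1", 10177, "rs0001", "A", "AC"),
    ("1", 10352, "rs0002", "T", "TA"),
    ("5", 11882, "rs0003", "G", "A"),
    ("5", 12001, "rs0004", "CT", "C"),
]


def build_db():
    """Create an in-memory variant table indexed by (chrom, pos)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, variant_id TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", VARIANTS)
    # Assumed index: mirrors the region indexing common to VCF tooling.
    conn.execute("CREATE INDEX idx_chrom_pos ON variants (chrom, pos)")
    conn.commit()
    return conn


def build_query(variant_id, chrom=None):
    """Return (sql, params) for an identifier lookup, optionally chromosome-bound."""
    if chrom is None:
        return (
            "SELECT chrom, pos, variant_id FROM variants WHERE variant_id = ?",
            (variant_id,),
        )
    return (
        "SELECT chrom, pos, variant_id FROM variants "
        "WHERE chrom = ? AND variant_id = ?",
        (chrom, variant_id),
    )


def find_by_id(conn, variant_id, chrom=None):
    """Run the identifier lookup and return the matching rows."""
    sql, params = build_query(variant_id, chrom)
    return conn.execute(sql, params).fetchall()


def query_plan(conn, variant_id, chrom=None):
    """Return SQLite's plan descriptions, to inspect whether an index is used."""
    sql, params = build_query(variant_id, chrom)
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


if __name__ == "__main__":
    conn = build_db()
    print(find_by_id(conn, "rs0003"))
    print(query_plan(conn, "rs0003"))
    print(find_by_id(conn, "rs0003", chrom="5"))
    print(query_plan(conn, "rs0003", chrom="5"))
```

Printing the two query plans side by side is a quick way to see whether the chromosome constraint lets the engine narrow the search before comparing identifiers; the exact plan wording depends on the SQLite version.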

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Fig. 1**
The general workflow of a genomics data science solution. The input is a VCF file produced by a variant calling pipeline, which may be transformed into a storage system. Variants are then annotated from a variety of sources and fed back into the storage. The contents of the VCF file can be queried via a client or a program for later analysis.

**Fig. 2**
Query performance comparison for all studied tools when querying for a unique variant by its identifier, with and without providing the chromosome. Chromosome regions are shown as bands of dark and light rectangles. BCFtools and GEMINI results are presented on a log scale: because query times for chromosome-bound and regular queries differ by orders of magnitude, the log scale better displays the intricate patterns when querying with region indexing.

**Fig. 3**
Query performance comparison between all studied tools to query for all INDEL-typed variants located in chromosome 5

**Fig. 4**
Query performance comparison between all studied tools to query for all variant sites where all samples in the study have a homozygous genotype

**Fig. 5**
Time (in hours) taken by the studied tools to annotate the variants with the allele frequencies of patients and controls. The annotation time is shown for different numbers of samples

**Fig. 6**
Query performance comparison of studied tools for different numbers of samples to retrieve all variants that appear in more than 40% of control samples and less than or equal to 40% of patient samples
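The benchmark queries described in the captions of Figs. 2, 3, 4, and 6 can be sketched in plain Python over a small in-memory variant list. This is only an illustration of what each query asks for, not the implementation used by any of the assessed tools. The `Variant` class, the sample names, and the toy genotypes are hypothetical, and "appears in a sample" is assumed here to mean that the sample carries at least one alternate allele.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    vid: str
    ref: str
    alt: str
    genotypes: dict  # sample name -> (allele, allele); 0 = ref, 1 = alt


# Hypothetical cohort and toy data.
CONTROLS = ["c1", "c2"]
PATIENTS = ["p1", "p2"]
VARIANTS = [
    Variant("5", 100, "rs1", "A", "AT",
            {"c1": (0, 1), "c2": (1, 1), "p1": (0, 0), "p2": (0, 0)}),
    Variant("5", 200, "rs2", "G", "C",
            {"c1": (0, 0), "c2": (0, 0), "p1": (1, 1), "p2": (1, 1)}),
    Variant("1", 300, "rs3", "CT", "C",
            {"c1": (0, 1), "c2": (0, 0), "p1": (0, 1), "p2": (0, 1)}),
]


def by_id(variants, vid, chrom=None):
    """Fig. 2: find a variant by identifier, optionally restricted to a chromosome."""
    return [v for v in variants
            if v.vid == vid and (chrom is None or v.chrom == chrom)]


def is_indel(v):
    """An INDEL changes sequence length, so ref and alt differ in length."""
    return len(v.ref) != len(v.alt)


def indels_on(variants, chrom):
    """Fig. 3: all INDEL-typed variants on one chromosome."""
    return [v for v in variants if v.chrom == chrom and is_indel(v)]


def all_homozygous(variants):
    """Fig. 4: sites where every sample has two identical alleles."""
    return [v for v in variants
            if all(a == b for a, b in v.genotypes.values())]


def carrier_fraction(v, samples):
    """Fraction of the given samples carrying at least one alternate allele."""
    carriers = sum(1 for s in samples if any(a > 0 for a in v.genotypes[s]))
    return carriers / len(samples)


def enriched_in_controls(variants, threshold=0.4):
    """Fig. 6: in more than 40% of controls and at most 40% of patients."""
    return [v for v in variants
            if carrier_fraction(v, CONTROLS) > threshold
            and carrier_fraction(v, PATIENTS) <= threshold]


if __name__ == "__main__":
    print([v.vid for v in by_id(VARIANTS, "rs3")])
    print([v.vid for v in indels_on(VARIANTS, "5")])
    print([v.vid for v in all_homozygous(VARIANTS)])
    print([v.vid for v in enriched_in_controls(VARIANTS)])
```

Each function scans the full list, which corresponds to the flat-file style of access; the tools compared in the figures differ mainly in how much of that scan they can avoid through indexing or specialized storage.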
