Bioinformatics. 2023 Jan 1;39(1):btac804.

doi: 10.1093/bioinformatics/btac804.

Cloud-native distributed genomic pileup operations

Marek Wiewiórka¹, Agnieszka Szmurło¹, Paweł Stankiewicz², Tomasz Gambin¹

Affiliations

¹ Institute of Computer Science, Warsaw University of Technology, Warsaw, Warsaw 00-661, Poland.
² Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.

PMID: 36515465
PMCID: PMC9848050
DOI: 10.1093/bioinformatics/btac804

Marek Wiewiórka et al. Bioinformatics. 2023.

Abstract

Motivation: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay because straightforward pileup implementations process all base calls from all alignments sequentially. A distributed version of the algorithm, in turn, faces the intrinsic challenge of splitting read-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes.
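To make the bottleneck concrete, here is a minimal sketch of a sequential pileup. It is an illustration only, not SeQuiLa's implementation: the function name `naive_pileup` and the input shape (gapless reads given as a 0-based start and a sequence) are assumptions made for brevity, and real alignments would also need CIGAR handling.

```python
from collections import Counter, defaultdict

def naive_pileup(reads):
    """Count observed bases per reference position.

    reads: iterable of (start, sequence) tuples with a 0-based start.
    Assumption: every read aligns without gaps or clipping.
    """
    columns = defaultdict(Counter)
    # Every base call of every read is visited exactly once, one after another,
    # so the cost grows with the total number of sequenced bases.
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return {pos: dict(counts) for pos, counts in sorted(columns.items())}

reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC")]
pileup = naive_pileup(reads)
```

The single loop over all base calls is what a distributed version has to split across nodes without losing reads that span a split point.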

Results: Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deployment in cloud computing environments. In particular, we implemented: (i) a custom data-partitioning algorithm optimized to work with alignment reads, (ii) a novel approach to processing alignment events from sequencing reads using the MD tags, (iii) source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We show that our approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, such as Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient, interactive analyses of population-scale genomic data.
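The abstract does not describe how the MD tags are processed, so the following is only a hedged sketch of the general idea: an MD tag (as defined for SAM optional fields) encodes match runs, mismatched reference bases and deleted reference bases, which lets mismatch and deletion events be located without consulting the reference sequence. The names `parse_md` and `MD_TOKEN` and the event tuple layout are hypothetical.

```python
import re

# One MD token is either a run of matches, a deletion (^ plus reference bases),
# or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")

def parse_md(md):
    """Return (events, reference_length) for an MD tag string.

    events: list of (reference_offset, kind, reference_bases),
    where kind is "MISMATCH" or "DEL". Offsets are relative to the
    alignment start. Insertions and clipping are not recorded in MD
    and would have to come from the CIGAR string.
    """
    events = []
    ref_offset = 0
    for match_len, deleted, mismatch in MD_TOKEN.findall(md):
        if match_len:
            ref_offset += int(match_len)      # matching bases: just advance
        elif deleted:
            events.append((ref_offset, "DEL", deleted))
            ref_offset += len(deleted)
        else:
            events.append((ref_offset, "MISMATCH", mismatch))
            ref_offset += 1
    return events, ref_offset

events, ref_len = parse_md("10A5^AC6")
```

For the tag `10A5^AC6`, the sketch reports a mismatch at offset 10 and a two-base deletion starting at offset 16, over 24 reference positions in total.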

Availability and implementation: https://biodatageeks.github.io/sequila/.

Figures

**Fig. 1.**
Reads-aware partitioning algorithm: original distributed partitions (A); read assignment (color coded) to original partitions according to alignment starting position (B); virtual partitions and their boundaries calculated by Algorithm 1 (C); coalesced partitions (D); and read assignment (color coded) to coalesced partitions and corresponding virtual partitions (E). Note that some of the reads will be processed in more than one coalesced partition. This approach produces, on average, equally sized virtual partitions (no data skewness), except for the first and the last, which are slightly larger and slightly smaller than the rest, respectively.
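Algorithm 1 itself is not reproduced in this record, so the sketch below does not implement it. It only illustrates the property the caption points out: if every partition receives all reads overlapping its genomic range, a read that crosses a boundary is processed in more than one partition, and each partition can compute its own slice of the result with no data exchange. The function names, the half-open `(start, end)` read intervals and the fixed cut positions are all assumptions.

```python
from collections import Counter

def coverage(reads, lo, hi):
    """Depth per position, restricted to the half-open range [lo, hi)."""
    depth = Counter()
    for start, end in reads:
        for pos in range(max(start, lo), min(end, hi)):
            depth[pos] += 1
    return depth

def self_contained_partitions(reads, cuts):
    """Assign to each range [cuts[i], cuts[i+1]) every read that overlaps it.

    A read spanning a cut position is deliberately duplicated into both
    neighbouring partitions.
    """
    bounds = list(zip(cuts[:-1], cuts[1:]))
    return [((lo, hi), [r for r in reads if r[0] < hi and r[1] > lo])
            for lo, hi in bounds]

reads = [(0, 5), (3, 8), (7, 12), (10, 15)]
cuts = [0, 8, 15]  # hypothetical boundaries, not derived by Algorithm 1
parts = self_contained_partitions(reads, cuts)

# Each partition works only on its own reads and its own range.
merged = Counter()
for (lo, hi), part_reads in parts:
    merged.update(coverage(part_reads, lo, hi))
```

Here the read `(7, 12)` lands in both partitions, and merging the per-partition results reproduces the coverage computed over all reads at once.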

**Fig. 2.**
SeQuiLa extensions to Apache Spark Catalyst optimizer

**Fig. 3.**
SeQuiLa deployment on GKE with spark-on-k8s-operator with Kubernetes Custom Resource Definition, Prometheus for runtime metrics collection and Grafana as observability platform

**Fig. 4.**
Pileup summary function comparison. Tests were performed on a single node for ES (A), WGS (B), and on the Hadoop cluster for ES (C), WGS (D)

**Fig. 5.**
Impact of various optimization techniques on overall performance (percentage of time reduction) as compared with the baseline (all optimizations off) for pileup computation. The first bar shows the performance gain when all optimizations are on

**Fig. 6.**
Depth of coverage function comparison. Tests were performed on a single node for ES (A), WGS (B), and on the Hadoop cluster for ES (C), WGS (D). SeQuiLa-pileup designates the execution time of the full pileup calculations; SeQuiLa-pileup-no-qual indicates the execution time of the simplified pileup calculations in which base qualities are not computed
