Bioinformatics. 2023 Jan 1;39(1):btac804.
doi: 10.1093/bioinformatics/btac804.

Cloud-native distributed genomic pileup operations

Marek Wiewiórka et al. Bioinformatics.

Abstract

Motivation: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become the bottleneck of the entire assay because straightforward pileup implementations process all base calls from all alignments sequentially. A distributed version of the algorithm, on the other hand, faces the intrinsic challenge of splitting read-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes.
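
To make the sequential cost concrete, the following is a minimal sketch of a naive pileup, not the implementation described in this article. It assumes ungapped alignments given as hypothetical (start, sequence) pairs with 0-based reference coordinates, and it visits every base of every read once.

```python
from collections import Counter

def naive_pileup(reads):
    """Count observed bases per reference position.

    `reads` is an iterable of (start, sequence) pairs. This input shape is an
    assumption for illustration; real alignments also carry CIGAR operations,
    which are ignored here.
    """
    columns = {}
    for start, seq in reads:
        # Every base call of every read is visited, one after another.
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1
    return columns

# Toy input: three overlapping reads on a single contig.
reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC")]
pileup = naive_pileup(reads)
depth_at_3 = sum(pileup[3].values())
```

The running time grows with the total number of base calls, which is why this step dominates on large alignment files.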

Results: Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deployment in cloud computing environments. In particular, we implemented: (i) a custom data-partitioning algorithm optimized to work with alignment reads, (ii) a novel approach to processing alignment events from sequencing reads using the MD tags, (iii) source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We show that our approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, such as Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient, interactive analyses of population-scale genomic data.
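
As an illustration of point (ii), the sketch below shows how mismatch and deletion events can be read from an MD tag without consulting the reference sequence. It is a hedged example, not SeQuiLa's code: the function name `parse_md` and the event tuples are hypothetical, and the only thing taken as given is the MD string layout from the SAM optional-fields specification (match-run lengths, reference bases at mismatches, and `^`-prefixed deleted reference bases). Insertions are not recorded in MD and would have to come from the CIGAR string.

```python
import re

# One token is a run of matches, a deletion (^ACGT...), or a single mismatch.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")

def parse_md(md):
    """Return (events, reference_length) for an MD tag value.

    Events are hypothetical tuples of (kind, reference_offset, reference_bases),
    with offsets counted from the alignment start.
    """
    events = []
    ref_offset = 0
    for match in MD_TOKEN.finditer(md):
        run, deletion, mismatch = match.groups()
        if run is not None:
            ref_offset += int(run)           # matching bases: just advance
        elif deletion is not None:
            bases = deletion[1:]             # drop the leading '^'
            events.append(("del", ref_offset, bases))
            ref_offset += len(bases)
        else:
            events.append(("mismatch", ref_offset, mismatch))
            ref_offset += 1
    return events, ref_offset

events, ref_len = parse_md("10A5^AC6")
```

For the tag `10A5^AC6`, this yields a mismatch at offset 10 (reference base A), a two-base deletion starting at offset 16, and a reference span of 24 bases.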

Availability and implementation: https://biodatageeks.github.io/sequila/.

Figures

Fig. 1.
Reads-aware partitioning algorithm: original distributed partitions (A); read assignment (color coded) to original partitions according to alignment starting position (B); virtual partitions and their boundaries calculated by Algorithm 1 (C); coalesced partitions (D); and read assignment (color coded) to coalesced partitions and corresponding virtual partitions (E). Note that some of the reads will be processed in more than one coalesced partition. On average, this approach produces equally sized virtual partitions (no data skew), except for the first and the last, which are slightly larger and slightly smaller than the rest, respectively
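
The effect noted in the caption, where a read is processed in more than one partition, can be sketched as follows. This is not Algorithm 1 from the article: the partition boundaries here are simply given as a hypothetical sorted list of start positions, and reads are hypothetical half-open (start, end) intervals. The sketch only shows how a read that crosses a boundary ends up assigned to every partition it overlaps.

```python
from bisect import bisect_right

def assign_reads(reads, boundaries):
    """Map partition index -> list of read indices overlapping that partition.

    `boundaries` holds the sorted start position of each partition (an assumed
    input); partition i covers [boundaries[i], boundaries[i + 1]).
    """
    assignment = {i: [] for i in range(len(boundaries))}
    for read_id, (start, end) in enumerate(reads):
        # Partition containing the first and the last aligned base.
        first = max(bisect_right(boundaries, start) - 1, 0)
        last = max(bisect_right(boundaries, end - 1) - 1, 0)
        for partition in range(first, last + 1):
            assignment[partition].append(read_id)
    return assignment

boundaries = [0, 100, 200]
reads = [(10, 60), (80, 130), (150, 190), (195, 245)]
assignment = assign_reads(reads, boundaries)
```

Reads 1 and 3 cross a boundary and therefore appear in two partitions each, so each partition can compute its pileup columns without exchanging data with its neighbours.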
Fig. 2.
SeQuiLa extensions to Apache Spark Catalyst optimizer
Fig. 3.
SeQuiLa deployment on GKE with spark-on-k8s-operator with Kubernetes Custom Resource Definition, Prometheus for runtime metrics collection and Grafana as observability platform
Fig. 4.
Pileup summary function comparison. Tests were performed on a single node for ES (A), WGS (B), and on the Hadoop cluster for ES (C), WGS (D)
Fig. 5.
Impact of various optimization techniques on overall performance (percentage of time reduction) compared with the baseline (all optimizations off) for pileup computation. The first bar shows the performance gain when all optimizations are on
Fig. 6.
Depth of coverage function comparison. Tests were performed on a single node for ES (A), WGS (B), and on the Hadoop cluster for ES (C), WGS (D). SeQuiLa-pileup designates the execution time of the full pileup calculations; SeQuiLa-pileup-no-qual indicates the execution time of the simplified pileup calculations in which base qualities are not computed
