Gigascience. 2021 Sep 7;10(9):giab057. doi: 10.1093/gigascience/giab057.

VC@Scale: Scalable and high-performance variant calling on cluster environments


Tanveer Ahmad et al. GigaScience.

Abstract

Background: Recently, many new deep learning-based variant-calling methods such as DeepVariant have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. Therefore, there is a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark merely to distribute/schedule data among loosely coupled applications, or relying on disk-based storage for the output of intermediate applications, does not exploit the full benefit of Apache Spark's in-memory processing. To address this, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.

Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than 2× on the pre-processing stages, yielding a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.
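The coordinate sorting and duplicate marking that VC@Scale performs with Spark built-in functions on a cluster can be sketched in plain Python (a simplified single-process illustration, not the paper's implementation; field names are illustrative, and real MarkDuplicate logic also considers mate positions and base-quality scores):

```python
# Hypothetical aligned reads after BWA-MEM.
reads = [
    {"qname": "r3", "chrom": "chr2", "pos": 500, "is_reverse": False},
    {"qname": "r1", "chrom": "chr1", "pos": 100, "is_reverse": False},
    {"qname": "r2", "chrom": "chr1", "pos": 100, "is_reverse": False},
    {"qname": "r4", "chrom": "chr1", "pos": 250, "is_reverse": True},
]

# Sort by genomic coordinate (chromosome, then position).
reads.sort(key=lambda r: (r["chrom"], r["pos"]))

# Mark duplicates: reads sharing the same start coordinate and strand
# are duplicate candidates; keep the first, flag the rest.
seen = set()
for r in reads:
    key = (r["chrom"], r["pos"], r["is_reverse"])
    r["duplicate"] = key in seen
    seen.add(key)

print([r["qname"] for r in reads if r["duplicate"]])  # ['r2']
```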

Conclusions: We show the feasibility and scalability of our approach, which achieves high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representation. All code, scripts, and configurations used to run our implementations are open source and publicly available at https://github.com/abs-tudelft/variant-calling-at-scale.

Keywords: Apache Arrow; Apache Spark; BWA-MEM; DeepVariant; MarkDuplicate; sorting; whole-genome sequencing.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1: Single-node total runtimes for the complete variant-calling workflow using DeepVariant for different datasets.

Figure 2: A. Python programs in Spark require inefficient data serialization/deserialization between Python and JVM processes (via the Py4j library). B. Efficient data communication between frameworks/languages using the Apache Arrow unified in-memory columnar data format, with zero-copy overhead and APIs/interfaces for different languages available in a Spark cluster.

Figure 3: Performance comparison of Pandas-to-PySpark dataframe conversion with and without Arrow, and of Python UDF (row-at-a-time) versus Pandas vectorized UDF (using Apache Arrow) operations: plus one, cdf, and subtract mean.

Figure 4: Static load-balancing technique adopted in this work for BWA-MEM output, which divides chromosome-based regions to join and process them in parallel for all further workflow stages.

Figure 5: Complete design flow of the variant-calling workflow implementation in VC@Scale. The design encompasses a Slurm Spark/GCP Dataproc cluster, Lustre/GCP Filestore as the file system, Apache Arrow as the in-memory data format for pre-processing, and DeepVariant as the variant caller.

Figure 6: Scalability comparison of VC@Scale, SparkGA2, and ADAM for the pre-processing stages using different numbers of nodes for the ERR194147 (2×) dataset.

Figure 7: Scalability comparison of VC@Scale, SparkGA2, and ADAM for the pre-processing stages using different numbers of nodes for the ERR001268 dataset.

Figure 8: Single-node CPU-only and GPU-accelerated DeepVariant for the ERR194147 (30×) dataset.

Figure 9: Total runtime of the complete DeepVariant-based variant-calling workflow (VC@Scale) using the best-performing node combination. For both datasets, pre-processing (BWA-MEM, sorting, and MarkDuplicate) uses 16 nodes, while 32 nodes are used for DeepVariant.

Figure 10: VC@Scale-DeepVariant scalability for different datasets and the number of nodes used in each run.

Figure 11: GPU-accelerated VC@Scale-DeepVariant scalability for the ERR194147 (30×) dataset.

Figure 12: SparkGA2 cluster-wide system resource utilization for the pre-processing stages.

Figure 13: VC@Scale cluster-wide system resource utilization for the pre-processing stages.
