Gigascience. 2021 Sep 7;10(9):giab057. doi: 10.1093/gigascience/giab057.

VC@Scale: Scalable and high-performance variant calling on cluster environments


Tanveer Ahmad et al. GigaScience.

Abstract

Background: Recently, many new deep learning-based variant-calling methods such as DeepVariant have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. Therefore, there is a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark merely to distribute/schedule data among loosely coupled applications, or relying on disk-based storage for the output of intermediate applications, does not exploit the full benefit of Apache Spark's in-memory processing. To address this, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.

Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than 2× on the pre-processing stages, yielding a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.
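The coordinate sorting and duplicate marking that VC@Scale performs with Spark built-in functions on a cluster can be sketched in plain Python (a simplified single-process illustration, not the paper's implementation; field names are illustrative, and real MarkDuplicate logic also considers mate positions and base-quality scores):

```python
# Hypothetical aligned reads after BWA-MEM.
reads = [
    {"qname": "r3", "chrom": "chr2", "pos": 500, "is_reverse": False},
    {"qname": "r1", "chrom": "chr1", "pos": 100, "is_reverse": False},
    {"qname": "r2", "chrom": "chr1", "pos": 100, "is_reverse": False},
    {"qname": "r4", "chrom": "chr1", "pos": 250, "is_reverse": True},
]

# Sort by genomic coordinate (chromosome, then position).
reads.sort(key=lambda r: (r["chrom"], r["pos"]))

# Mark duplicates: reads sharing the same start coordinate and strand
# are duplicate candidates; keep the first, flag the rest.
seen = set()
for r in reads:
    key = (r["chrom"], r["pos"], r["is_reverse"])
    r["duplicate"] = key in seen
    seen.add(key)

print([r["qname"] for r in reads if r["duplicate"]])  # ['r2']
```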

Conclusions: We show the feasibility and scalability of our approach, which achieves high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representation. All code, scripts, and configurations used to run our implementations are open source and publicly available at https://github.com/abs-tudelft/variant-calling-at-scale.

Keywords: Apache Arrow; Apache Spark; BWA-MEM; DeepVariant; MarkDuplicate; sorting; whole-genome sequencing.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1: Single-node total runtimes for the complete variant-calling workflow using DeepVariant for different datasets.

Figure 2: A. Python programs in Spark require inefficient data serialization/deserialization between Python and JVM processes (via the Py4j library). B. Efficient data communication between frameworks/languages using the Apache Arrow unified in-memory columnar data format, with zero-copy overhead and APIs/interfaces for different languages available in a Spark cluster.

Figure 3: Performance comparison of Pandas-to-PySpark dataframe conversion with and without Arrow, and of Python UDF (row-at-a-time) versus Pandas vectorized UDF (using Apache Arrow) operations: plus one, cdf, and subtract mean.

Figure 4: Static load-balancing technique adopted in this work for BWA-MEM output, which divides chromosome-based regions to join and process them in parallel for all further workflow stages.

Figure 5: Complete design flow of the variant-calling workflow implementation in VC@Scale. The design encompasses a Slurm Spark/GCP Dataproc cluster, Lustre/GCP Filestore as the file system, Apache Arrow as the in-memory data format for pre-processing, and DeepVariant as the variant caller.

Figure 6: Scalability comparison of VC@Scale, SparkGA2, and ADAM for the pre-processing stages using different numbers of nodes for the ERR194147 (2×) dataset.

Figure 7: Scalability comparison of VC@Scale, SparkGA2, and ADAM for the pre-processing stages using different numbers of nodes for the ERR001268 dataset.

Figure 8: Single-node CPU-only and GPU-accelerated DeepVariant for the ERR194147 (30×) dataset.

Figure 9: Total runtime of the complete DeepVariant-based variant-calling workflow (VC@Scale) using the best-performing node combination. For both datasets, pre-processing (BWA-MEM, sorting, and MarkDuplicate) uses 16 nodes, while 32 nodes are used for DeepVariant.

Figure 10: VC@Scale-DeepVariant scalability for different datasets and the number of nodes used in each run.

Figure 11: GPU-accelerated VC@Scale-DeepVariant scalability for the ERR194147 (30×) dataset.

Figure 12: SparkGA2 cluster-wide system resource utilization for the pre-processing stages.

Figure 13: VC@Scale cluster-wide system resource utilization for the pre-processing stages.
