Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

Xiaobo Sun et al. GigaScience. 2018 Jun 1;7(6):giy052. doi: 10.1093/gigascience/giy052.

Abstract

Background: Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine methods become increasingly inefficient when processing large numbers of files because of excessive computation time and input/output (I/O) bottlenecks. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarking them against traditional single-process and parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools.

Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or provide much better strong and weak scalability than traditional methods. Our findings provide generalized, scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.


Figures

Figure 1:
Merging multiple VCF files into a single TPED file. The tables on the left represent input VCF files; the table on the right represents the merged TPED file. Records are filtered out if their Filter value is not equal to “PASS” (Pos 10 147). Individual genotypes from multiple VCF files with the same genomic location are aggregated together in one row. The resulting TPED file thus contains an inclusive, sorted set of the genomic locations of all variants found in the input VCF files.
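
For concreteness, the transformation in Figure 1 can be illustrated with a minimal single-machine Python sketch. This is not the paper's implementation; it assumes simplified single-sample VCFs (GT sub-field only), hypothetical input file names matched by *.vcf, lexicographic chromosome ordering, and "0 0" as the missing-genotype code.

    import glob
    from collections import defaultdict

    def vcf_records(path):
        """Yield (chrom, pos, ref, alt, genotype) from a simplified single-sample VCF."""
        with open(path) as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
                if filt != "PASS":              # drop records whose Filter value is not "PASS"
                    continue
                gt = fields[9].split(":")[0]    # GT sub-field of the single sample
                yield chrom, int(pos), ref, alt, gt

    # Aggregate genotypes from all input VCF files by genomic location.
    files = sorted(glob.glob("*.vcf"))          # hypothetical input file names
    merged = defaultdict(dict)                  # (chrom, pos) -> {file: (ref, alt, gt)}
    for path in files:
        for chrom, pos, ref, alt, gt in vcf_records(path):
            merged[(chrom, pos)][path] = (ref, alt, gt)

    # Emit one TPED row per location, sorted by (chromosome, position); subjects
    # without a call at a location receive the missing genotype "0 0".
    with open("merged.tped", "w") as out:
        for chrom, pos in sorted(merged):
            calls = merged[(chrom, pos)]
            genotypes = []
            for path in files:
                if path not in calls:
                    genotypes.append("0 0")
                else:
                    ref, alt, gt = calls[path]
                    alleles = [ref if a == "0" else alt for a in gt.replace("|", "/").split("/")]
                    genotypes.append(" ".join(alleles))
            out.write(f"{chrom} {chrom}:{pos} 0 {pos} " + " ".join(genotypes) + "\n")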
Figure 2:
The workflow chart of the MapReduce schema. The workflow is divided into two phases. In the first phase, variants are filtered, grouped by chromosomes into bins, and mapped into key-value records. Two sampling steps are implemented to generate partition lists of all chromosomes. In the second phase, parallel jobs of specified chromosomes are launched. Within each job, records from corresponding bins are loaded, partitioned, sorted, and merged by genomic locations before being saved into a TPED file.
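
A much-simplified Hadoop Streaming-style mapper/reducer pair in Python conveys the core idea of the two phases: the map side filters and emits location-keyed records, and the shuffle delivers them in key order so the reduce side can merge records sharing a genomic location into one TPED-style row. The sampling, chromosome binning, and custom partitioning described above are omitted, and the sample identifier is a hypothetical placeholder.

    import sys

    def mapper():
        """Phase 1 (sketch): filter VCF lines and emit (chrom:pos, sample|genotype) pairs."""
        for line in sys.stdin:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, filt = fields[0], fields[1], fields[6]
            if filt != "PASS":
                continue
            sample = "S1"                        # placeholder; a real mapper derives this from the file name
            gt = fields[9].split(":")[0]
            print(f"{chrom}:{int(pos):010d}\t{sample}|{gt}")

    def reducer():
        """Phase 2 (sketch): keys arrive sorted after the shuffle, so records sharing
        a genomic location are adjacent and can be merged into one TPED-style row."""
        current, genotypes = None, []
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print(current + "\t" + "\t".join(genotypes))
                current, genotypes = key, []
            genotypes.append(value)
        if current is not None:
            print(current + "\t" + "\t".join(genotypes))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)()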
Figure 3:
The workflow chart of the HBase schema. The workflow is divided into three phases. The first is a sampling, filtering, and mapping phase: a MapReduce job samples variants whose genomic positions are used as region boundaries when creating the HBase table, and only qualified records are mapped to key-value pairs and saved as Hadoop sequence files. The second is the HBase bulk loading phase, in which a MapReduce job loads and writes the records generated in the previous phase, aggregating them into the corresponding regional HFiles in the form of HBase's row key and column families. Finished HFiles are moved into HBase data storage folders on the region servers. In the third phase, parallel scans are launched over regions of the whole table to retrieve the desired records, which are subsequently merged and exported to the TPED file.
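
The sketch below illustrates the two ideas the HBase phases rely on, a row key whose lexicographic order matches genomic order and a bounded region scan, using the Python happybase client. It is an assumption-laden stand-in rather than the paper's MapReduce bulk-loading pipeline; the host, table name, and column layout are hypothetical.

    import happybase                             # thin Thrift client; assumes an HBase Thrift server is reachable

    def row_key(chrom, pos):
        """Hypothetical row-key layout: zero-padded chromosome and position so that
        HBase's lexicographic row ordering matches genomic order."""
        return f"{chrom.zfill(2)}:{pos:010d}".encode()

    def scan_region(host, start, stop):
        """Sketch of one of the parallel phase-3 scans: retrieve all variant rows in
        [start, stop) and merge their per-subject genotype columns into TPED rows."""
        table = happybase.Connection(host).table("variants")     # hypothetical table name
        for key, columns in table.scan(row_start=start, row_stop=stop):
            chrom, pos = key.decode().split(":")
            genotypes = [v.decode() for _, v in sorted(columns.items())]   # one column per subject
            yield f"{chrom} {chrom}:{int(pos)} 0 {int(pos)} " + " ".join(genotypes)

    # Example: scan one region whose boundaries came from the sampling phase.
    # for tped_row in scan_region("hbase-master", row_key("1", 0), row_key("1", 50000000)):
    #     print(tped_row)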
Figure 4:
The workflow chart of the Spark schema. The workflow is divided into three stages. In the first stage, VCF records are loaded, filtered, and mapped to pairRDDs with keys of genomic position and values of genotype. The sort-by-key shuffling spans across the first two stages, sorting and grouping together records by keys. Then, grouped records with the same key are locally merged into one record in TPED format. Finally, merged records are exported to the TPED file.
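
A condensed PySpark sketch of the three stages, assuming hypothetical HDFS paths and a placeholder sample identifier; for clarity it uses separate groupByKey and sortByKey shuffles rather than the single sort-by-key shuffle described above.

    from pyspark import SparkContext

    sc = SparkContext(appName="vcf-sorted-merge")          # cluster configuration omitted

    def to_pair(line):
        """Map a VCF data line to ((chrom, pos), (sample, genotype)); the sample id is a placeholder."""
        fields = line.split("\t")
        return (fields[0], int(fields[1])), ("sampleA", fields[9].split(":")[0])

    # Stage 1: load, filter, and map VCF records to pairRDDs (paths are hypothetical).
    pairs = (sc.textFile("hdfs:///vcf/*.vcf")
               .filter(lambda l: not l.startswith("#") and l.split("\t")[6] == "PASS")
               .map(to_pair))

    # Stages 1-2: shuffle so that records sharing a genomic location end up together
    # and in key order, then merge each group locally into one TPED-style record.
    tped = (pairs.groupByKey()
                 .sortByKey()
                 .map(lambda kv: "{0} {0}:{1} 0 {1} {2}".format(
                     kv[0][0], kv[0][1],
                     " ".join(gt for _, gt in sorted(kv[1])))))

    # Stage 3: export the merged records.
    tped.saveAsTextFile("hdfs:///out/merged_tped")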
Figure 5:
The execution plan of the HPC-based implementation. The execution plan resembles a branched tree. In the first round, each process is assigned an approximately equal number of files to merge locally. In the second round, each even-numbered process retrieves the merged file of its right adjacent process and merges it with its own local merged file. In the third round, processes whose rank is divisible by 4 retrieve the merged file of their right adjacent process from the second round and perform the merge. This continues recursively until all files are merged into a single TPED file (round four).
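
A minimal mpi4py sketch of this branched-tree plan, under the assumption that each rank's round-one output is an in-memory sorted list rather than a file; the record contents and tags are illustrative only. Run with, e.g., mpiexec -n 8 python tree_merge.py.

    from heapq import merge
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Stand-in for round one: each process holds its locally merged, sorted records.
    local = [((rank, i), f"record-{rank}-{i}") for i in range(3)]

    step = 1
    while step < size:
        if rank % (2 * step) == 0:               # receivers: ranks divisible by 2, then 4, then 8, ...
            partner = rank + step
            if partner < size:
                received = comm.recv(source=partner, tag=step)
                local = list(merge(local, received))     # two-way sorted merge per round
        elif rank % step == 0:                   # senders: hand the merged data to the left neighbour
            comm.send(local, dest=rank - step, tag=step)
            break
        step *= 2

    if rank == 0:
        print(f"rank 0 holds {len(local)} merged records")   # these would be written to the TPED file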
Figure 6:
The scalability of Apache cluster-based schemas with respect to input data size. As the number of input files increases from 10 to 186, the time costs of all three schemas with 12, 24, or 72 cores increase at a slower pace than the input data size, especially when the number of cores is relatively large. The HBase schema with 12 cores has the largest increase (from 375 to 5479 seconds, ∼14.6-fold).
Figure 7:
Comparing the strong scalability of traditional parallel/distributed methods and Apache cluster-based schemas. We fix the number of files at 93 and increase the number of nodes/cores. The baseline for the parallel multiway-merge is a single core, while for the others it is a single node (four cores). All methods/schemas show degraded efficiency as computing resources increase 16-fold from the baseline. Specifically, the efficiency of the MapReduce-, HBase-, and Spark-based schemas drops to 0.83, 0.63, and 0.61, respectively, while the efficiency of the parallel multiway-merge and HPC-based implementations drops to 0.06 and 0.53, respectively.
Figure 8:
Comparing the weak scalability of traditional parallel/distributed methods and Apache cluster-based schemas. We simultaneously increase the number of cores and the input data size while fixing the ratio of files per core (parallel multiway-merge) or files per node (all others) at 10. The baseline is the same as in the strong scalability test. All but the MapReduce-based schema show degraded efficiency, with the HPC-based implementation degrading most steeply. Specifically, when computing resources increase 16-fold from the baseline, the efficiency of the MapReduce-, HBase-, and Spark-based schemas changes to 3.1, 0.87, and 0.75, respectively, while the efficiency of the parallel multiway-merge and HPC-based implementations falls to 0.42 and 0.35, respectively.
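
The efficiency values quoted in Figures 7 and 8 are not defined in this excerpt; the standard parallel-efficiency formulas below are one plausible reading, where T_1 is the baseline run time and T_p the run time with p times the baseline resources (p = 16 at the right end of both figures).

    % Strong scaling (Figure 7): total workload fixed, resources grow p-fold.
    E_{\mathrm{strong}}(p) = \frac{T_1}{p \, T_p}

    % Weak scaling (Figure 8): workload grows in proportion to the resources.
    E_{\mathrm{weak}}(p) = \frac{T_1}{T_p}

Under this reading, a strong-scaling efficiency of 0.83 at p = 16 corresponds to a roughly 13-fold speedup over the baseline, and a weak-scaling efficiency above 1 (the MapReduce value of 3.1) would mean the 16-fold larger workload finished in less time than the baseline run.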
Figure 9:
The performance anatomy of cluster-based schemas with increasing input data size. The number of cores in these experiments is fixed at 48. The time costs of all phases of the three schemas scale linearly or sublinearly with the input data size. (a) MapReduce schema: the two MapReduce phases have comparable time costs, increasing 6.3- and 3.1-fold, respectively, as the number of input files increases from 10 to 186. (b) HBase schema: the time spent in each phase increases 4.2-, 5.6-, and 5.0-fold, respectively, as the number of input files increases from 10 to 186. The bulk loading and exporting phases together take up more than 80% of the total time. (c) Spark schema: the time costs of the three stages increase 5.8-, 6.0-, and 6.0-fold, respectively, as the number of input files increases from 10 to 186. Like the HBase schema, the first two stages of the Spark schema together account for more than 80% of the total time cost.
Figure 10:
Execution speed comparison between Apache cluster-based schemas and traditional methods. First, we compare the speeds of the three Apache schemas with those of three traditional methods: single-process multiway-merge, parallel multiway-merge, and the HPC-based implementation. As the number of input files increases from 10 to 186, the speeds of the Apache cluster-based schemas improve much more significantly than those of the traditional methods. The numbers in the figure indicate the ratio of the time cost of each traditional method to that of the fastest Apache cluster-based schema. Second, we compare the processing speed among the three Apache cluster-based schemas, which are comparable to one another regardless of the input data size. The MapReduce schema performs best when merging 10 and 186 files, the HBase schema performs best when merging 20, 40, and 60 files, and the Spark schema performs best when merging 93 files.
