BMC Bioinformatics. 2021 Aug 13;22(1):402.
doi: 10.1186/s12859-021-04317-y.

OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow


Jochen Bathke et al. BMC Bioinformatics.

Abstract

Background: The advent of next generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences between a target dataset and a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" is a commonly referenced recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore the ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations run counter to the goal of offering a standard workflow and hamper reproducibility over time.

Results: A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half.
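As a rough illustration of what such Java settings look like in practice, the following minimal sketch assembles a GATK 4 command line that caps the JVM heap (`-Xmx`) and the number of parallel garbage collection threads (`-XX:ParallelGCThreads`) through GATK's `--java-options` argument. The helper `gatk_command`, the file names, and the chosen values (2 GB heap, 2 GC threads) are illustrative assumptions and are not the settings reported by the authors.

```python
import shlex


def gatk_command(tool, tool_args, heap_gb, gc_threads):
    """Build a GATK 4 command with explicit JVM heap and GC-thread limits.

    Hypothetical helper for illustration; not part of OVarFlow.
    """
    # -Xmx caps the Java heap; -XX:ParallelGCThreads caps the number of
    # threads used by the JVM's parallel garbage collector.
    java_options = f"-Xmx{heap_gb}g -XX:ParallelGCThreads={gc_threads}"
    return ["gatk", "--java-options", java_options, tool, *tool_args]


# File names and numeric values below are placeholders (assumptions).
cmd = gatk_command(
    "HaplotypeCaller",
    ["-R", "reference.fasta", "-I", "sample.bam", "-O", "sample.vcf.gz"],
    heap_gb=2,
    gc_threads=2,
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Keeping the JVM options in one place per tool makes it straightforward to vary heap size and GC thread count independently, which is the kind of parameter sweep the benchmarks in Figs. 2 and 3 describe.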

Conclusions: The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task. By reducing the usage of computational resources, the workflow lowers existing entry barriers to the variant calling field and enables standardized variant calling.

Keywords: Benchmarking; Data parallelization; GATK; Java; Next generation sequencing; Reproducibility; SNP; Variant calling; indel.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Flowchart of the variant calling workflow. The variant calling workflow consists of two separate branches. A basic workflow already generates a set of usable, functionally annotated variants (SNPs and indels). A second, optional workflow uses the previously called variants to perform Base Quality Score Recalibration (BQSR) to improve the initial base calls of the fastq files. Processing of each individual's fastq files can be performed in parallel. Various steps of the workflow can also be parallelized, e.g. variant calling on genomic intervals by the GATK HaplotypeCaller, as indicated by overlapping boxes. Each box includes a description of the step (light gray), the name of the application used (medium gray) and the primary input and output data formats (dark gray)
Fig. 2
Resource usage benchmarking of GATK applications at different Java garbage collection thread counts. The performance of some GATK applications is severely influenced by the number of employed Java garbage collection (GC) threads. Each application was executed several times with different Java GC thread counts, in order to identify GC thread counts that result in minimal resource utilization. Here, the Java 8 default parallel garbage collector was used. Resource usage concerning wall time, system time and resident set size (memory usage) was analyzed (see rows) for the four tools SortSam, MarkDuplicates, HaplotypeCaller and GatherVcfs (see columns) (GATK version 4.1.9). Triplicate measurements for each of eight different GC thread counts (1, 2, 4, 6, 8, 12, 16 and 20) were recorded and the resulting mean values plotted as lines. Lower measured values are preferable as they reflect a lower resource usage of the respective application. Runtime comparisons between different applications should not be performed here. The ordinate scales of individual plots vary greatly, to represent variances within an application as clearly as possible. Furthermore, SortSam, MarkDuplicates and GatherVcfs analyzed an entire dataset, while the HaplotypeCaller was limited to the analysis of chromosome 6 (NC_006093.5), thereby reducing the runtime from days to some hours
Fig. 3
Influence of different Java heap sizes on the resource utilization of individual GATK applications. Besides the number of Java garbage collection threads, the provided heap size has a considerable impact on the performance of some GATK applications. Again, the four tools SortSam, MarkDuplicates, HaplotypeCaller and GatherVcfs (see columns) (GATK version 4.1.9) were assessed for their respective resource usage in terms of wall time, system time and memory usage (see rows). The intention was to identify Java heap sizes that result in minimized resource utilization. Therefore, lower readings on the ordinate are preferable as they reflect lower resource consumption of the respective application. Triplicate measurements were recorded for each of ten different values for Java heap size (1, 2, 4, 6, 8, 12, 16, 24, 32 and 48 GB) and the resulting mean values plotted as lines. The gray line in the resident set size plots indicates parity between the maximum allowed heap size and the actual memory usage. All measurements were recorded with two garbage collection threads enabled. As in Fig. 2, the different ordinate scales have to be taken into account, since they vary considerably between the individual plots. In addition, the HaplotypeCaller was again limited to the analysis of chromosome 6 (NC_006093.5)
Fig. 4
Resource consumption of the basic workflow with increasing optimization levels. a CPU and memory utilization of the entire workflow, using a single interval (comprising the entire genome) for the HaplotypeCaller and without any Java optimization (total runtime: 67.1 h). Four phases, each dominated by individual applications, can be distinguished within the workflow (separated by dashed lines). b When the genome is split into six separate intervals for the HaplotypeCaller analysis, but without any Java optimization (41.4 h). c With optimized Java garbage collection for each GATK application (39.8 h). d With optimized Java settings (garbage collection and heap size) for all GATK applications and four default threads for the native pairHMM algorithm of the HaplotypeCaller (40.3 h). e When all optimizations are applied to the workflow, including six parallel intervals for variant calling by the HaplotypeCaller, a single hmmThread for each HaplotypeCaller, and all Java optimizations (garbage collection and heap size) (34.7 h)
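
The total runtimes quoted in the Fig. 4 caption can be put side by side with a short calculation. The sketch below takes the five reported values (panels a to e) and derives the speedup and the runtime reduction relative to the unoptimized single-interval run; the variable and function names are illustrative assumptions, while the runtimes are those stated in the caption.

```python
# Total workflow runtimes in hours, as stated in the Fig. 4 caption.
runtimes_h = {
    "a": 67.1,  # single interval, no Java optimization
    "b": 41.4,  # six intervals, no Java optimization
    "c": 39.8,  # optimized garbage collection
    "d": 40.3,  # optimized GC and heap size, four pairHMM threads
    "e": 34.7,  # all optimizations, one hmmThread per HaplotypeCaller
}

baseline_h = runtimes_h["a"]


def relative_gain(runtime_h, reference_h):
    """Return (speedup factor, runtime reduction in percent) vs. a reference."""
    speedup = reference_h / runtime_h
    reduction_pct = (1.0 - runtime_h / reference_h) * 100.0
    return speedup, reduction_pct


# Map each panel to its (speedup, reduction) pair relative to panel a.
summary = {
    panel: relative_gain(hours, baseline_h)
    for panel, hours in runtimes_h.items()
}

for panel, (speedup, reduction_pct) in summary.items():
    print(f"{panel}: {speedup:.2f}x faster, {reduction_pct:.1f}% less wall time")
```

For panel e this gives a speedup of roughly 1.93 relative to panel a, i.e. a runtime reduction of about 48%, which matches the abstract's statement that the analysis time is effectively cut in half.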
