Sci Rep. 2019 Jun 27;9(1):9345.
doi: 10.1038/s41598-019-45835-3.

Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers

Jiayun Chen et al. Sci Rep. 2019.

Abstract

Next-generation sequencing (NGS) and its downstream analysis tools have gained popularity in scientific research and clinical diagnostic applications. A systematic comparison of sequencing platforms and variant calling pipelines could therefore provide significant guidance for NGS-based scientific and clinical genomics. In this study, we compared the performance, concordance and operating efficiency of 27 combinations of sequencing platforms and variant calling pipelines, testing three variant calling pipelines (Genome Analysis Toolkit HaplotypeCaller, Strelka2 and Samtools-Varscan2) on nine datasets for the NA12878 genome sequenced on different platforms, including BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. All 12 combinations in the WES datasets performed well in calling SNPs, with F-scores all higher than 0.96, while their performance in calling INDELs varied from 0.75 to 0.91. All 15 combinations in the WGS datasets also performed well, with F-scores in calling SNPs all higher than 0.975 and performance in calling INDELs varying from 0.71 to 0.93. All of these combinations showed high concordance in variant identification, although the divergence in variant identification was larger in the WGS datasets than in the WES datasets. We also down-sampled the original WES and WGS datasets to a series of coverage levels across multiple platforms and recorded the variant calling time consumed by the three pipelines at each coverage. For the GIAB datasets on both BGI and Illumina platforms, Strelka2 outperformed the other two pipelines in detection accuracy and processing efficiency on each sequencing platform, and is therefore recommended for the further promotion and application of next-generation sequencing technology.
The results of our research will provide useful and comprehensive guidelines for individual researchers and organizations seeking reliable and consistent variant identification.
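
The abstract reports F-scores without defining them. A minimal sketch, assuming the standard definition (the harmonic mean of precision and recall, computed from true-positive, false-positive and false-negative counts against a truth set), is shown below. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for one call set.

    tp: calls that match the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only; not results from the paper.
print(round(f_score(tp=980, fp=10, fn=20), 4))
```

Under this definition a combination only scores highly when it is both precise and sensitive, which is one plausible reading of why INDEL scores spread more widely than SNP scores.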

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
The flowchart of combinations using different sequencers and variant calling pipelines for germline variants. This workflow diagram reflects the designed comparison process for the variant calling combinations. Key processes for NGS data analysis are shown on the right. Squares in the flowchart represent data files, and rhombuses indicate processes (a rhombus with a dotted line means that the process is optional). After library preparation, samples are sequenced on multiple platforms to produce the raw datasets. The next steps are quality assessment and read alignment against a reference genome, followed by marking duplicates and sorting. Analysis-ready files from the different platforms are analyzed by the three variant calling pipelines using author-recommended parameters to generate VCF files, which are used for the final performance comparison of the different combinations.
Figure 2
Summary of the variant calling performance of multiple combinations in WES datasets. (A) Accuracy of SNP (top) and INDEL (bottom) calls for the BGISEQ500, MGISEQ2000, HiSeq4000 and NovaSeq WES datasets. Large solid circles indicate the pass threshold of each combination. (B) Box plots show the distribution of F-scores of multiple combinations in calling SNPs (top) and INDELs (bottom).
Figure 3
Intersection of the variant calling results of all combinations for SNPs and INDELs in WES datasets. The top bar plot indicates the intersection size: the number of variants called uniquely by one tool (a single point) or by several tools (two or more linked points). The bottom-left plot indicates the set size. The linked points below show which tools make up each intersection. (A) UpSetR plot of the intersection of variant calling results of all combinations for SNPs. (B) UpSetR plot of the intersection of variant calling results of all combinations for INDELs.
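
The intersection sizes plotted in such a figure can be sketched as follows. This is an illustration only, assuming each combination's calls are reduced to a set of (chrom, pos, ref, alt) keys and that intersections are exclusive (each variant is counted once, under the exact group of combinations that called it). The function name, combination labels and variants are hypothetical and are not data from the paper.

```python
from collections import Counter


def exclusive_intersections(call_sets):
    """Count variants by the exact group of combinations that called them."""
    membership = {}
    for name, variants in call_sets.items():
        for variant in variants:
            membership.setdefault(variant, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


# Hypothetical call sets keyed by (chrom, pos, ref, alt).
call_sets = {
    "platform_A+caller_1": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "platform_A+caller_2": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA")},
    "platform_B+caller_1": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}

for group, size in exclusive_intersections(call_sets).most_common():
    print(sorted(group), size)
```

A group with a single member corresponds to a single point in the plot (variants unique to one combination); larger groups correspond to linked points.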
Figure 4
Variant calling runtime of multiple combinations in WES datasets. Each combination was run on a Tianhe-2 supercomputer with 24 virtual CPUs and 88 GiB of memory. All combinations were configured to schedule tasks over all 24 virtual CPUs. The coverages of the down-sampled datasets were approximately 20X, 40X, 60X, 80X and 100X. Strelka2 (SK2) was set to 24 threads for variant calling; GATK used the default thread setting in BQSR and 24 threads in variant calling; and Samtools-Varscan2 (SV) used the default thread setting in mpileup and variant calling.
Figure 5
Summary of the variant calling performance of multiple combinations in WGS datasets. (A) Accuracy of SNP (top) and INDEL (bottom) calls for the BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and Xten WGS datasets. Large solid circles indicate the pass threshold of each combination. (B) Box plots show the distribution of F-scores of multiple combinations in calling SNPs (top) and INDELs (bottom).
Figure 6
Intersection of the variant calling results of all combinations for SNPs and INDELs in WGS datasets. The top bar plot indicates the intersection size: the number of variants called uniquely by one tool (a single point) or by several tools (two or more linked points). The bottom-left plot indicates the set size. The linked points below show which tools make up each intersection. (A) UpSetR plot of the intersection of variant calling results of all combinations for SNPs. (B) UpSetR plot of the intersection of variant calling results of all combinations for INDELs.
Figure 7
Variant calling runtime of multiple combinations in WGS datasets. The storage footprints and variant calling runtime of each combination were measured on a Tianhe-2 supercomputer with 24 virtual CPUs and 88 GiB of memory. All combinations were configured to schedule tasks over all 24 virtual CPUs. The coverages of the down-sampled datasets were approximately 6X, 12X, 18X, 24X and 30X. Strelka2 (SK2) was set to 24 threads for variant calling; GATK used the default thread setting in BQSR and 24 threads in variant calling; and Samtools-Varscan2 (SV) used the default thread setting in mpileup and variant calling.
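
Down-sampling to a target coverage is commonly done by keeping a random fraction of reads equal to the target coverage divided by the original coverage. The sketch below computes those fractions for the WGS coverage series. The original coverage of 30X is an assumption made for illustration (the page does not state it), and the function name is hypothetical.

```python
def downsample_fractions(original_coverage, targets):
    """Return the fraction of reads to keep for each target coverage."""
    fractions = {}
    for target in targets:
        if target > original_coverage:
            raise ValueError(f"target {target}X exceeds original coverage")
        fractions[target] = target / original_coverage
    return fractions


# ASSUMPTION: an original WGS coverage of 30X, chosen only for illustration.
fractions = downsample_fractions(30.0, [6, 12, 18, 24, 30])
for target, fraction in fractions.items():
    print(f"{target}X -> keep {fraction:.2f} of reads")
```

The resulting fractions would then be passed to whichever read-subsampling tool is used; the page does not say which tool the authors chose.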
