Genome Med. 2013 Mar 27;5(3):28.
doi: 10.1186/gm432. eCollection 2013.

Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing

Jason O'Rawe et al. Genome Med.

Abstract

Background: To facilitate the clinical implementation of genomic medicine by next-generation sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be.

Methods: We sequenced 15 exomes from four families using commercial kits (Illumina HiSeq 2000 platform and Agilent SureSelect version 2 capture kit), with approximately 120X mean coverage. We analyzed the raw data using near-default parameters with five different alignment and variant-calling pipelines (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMtools). We additionally sequenced a single whole genome using the sequencing and analysis pipeline from Complete Genomics (CG), with 95% of the exome region being covered by 20 or more reads per base. Finally, we validated 919 single-nucleotide variations (SNVs) and 841 insertions and deletions (indels), including similar fractions of GATK-only, SOAP-only, and shared calls, on the MiSeq platform by amplicon sequencing with approximately 5000X mean coverage.

Results: SNV concordance between five Illumina pipelines across all 15 exomes was 57.4%, while 0.5 to 5.1% of variants were called as unique to each pipeline. Indel concordance was only 26.8% between three indel-calling pipelines, even after left-normalizing and intervalizing genomic coordinates by 20 base pairs. Of the CG variants falling within the regions targeted in exome sequencing, 11% were not called by any of the Illumina-based exome analysis pipelines. Based on targeted amplicon sequencing on the MiSeq platform, 97.1%, 60.2%, and 99.1% of the GATK-only, SOAP-only, and shared SNVs could be validated, respectively, but only 54.0%, 44.6%, and 78.1% of the GATK-only, SOAP-only, and shared indels could be validated. Additionally, our analysis of two families (one with four individuals and the other with seven) demonstrated the additional accuracy gained in variant discovery by having access to genetic data from a multi-generational family.

Conclusions: Our results suggest that more caution should be exercised in genomic medicine settings when analyzing individual genomes, including interpreting positive and negative findings with scrutiny, especially for indels. We advocate for renewed collection and sequencing of multi-generational families to increase the overall accuracy of whole genomes.

Figures

Figure 1
Mean single-nucleotide variant (SNV) concordance over 15 exomes between five alignment and variant-calling pipelines. Each pipeline is annotated in shorthand as the alignment method followed by the SNV-calling algorithm: BWA-GATK, SOAP-Align-SOAPsnp, BWA-SNVer, BWA-SAMtools, and GNUMAP-GNUMAP. (A) Mean SNV concordance between each pipeline was determined by matching the genomic coordinate as well as the base-pair change and zygosity for each detected SNV. (B) The same analysis as in (A), but filtered to include only SNVs already found in dbSNP135. (C) The same analysis as in (A), but filtered to include only novel SNVs (that is, SNVs not found in dbSNP135).
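The matching rule in the Figure 1 caption can be illustrated with set operations. The following is a minimal sketch, not the authors' code: it assumes each call is keyed by chromosome, position, reference base, alternate base, and zygosity, and it assumes concordance is reported as the calls shared by every pipeline divided by the union of all calls, which the caption does not state explicitly. The `SNV` type, the function names, and the toy calls are hypothetical.

```python
from typing import NamedTuple


class SNV(NamedTuple):
    """Hypothetical key for one call: coordinate, base change, and zygosity."""
    chrom: str
    pos: int
    ref: str
    alt: str
    zygosity: str  # "het" or "hom"


def concordance(call_sets: dict[str, set[SNV]]) -> float:
    """Calls reported by every pipeline, as a fraction of the union of all calls."""
    sets = list(call_sets.values())
    if not sets:
        return 0.0
    union = set().union(*sets)
    shared = set.intersection(*sets)
    return len(shared) / len(union) if union else 0.0


def unique_fraction(call_sets: dict[str, set[SNV]], name: str) -> float:
    """Calls made only by the named pipeline, as a fraction of the union."""
    union = set().union(*call_sets.values())
    others = set().union(*(s for n, s in call_sets.items() if n != name))
    return len(call_sets[name] - others) / len(union) if union else 0.0


# Toy data (invented for illustration, not from the study).
a = SNV("chr1", 100, "A", "G", "het")
b = SNV("chr1", 200, "C", "T", "hom")
b_het = SNV("chr1", 200, "C", "T", "het")  # same site and change, different zygosity
c = SNV("chr2", 50, "G", "A", "het")

calls = {
    "BWA-GATK": {a, b, c},
    "SOAP-Align-SOAPsnp": {a, b_het},
    "BWA-SAMtools": {a, b},
}

print(concordance(calls))                            # 0.25
print(unique_fraction(calls, "SOAP-Align-SOAPsnp"))  # 0.25
```

Because zygosity is part of the key, the heterozygous and homozygous calls at chr1:200 count as discordant even though both pipelines report a C>T change at that site.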
Figure 2
Single-nucleotide variant (SNV) concordance between two sequencing pipelines (Illumina and Complete Genomics (CG)) for a single exome, k8101-49685. For the Illumina sequencing, exons were captured using the Agilent SureSelect version 2 panel of capture probes. CG SNVs consisted of the subset of all SNVs called by CG that fell within the Agilent SureSelect version 2 exons. Concordance was determined by matching the genomic coordinates, base-pair composition, and zygosity status for each detected SNV. Illumina SNVs consisted of all SNVs (the union) called by the five variant-calling pipelines GATK, SAMtools, SOAPsnp, SNVer, and GNUMAP, which increased the false positives but decreased the false negatives. Concordance was measured between Illumina SNVs and (A) all CG SNVs, (C) only high-quality (VQHIGH) CG SNVs, and (D) only low-quality (VQLOW) CG SNVs. (B) Genome mappability analyses were performed on 2,085 discordant SNVs, which were found by the CG pipeline and not found by any of the five Illumina data pipelines.
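The union-then-compare step described in the Figure 2 caption might look like the sketch below. It is an illustration under stated assumptions, not the study's pipeline: calls are represented as plain tuples of (chromosome, position, reference, alternate, zygosity), and the function name `compare_to_union` and all toy calls are hypothetical.

```python
Call = tuple[str, int, str, str, str]  # (chrom, pos, ref, alt, zygosity)


def compare_to_union(
    pipeline_calls: dict[str, set[Call]], cg_calls: set[Call]
) -> dict[str, int]:
    """Pool the Illumina pipelines into one union set, then partition against CG."""
    illumina_union = set().union(*pipeline_calls.values())
    return {
        "shared": len(illumina_union & cg_calls),
        "cg_only": len(cg_calls - illumina_union),
        "illumina_only": len(illumina_union - cg_calls),
    }


# Toy data (invented for illustration, not from the study).
v1 = ("chr1", 100, "A", "G", "het")
v2 = ("chr1", 200, "C", "T", "hom")
v3 = ("chr2", 50, "G", "A", "het")
v4 = ("chr3", 75, "T", "C", "het")

illumina = {"GATK": {v1, v2}, "SAMtools": {v2, v3}, "SOAPsnp": {v1}}
cg = {v2, v3, v4}

print(compare_to_union(illumina, cg))
# {'shared': 2, 'cg_only': 1, 'illumina_only': 1}
```

Taking the union means a CG call counts as discordant only if no Illumina pipeline reports it, which is why the caption notes that this choice lowers false negatives at the cost of more false positives.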
Figure 3
Mean indel concordance over 15 exomes between 3 indel-calling pipelines: GATK, SOAPindel, and SAMtools. Mean concordance was measured between (A) all indels, (B) known indels (indels found in dbSNP135), and (C) unknown indels (indels not found in dbSNP135). Indels were left normalized and intervalized to a range of 20 genomic coordinates (10 coordinates on each side of the normalized position) to allow for a reasonably standardized indel metric for comparison. To determine whether or not indels were matching, the genomic coordinates as well as the base length and composition of each indel were considered.
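The indel-matching rule in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes positions have already been left-normalized (normalization itself needs the reference sequence and is not shown), and it assumes two calls match when they agree on chromosome, indel type, and sequence (and therefore length) and their normalized positions lie within 10 coordinates of each other. The `Indel` type, `same_indel`, and `matched` are hypothetical names.

```python
from typing import NamedTuple

WINDOW = 10  # coordinates allowed on each side of the normalized position


class Indel(NamedTuple):
    chrom: str
    pos: int   # left-normalized position (assumed to be computed upstream)
    kind: str  # "ins" or "del"
    seq: str   # inserted or deleted bases; length and composition both matter


def same_indel(a: Indel, b: Indel, window: int = WINDOW) -> bool:
    """True if the two calls agree on type and sequence and sit within the window."""
    return (
        a.chrom == b.chrom
        and a.kind == b.kind
        and a.seq == b.seq
        and abs(a.pos - b.pos) <= window
    )


def matched(calls_a: list[Indel], calls_b: list[Indel]) -> list[Indel]:
    """Calls from the first set that have at least one match in the second set."""
    return [a for a in calls_a if any(same_indel(a, b) for b in calls_b)]


# Toy data (invented for illustration, not from the study).
gatk = [
    Indel("chr1", 1000, "del", "AT"),
    Indel("chr1", 5000, "ins", "G"),
    Indel("chr2", 300, "del", "CCT"),
]
soapindel = [
    Indel("chr1", 1007, "del", "AT"),  # same deletion, 7 coordinates away
    Indel("chr1", 5020, "ins", "G"),   # same insertion, but outside the window
    Indel("chr2", 300, "del", "CCA"),  # same position, different composition
]

print(matched(gatk, soapindel))
```

In this toy example only the chr1:1000 deletion is matched; the other two pairs fail on distance and on base composition, respectively.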
Figure 4
Indel concordance for a single exome, k8101-49685, between two sequencing pipelines: Illumina and Complete Genomics (CG). Illumina indels consist of a union of all indels called by each of the three indel-calling pipelines GATK, SOAPindel, and SAMtools. CG indels consisted of a subset of indels called by CG that fell within the Agilent SureSelect version 2 exons. Both Illumina and CG indels were left normalized and intervalized to a range of 20 genomic coordinates (10 coordinates on each side of the normalized position). To determine whether or not indels were matching, the genomic coordinates as well as the base length and composition of each indel were considered.
Figure 5
MiSeq validation experiment on a subset of Illumina-data calls. A total of 1,140 SNVs from sample k8101-49685 were randomly sampled for MiSeq validation, with 380 sampled from the set of unique-to-GATK SNVs, 380 sampled from the set of unique-to-SOAPsnp SNVs, and 380 sampled from the set that overlapped between these two pipelines. Of these variants, 919 (81.0%) were successfully amplified and sequenced. BWA version 0.6.2 and GATK version 2.3-9 were used to process the sequencing data for variant calling. Additionally, 960 indels from sample k8101-49685 were randomly selected for validation. Of these, 386 were randomly selected from the unique-to-GATK indel set, 387 were randomly selected from the unique-to-SOAPindel set, and 187 were randomly selected from the set of indels overlapping between the two (SOAPindel and GATK). Of these indels, 841 (83.5%) were successfully amplified and sequenced. BWA version 0.6.2 and GATK version 2.3-9 were used to determine the number of variants that were also successfully validated across these sets.
