Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software

Daniel L Cameron^{1,2}, Leon Di Stefano¹, Anthony T Papenfuss^{3,4,5,6,7}

Affiliations

¹ Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Pde, Parkville, VIC, 3052, Australia.
² Department of Medical Biology, University of Melbourne, Parkville, VIC, 3010, Australia.
³ Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Pde, Parkville, VIC, 3052, Australia. papenfuss@wehi.edu.au.
⁴ Department of Medical Biology, University of Melbourne, Parkville, VIC, 3010, Australia. papenfuss@wehi.edu.au.
⁵ Peter MacCallum Cancer Centre, Victorian Comprehensive Cancer Centre, Melbourne, VIC, 3000, Australia. papenfuss@wehi.edu.au.
⁶ Sir Peter MacCallum Department of Oncology, University of Melbourne, Parkville, VIC, 3010, Australia. papenfuss@wehi.edu.au.
⁷ School of Mathematics and Statistics, University of Melbourne, Parkville, VIC, 3010, Australia. papenfuss@wehi.edu.au.

PMID: 31324872
PMCID: PMC6642177
DOI: 10.1038/s41467-019-11146-4

Daniel L Cameron et al. Nat Commun. 2019 Jul 19;10(1):3240. doi: 10.1038/s41467-019-11146-4.

Abstract

In recent years, many software packages for identifying structural variants (SVs) using whole-genome sequencing data have been released. When published, a new method is commonly compared with those already available, but this tends to be selective and incomplete. The lack of comprehensive benchmarking of methods presents challenges for users in selecting methods and for developers in understanding algorithm behaviours and limitations. Here we report the comprehensive evaluation of 10 SV callers, selected following a rigorous process and spanning the breadth of detection approaches, using high-quality reference cell lines, as well as simulations. Due to the nature of available truth sets, our focus is on general-purpose rather than somatic callers. We characterise the impact on performance of event size and type, sequencing characteristics, and genomic context, and analyse the efficacy of ensemble calling and calibration of variant quality scores. Finally, we provide recommendations for both users and methods developers.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Performance varies widely among callers, but the assembly-based callers manta and GRIDSS consistently perform well. For each caller, the precision (1 − false discovery rate) is plotted against the number of true positives/recall as the variant quality threshold varies. The read count is used as a proxy for quality when no quality score is reported. Some callers report all calls and stratify them into pass (which can be considered high confidence) and non-pass (low confidence), while others report only pass calls. Filled circles and solid lines correspond to calls with PASS or “.” in the VCF FILTER field. Open circles and dashed lines correspond to including all calls (passed or not). The ideal caller would have a dot with precision close to 100% that lies as far right as possible. Each plot corresponds to a distinct human cell line and truth set: a NA12878: 50× coverage, 2 × 101 bp, hg19, 319 bp median fragment length; b synthetic diploid CHM1/CHM13: 80× coverage, 2 × 151 bp, hg38, 345 bp median fragment length; c HG002: 60× coverage, 2 × 151 bp, hg19, 555 bp median fragment length. HG002 calls have been filtered to only regions that the truth set defines as high confidence
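The curve construction described in this caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each call has already been matched against the truth set and reduced to a hypothetical `(quality, is_true_positive)` pair, and the function name `precision_tp_curve` is invented for this example.

```python
def precision_tp_curve(calls):
    """Return (threshold, true_positives, precision) points.

    `calls` is an iterable of (quality, is_true_positive) pairs. For callers
    that report no quality score, the supporting read count could be passed
    in place of `quality` (an assumption mirroring the caption).
    """
    ordered = sorted(calls, key=lambda c: c[0], reverse=True)
    curve = []
    true_positives = 0
    for kept, (quality, is_tp) in enumerate(ordered, start=1):
        true_positives += bool(is_tp)
        # Emit one point per distinct threshold, i.e. after the last call
        # in each group of tied quality values.
        if kept == len(ordered) or ordered[kept][0] != quality:
            curve.append((quality, true_positives, true_positives / kept))
    return curve


# Toy input: five calls, three of which match the truth set.
example_calls = [(90, True), (80, True), (80, False), (40, True), (10, False)]
curve = precision_tp_curve(example_calls)
```

Each point answers the question "if only calls with at least this quality are kept, how many true positives remain and what fraction of the kept calls are true?". Dividing the true-positive count by the size of the truth set would convert the x-axis to recall.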

**Fig. 2**
Callers perform poorly near single-nucleotide variants (SNVs) and indels, near low complexity and simple tandem repeat regions, and in detecting small events. Precision vs. the number of true positives (see Fig. 1) with calls stratified by: a the number of SNVs and indels within 50 bp of the variant breakpoint; b RepeatMasker/Tandem Repeat Finder annotation of breakpoint location. Similar repeat classes have been merged for clarity; c the size of the event. Open circles and dashed lines correspond to including all calls (passed or not); filled circles and solid lines correspond to calls with PASS or “.” in the VCF FILTER field and are indicative of high-confidence calls

**Fig. 3**
Simple ensemble-based calling does not reliably improve performance. a Agreement between callers in NA12878. For each caller, true-positive calls are stratified and shaded by the number of other callers in agreement. Both the full call set (blue) and the subset passing all caller-defined filters (green) are reported for callers that report filtered variants. The top bar shows the stratification of the truth set by number of callers detecting the variant. b Agreement between callers for false positives. c Precision vs. number of true positives/recall for all possible m-of-n ensembles (grey points) compared with individual callers (larger circles). 1 of 5, 2 of 3, and 4 of 5 ensembles are highlighted in colour
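The m-of-n ensembles in panel c can be illustrated with a short sketch. It is written under a strong simplifying assumption: equivalent calls from different callers are taken to share a hashable identifier already, whereas real structural variant calls must first be matched with some breakpoint tolerance. The function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def m_of_n_calls(callsets, m):
    """Return the variants reported by at least `m` of the given call sets."""
    counts = Counter(v for calls in callsets for v in set(calls))
    return {variant for variant, n_callers in counts.items() if n_callers >= m}


def all_ensembles(callsets_by_caller, truth):
    """Score every m-of-n ensemble over every subset of callers."""
    results = []
    names = sorted(callsets_by_caller)
    for n in range(1, len(names) + 1):
        for subset in combinations(names, n):
            for m in range(1, n + 1):
                calls = m_of_n_calls([callsets_by_caller[c] for c in subset], m)
                true_positives = len(calls & truth)
                precision = true_positives / len(calls) if calls else None
                results.append((m, subset, true_positives, precision))
    return results


# Toy data: integer IDs stand in for already-matched variants.
toy_callsets = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}}
toy_truth = {1, 2, 3}
ensembles = all_ensembles(toy_callsets, toy_truth)
two_of_three = m_of_n_calls(toy_callsets.values(), 2)
```

With m = 1 the ensemble is the union of the call sets and with m = n it is their intersection, so raising m trades true positives for precision.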

**Fig. 4**
Calls with the highest read depth or quality score are often false positives. For each caller, the results for the NA12878 dataset were separated into 100 bins by either (log) read count or quality score, as indicated. For each bin, the precision (upper plot) and number of calls falling within the bin (lower plot) were calculated. Grey bars indicate 95% binomial confidence intervals for the precision. Bins with 10 or fewer calls are coloured grey and their confidence interval bars are omitted
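A per-bin precision calculation of this kind might look like the sketch below. Two details are assumptions, since the caption does not specify them: bins are taken to be equal-width over the score range, and the 95% binomial interval is computed with the Wilson score method. All names are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def binned_precision(calls, n_bins=100):
    """Summarise (score, is_true_positive) pairs in equal-width score bins.

    For read counts, pass math.log(count) as the score to bin on a log scale.
    """
    scores = [score for score, _ in calls]
    lowest, highest = min(scores), max(scores)
    width = (highest - lowest) / n_bins or 1.0  # avoid zero width
    bins = [[0, 0] for _ in range(n_bins)]  # [true positives, total calls]
    for score, is_tp in calls:
        index = min(int((score - lowest) / width), n_bins - 1)
        bins[index][0] += bool(is_tp)
        bins[index][1] += 1
    summary = []
    for index, (true_positives, total) in enumerate(bins):
        if total == 0:
            continue  # empty bins have no defined precision
        summary.append({
            "bin": index,
            "calls": total,
            "precision": true_positives / total,
            "ci": wilson_interval(true_positives, total),
            "sparse": total <= 10,  # drawn in grey without a CI bar in the figure
        })
    return summary


# Toy input: 20 low-scoring true calls and 5 high-scoring false calls.
toy_calls = [(0, True)] * 20 + [(100, False)] * 5
summary = binned_precision(toy_calls, n_bins=2)
```

The toy input mimics the caption's headline observation: the highest-scoring bin contains only false positives, and with so few calls it is flagged as sparse.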
