Cancers (Basel). 2021 Dec 14;13(24):6283.
doi: 10.3390/cancers13246283.

A Comparison of Tools for Copy-Number Variation Detection in Germline Whole Exome and Whole Genome Sequencing Data

Migle Gabrielaite et al. Cancers (Basel). 2021.

Abstract

Copy-number variations (CNVs) have important clinical implications for several diseases and cancers. Relevant CNVs are hard to detect because common structural variations define large parts of the human genome. CNV calling from short-read sequencing would allow full genomic profiling with a single protocol. We reviewed 50 popular CNV calling tools and included 11 of them in a benchmark on a reference cohort of 39 whole genome sequencing (WGS) samples paired with the current clinical standard, SNP-array-based CNV calling. For nine samples we also performed whole exome sequencing (WES) to address the effect of sequencing protocol on CNV calling. Furthermore, we included the Gold Standard reference sample NA12878 and tested 12 samples with CNVs confirmed by multiplex ligation-dependent probe amplification (MLPA). Tool performance varied greatly in the number of called CNVs and in bias for CNV lengths. Some tools had near-perfect recall of CNVs from arrays for some samples, but poor precision. Several tools performed better for NA12878, which could be a result of overfitting. We suggest combining the best-performing tools that are based on different methodologies: GATK gCNV, Lumpy, DELLY, and cn.MOPS. The use of background panels to filter frequently called variants could potentially help reduce the total number of called variants.

Keywords: benchmark; bioinformatics; copy-number variation (CNV); structural variant; whole exome sequencing (WES); whole genome sequencing (WGS).
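
As an illustration of the recall and precision measures mentioned in the abstract, the following is a minimal sketch of how CNV calls might be scored against a truth set such as array-based calls. The abstract does not state the matching rule used in the paper, so the 50% reciprocal-overlap criterion, the `CNV` record, and the function names here are assumptions made for the example only.

```python
from typing import NamedTuple, Sequence, Tuple


class CNV(NamedTuple):
    """Hypothetical CNV record; coordinates are half-open [start, end)."""
    chrom: str
    start: int
    end: int
    kind: str  # "DEL" or "DUP"


def overlaps(a: CNV, b: CNV, min_reciprocal: float = 0.5) -> bool:
    """True if a and b share chromosome and type and overlap reciprocally.

    The reciprocal-overlap threshold is an assumption, not the paper's rule.
    """
    if a.chrom != b.chrom or a.kind != b.kind:
        return False
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return False
    return (shared / (a.end - a.start) >= min_reciprocal
            and shared / (b.end - b.start) >= min_reciprocal)


def precision_recall(calls: Sequence[CNV], truth: Sequence[CNV],
                     min_reciprocal: float = 0.5) -> Tuple[float, float]:
    """Precision: fraction of calls matching any true CNV.
    Recall: fraction of true CNVs matched by any call."""
    matched_calls = sum(
        any(overlaps(c, t, min_reciprocal) for t in truth) for c in calls)
    matched_truth = sum(
        any(overlaps(t, c, min_reciprocal) for c in calls) for t in truth)
    precision = matched_calls / len(calls) if calls else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = [CNV("chr1", 100, 200, "DEL"), CNV("chr2", 500, 900, "DUP")]
    calls = [CNV("chr1", 110, 210, "DEL"), CNV("chr1", 1000, 1100, "DEL"),
             CNV("chr3", 0, 50, "DUP"), CNV("chr2", 0, 10, "DUP")]
    print(precision_recall(calls, truth))
```

A tool that emits many calls can match every true CNV (high recall) while most of its calls match nothing (low precision), which is the pattern the abstract describes for some tools.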

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Schematic visualization of different approaches for calling CNVs from NGS data. RD detects local differences in read depth, SR detects unmatched read pairs, RP detects decreased insert size or swapped read directions between read pairs, and AS performs de novo assembly to best explain the read distribution.
Figure 2
Overview of the methods each CNV calling tool applies, the input NGS data, the number of citations from Google Scholar, and the latest available version of each tool as of March 2019. Tools in bold font are included in the benchmark; the horizontal red line shows the cutoff for the citation number.
Figure 3
(A) Number of duplications and deletions called by CNV calling tools in WES and WGS data for the NA12878 sample. (B) Number of CNVs called by all tools in WES and WGS data for the NA12878 sample, colored by length. (C) Box plots and scatter plots of recall and precision results for 11 CNV calling tools.
Figure 4
Recall and precision curves for GB01-08 and NA12878 whole exome sequencing samples, and GB01-GB38 and NA12878 whole genome sequencing samples.
Figure 5
Heatmap showing all called CNVs across all samples (A,B) and the overlap of called CNVs with the true CNVs (C–E). (A) Whole genome sequencing (WGS; n = 407,671) and (B) whole exome sequencing (WES; n = 9944). Each row represents a tool, and a blue field denotes that the given CNV was called by that tool. All CNVs from each sample were merged across tools, such that any overlapping calls of either duplications or deletions were combined into one. The order of rows/columns for WES data and rows for WGS data was determined using complete-linkage hierarchical clustering with Euclidean distance, while the order of columns for WGS data was determined using a combination of k-means and hierarchical clustering due to memory restrictions. Darker grey coloring (WGS only) indicates that the tool was not run for the sample which contained the CNV. (C) 2076 WGS-based and (D) 81 WES-based true CNVs in the NA12878 sample. The order of rows/columns was determined using complete-linkage hierarchical clustering with Euclidean distance. (E) CNV calling heatmap for 471 true CNVs at WGS level in 38 samples (GB01-38). The column dendrogram shows clustering to the level of 20 clusters to reduce complexity. The Quality annotation represents the probe median score from the CytoScan HD SNP-array, and Man.annot. refers to whether the CNV was independently manually confirmed. A positive quality score corresponds to duplications, and negative scores denote deletions. Darker grey coloring indicates that the tool was not run for the sample which contained the CNV. The order of rows/columns was determined using complete-linkage hierarchical clustering with Euclidean distance.
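
The merging step described in this caption (overlapping calls from any tool, whether duplication or deletion, combined into one) can be sketched as a standard interval merge that also records which tools contributed. This is a minimal illustration and not the authors' code: the input layout, the half-open coordinates, and the name `merge_calls` are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

Call = Tuple[str, int, int]  # (chrom, start, end), half-open; hypothetical layout


def merge_calls(
    calls_by_tool: Dict[str, List[Call]],
) -> List[Tuple[str, int, int, FrozenSet[str]]]:
    """Merge overlapping calls from all tools for one sample.

    CNV type is ignored on purpose, mirroring the caption: overlapping
    duplications and deletions are combined into a single merged CNV.
    Returns (chrom, start, end, tools_that_called_it) per merged CNV.
    """
    tagged = sorted(
        (chrom, start, end, tool)
        for tool, calls in calls_by_tool.items()
        for chrom, start, end in calls
    )
    merged: List[list] = []
    for chrom, start, end, tool in tagged:
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start < last[2]:
            # Overlaps the current merged region: extend it and record the tool.
            last[2] = max(last[2], end)
            last[3].add(tool)
        else:
            merged.append([chrom, start, end, {tool}])
    return [(c, s, e, frozenset(t)) for c, s, e, t in merged]


if __name__ == "__main__":
    example = {
        "toolA": [("chr1", 100, 200), ("chr1", 500, 600)],
        "toolB": [("chr1", 150, 300)],
        "toolC": [("chr1", 290, 310), ("chr2", 100, 200)],
    }
    for region in merge_calls(example):
        print(region)
```

Each merged region corresponds to one heatmap column, and the recorded tool set determines which rows would be colored blue for that column.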
Figure 6
(A) CNV calling heatmap for 7 tools and 107 true CNVs at whole exome sequencing level in 8 samples (GB01-08). The Quality annotation represents the probe median score from CytoScan HD SNP-array and the Man.annot. refers to whether the CNV was independently manually confirmed. A positive quality score corresponds to duplications, and negative scores denote deletions. The order of rows/columns was determined using complete-linkage hierarchical clustering with Euclidean distance. (B) MLPA-confirmed CNV calling results for 11 CNV calling tools. GATK gCNV is labeled as GermlineCNVCaller.
Figure 7
Maximum memory used by each tool, measured in megabytes, and total CPU time in hours while running NA12878 on 28-core machines with 128 GB RAM. Some tools can distribute tasks over nodes, and total RAM usage is reported as the total maximum.
