Nat Genet. 2023 Sep;55(9):1589-1597.

doi: 10.1038/s41588-023-01449-0. Epub 2023 Aug 21.

GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data

Mehrtash Babadi^#¹, Jack M Fu^#² ³ ⁴, Samuel K Lee^#⁵, Andrey N Smirnov^#⁵, Laura D Gauthier⁵, Mark Walker⁵ ³, David I Benjamin⁵, Xuefang Zhao² ³ ⁴, Konrad J Karczewski² ⁶ ⁷, Isaac Wong² ³, Ryan L Collins² ³, Alba Sanchis-Juan² ³ ⁴, Harrison Brand² ³ ⁴, Eric Banks⁵, Michael E Talkowski⁸ ⁹ ¹⁰ ¹¹ ¹²

Affiliations

¹ Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA. mehrtash@broadinstitute.org.
² Program in Medical and Population Genetics and Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
³ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁴ Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
⁵ Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁶ Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.
⁷ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁸ Program in Medical and Population Genetics and Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA. talkowsk@broadinstitute.org.
⁹ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA. talkowsk@broadinstitute.org.
¹⁰ Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA. talkowsk@broadinstitute.org.
¹¹ Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA. talkowsk@broadinstitute.org.
¹² Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA. talkowsk@broadinstitute.org.

^# Contributed equally.

PMID: 37604963
PMCID: PMC10904014
DOI: 10.1038/s41588-023-01449-0

Mehrtash Babadi et al. Nat Genet. 2023 Sep.

Erratum in

Author Correction: GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data.
Babadi M, Fu JM, Lee SK, Smirnov AN, Gauthier LD, Walker M, Benjamin DI, Zhao X, Karczewski KJ, Wong I, Collins RL, Sanchis-Juan A, Brand H, Banks E, Talkowski ME. Nat Genet. 2024 Mar;56(3):553. doi: 10.1038/s41588-024-01663-4. PMID: 38263447. No abstract available.

Abstract

Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the Genome Analysis Toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo coding CNVs on genome organization and human disease, we developed GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. We benchmarked GATK-gCNV in 7,962 exomes from individuals in quartet families with matched genome sequencing and microarray data, finding up to 95% recall of rare coding CNVs at a resolution of more than two exons. We used GATK-gCNV to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank, and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits. In summary, GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1.**
GATK-gCNV pipeline steps. a, Coverage information is collected from genome-aligned reads over a set of predefined genomic intervals. b, The original interval list is filtered to remove coverage outliers, unmappable genomic sequence, and regions of segmental duplications. c, Samples are clustered into batches based on read-depth profile similarity and each batch is processed separately. d, Chromosomal ploidies are inferred using total read-depth of each chromosome. e, The GATK-gCNV model learns read-depth bias and noise and iteratively updates copy number state posterior probabilities until a self-consistent state is obtained; after convergence, constant copy number segments are found using the Viterbi algorithm along with segmentation quality scores. **Abbreviations:** CN - copy number; QS - quality score.
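
The segmentation step in panel e can be illustrated with a toy Viterbi decoder. This is a minimal sketch and not the GATK-gCNV model: it assumes Poisson read counts, a fixed expected depth per copy (`depth_per_copy`), a symmetric transition matrix (`stay_prob`), and a small background rate for copy number 0 (`null_rate`). All of these names and values are hypothetical, and the real tool learns bias and noise terms that are omitted here.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing k reads when lam reads are expected."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_copy_number(counts, depth_per_copy, states=(0, 1, 2, 3, 4),
                        stay_prob=0.99, null_rate=0.5):
    """Most likely copy number per interval under a toy HMM (assumed model)."""
    n = len(states)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def emit(count, cn):
        # Expected depth scales with copy number; CN=0 gets a small background rate.
        lam = depth_per_copy * cn if cn > 0 else null_rate
        return poisson_logpmf(count, lam)

    def step(i, j):
        return log_stay if i == j else log_switch

    # Uniform prior over states for the first interval.
    score = [emit(counts[0], cn) - math.log(n) for cn in states]
    back = []
    for count in counts[1:]:
        # For each target state j, remember the best predecessor state.
        pointers = [max(range(n), key=lambda i, j=j: score[i] + step(i, j))
                    for j in range(n)]
        score = [score[pointers[j]] + step(pointers[j], j) + emit(count, states[j])
                 for j in range(n)]
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[i] for i in path]


def to_segments(path):
    """Collapse a per-interval path into (first_index, last_index, copy_number)."""
    segs, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segs.append((start, i - 1, path[start]))
            start = i
    return segs
```

With `counts = [100, 98, 105, 52, 49, 47, 101, 99]` and `depth_per_copy=50`, the three intervals with roughly halved depth decode as a single CN=1 segment flanked by CN=2.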

**Fig. 2.**
Calling and benchmarking of the GATK-gCNV callset in a cohort of more than 7,000 samples with matching deep WGS. a, A heatmap illustration of the distinct read-count signal of the 7,981 intervals chosen for the batch creation procedure. b, After normalizing for median read count, the first three PCs are clustered to determine which samples will be processed together with GATK-gCNV, colored by the assigned batch. c, For each of the 14 batches generated, a random subset of 200 samples was chosen to generate a read-count model using cohort-mode; the remaining samples were processed in case-mode. d, The recall (and e, precision) of rare CNVs in GATK-gCNV WES CNVs compared to WGS gold-standard CNVs as a function of the number of exons the variant spans. f, The recall (and g, precision) of de novo CNVs in GATK-gCNV compared to gold-standard WGS CNVs as a function of the number of exons. h, The recall (and i, precision) of rare CNVs in GATK-gCNV, XHMM, CONIFER, cn.mops, and ExomeDepth WES CNVs compared to WGS gold-standard CNVs as a function of the number of exons the variant spans. **Abbreviations:** PC - principal component; PCA - principal component analysis; WES - exome sequencing; WGS - whole genome sequencing.
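
Recall and precision stratified by exon count, as plotted in panels d to i, can be sketched as follows. The matching rule is an assumption made for illustration (same sample, chromosome, and CNV type with at least 50% reciprocal overlap); the caption does not state the criterion the authors used. The dictionary keys and the `min_overlap` parameter are hypothetical.

```python
from collections import defaultdict


def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions between intervals a and b."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def is_supported(call, reference, min_overlap=0.5):
    """True if any reference call matches on sample, chromosome, type, and overlap."""
    return any(
        call["sample"] == ref["sample"]
        and call["chrom"] == ref["chrom"]
        and call["type"] == ref["type"]
        and reciprocal_overlap(call["start"], call["end"],
                               ref["start"], ref["end"]) >= min_overlap
        for ref in reference
    )


def rate_by_exon_count(calls, reference, min_overlap=0.5):
    """Fraction of `calls` supported by `reference`, grouped by exons spanned."""
    hits, totals = defaultdict(int), defaultdict(int)
    for call in calls:
        totals[call["exons"]] += 1
        hits[call["exons"]] += is_supported(call, reference, min_overlap)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


def cnv(sample, chrom, start, end, cnv_type, exons):
    """Small helper to build a call record for the toy example."""
    return {"sample": sample, "chrom": chrom, "start": start,
            "end": end, "type": cnv_type, "exons": exons}


# Toy data: WGS calls act as the gold standard, WES calls are being evaluated.
wgs_calls = [
    cnv("s1", "chr1", 100, 200, "DEL", 1),
    cnv("s1", "chr1", 1000, 5000, "DUP", 3),
    cnv("s2", "chr2", 100, 900, "DEL", 3),
]
wes_calls = [
    cnv("s1", "chr1", 1100, 5000, "DUP", 3),
    cnv("s2", "chr2", 5000, 6000, "DEL", 2),
]

recall = rate_by_exon_count(wgs_calls, wes_calls)     # gold-standard calls recovered
precision = rate_by_exon_count(wes_calls, wgs_calls)  # evaluated calls confirmed
```

Swapping the two arguments is what turns recall into precision: recall asks how many gold-standard calls are recovered, precision asks how many evaluated calls are confirmed.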

**Fig. 3.**
A high-quality rare CNV callset was generated on 200,624 exomes from the UK Biobank (UKBB) using GATK-gCNV. a, The variant-size distribution of high-quality, rare CNVs in the UKBB as a function of the number of exons each variant spans. b, The distribution of the number of rare, high-quality CNVs per sample in the UKBB. c, Using 177,158 UKBB samples with matching CMA data, we find excellent validation of high-quality GATK-gCNV WES calls using Genome STRiP Intensity Rank Sum testing. d, GD CNV rates in the UKBB GATK-gCNV WES callset were highly concordant with rates from previous reports based on UKBB CMA data. e, The number of rare deletions observed over a gene in the UKBB GATK-gCNV callset is tightly correlated with LOEUF, with grey band representing LOESS smoothing of the 95% confidence intervals on corresponding point estimates. f, The number of rare duplications observed over a gene in the UKBB GATK-gCNV callset is also strongly correlated with the pTriplo score measuring intolerance to duplications, with grey band representing LOESS smoothing of the 95% confidence intervals on corresponding point estimates. g, The number of high-confidence duplications (IED) with both breakpoints within the boundaries of a gene is also correlated with LOEUF, with grey band representing LOESS smoothing of the 95% confidence intervals on corresponding point estimates. h, 16p11.2 deletions are associated with a significant increase in normalized BMI (n=41 carried a CN=1 deletion, n=61 carried a CN=3 duplication, and 169,711 individuals were copy normal; boxplot corresponding to first, second, and third quartile of data, with whiskers denoting 1.5x interquartile range). i, PDZK1 deletions are associated with a significant decrease in normalized urate levels (n=145 carried a CN=1 deletion overlapping, n=143 carried a duplication of CN=3 overlapping, and 161,773 individuals were copy normal; boxplot corresponding to first, second, and third quartile of data, with whiskers denoting 1.5x interquartile range). j, CST3 duplications are significantly associated with decreased normalized eGFR values (n=6 carried a CN=3 duplication overlapping, n=3 carried a CN=3 duplication overlapping, and 162,666 individuals were copy normal; boxplot corresponding to first, second, and third quartile of data, with whiskers denoting 1.5x interquartile range), on par with eGFR of individuals with renal failure (n=5,455). **Abbreviations:** CNV - copy number variation; DEL - deletion; DUP - duplication; CMA - chromosomal microarray; UKBB - UK Biobank; LOEUF - loss-of-function observed over expected upper bound fraction; pTriplo - probability of triplosensitivity; IED - intragenic exonic duplication; GD - genomic disorder; WES - exome sequencing; BMI - body mass index; eGFR - estimated glomerular filtration rate.
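
The boxplot convention used in panels h to j (quartiles with whiskers at 1.5x the interquartile range) can be computed as in the sketch below. It assumes the common Tukey reading, in which each whisker ends at the most extreme observation still inside the fence, and it uses linear interpolation between data points for the quartiles. The caption does not say which quantile method the authors used, and the input values are made up.

```python
import statistics


def boxplot_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5x IQR convention."""
    data = sorted(values)
    # 'inclusive' interpolates linearly between observed data points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],    # most extreme point inside the lower fence
        "whisker_high": inside[-1],  # most extreme point inside the upper fence
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up values with one clear outlier.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Here the quartiles are 3.25, 5.5, and 7.75, so the upper fence sits at 14.5 and the value 30 is reported as an outlier rather than stretching the whisker.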
