Genome Res. 2022 Apr;32(4):766-777.
doi: 10.1101/gr.275995.121. Epub 2022 Feb 23.

A framework to score the effects of structural variants in health and disease

Philip Kleinert et al. Genome Res. 2022 Apr.

Abstract

Although technological advances improved the identification of structural variants (SVs) in the human genome, their interpretation remains challenging. Several methods utilize individual mechanistic principles like the deletion of coding sequence or 3D genome architecture disruptions. However, a comprehensive tool using the broad spectrum of available annotations is missing. Here, we describe CADD-SV, a method to retrieve and integrate a wide set of annotations to predict the effects of SVs. Previously, supervised learning approaches were limited due to a small number and biased set of annotated pathogenic or benign SVs. We overcome this problem by using a surrogate training objective, the Combined Annotation Dependent Depletion (CADD) of functional variants. We use human- and chimpanzee-derived SVs as proxy-neutral and contrast them with matched simulated variants as proxy-deleterious, an approach that has proven powerful for short sequence variants. Our tool computes summary statistics over diverse variant annotations and uses random forest models to prioritize deleterious structural variants. The resulting CADD-SV scores correlate with known pathogenic and rare population variants. We further show that we can prioritize somatic cancer variants as well as noncoding variants known to affect gene expression. We provide a website and offline-scoring tool for easy application of CADD-SV.

Figures

Figure 1.
Workflow and training data sets of the CADD-SV framework. (A) Proxy-neutral training data set of CADD-SV. Human- and chimpanzee-derived structural variants (SVs) are considered to be neutral or beneficial if they reached fixation. Therefore, previously identified human- and chimpanzee-derived SVs (Kronenberg et al. 2018) are used as a proxy-neutral training data set. (B) CADD-SV workflow. Size- and length-matched simulated variants are used as a proxy-deleterious training data set. Next, various informative features are annotated and transformed (see Methods; Supplemental Table 1) across span or flank of the variants to train multiple random forest classifiers. Models are used to score user-provided (novel) SVs. For this purpose, variants are annotated, features transformed, and models applied. The maximum value of the flank and span model scores is used as the raw model score. Further, a Phred transformation of the relative rank of the score among gnomAD-SVs provides an easy interpretation of the CADD-SV score. (C) Depiction of implementation of the four models generated from the proxy-neutral and proxy-deleterious variant sets. Whereas deletion of a novel sequence provides information about the deleted sequence in the human genome build, the insertion model relies on the site of integration. Therefore, flanking regions to the SVs are taken into account.
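The scoring step described in panel B can be sketched as follows. This is a minimal illustration, not the CADD-SV implementation: the function names, the tie handling, and the floor of one background variant are assumptions made here, and the background list stands in for the raw scores of gnomAD-SVs.

```python
import math
from bisect import bisect_left


def raw_score(span_score: float, flank_score: float) -> float:
    """Raw model score: the maximum of the span and flank model scores."""
    return max(span_score, flank_score)


def phred_from_rank(raw: float, background_sorted: list[float]) -> float:
    """Phred-scale the relative rank of `raw` among background scores.

    Assumption: the relative rank is the fraction of background scores
    that are greater than or equal to `raw`, floored at one variant so
    that the logarithm stays finite.
    `background_sorted` must be sorted in ascending order.
    """
    n = len(background_sorted)
    n_at_or_above = n - bisect_left(background_sorted, raw)
    fraction = max(n_at_or_above, 1) / n
    return -10.0 * math.log10(fraction)


# Hypothetical background of 1000 raw scores (placeholder for gnomAD-SVs).
background = [float(i) for i in range(1000)]
score = raw_score(span_score=850.0, flank_score=990.0)
print(round(phred_from_rank(score, background), 2))
```

Under this convention a Phred-scaled score of 10 places a variant in the top 10% of the background, 20 in the top 1%, and 30 in the top 0.1%.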
Figure 2.
Performance of random forest models trained on proxy-deleterious and proxy-benign SVs. (A) All models show a nonrandom separation of the two classes in a random 10% holdout. Performance is measured as sensitivity over false positive rate (FPR). Note that all training data sets contain a high amount of mislabeled SVs, as a majority of proxy-deleterious SVs is likely to be neutral. (B) Model predictions of the chimpanzee deletion model are shifted toward high-impact SVs in the simulated set of chimpanzee deletions. (C) Representation of feature importance in the chimpanzee deletion random forest model. Note that proxy-pathogenic and proxy-benign sets are length-matched and that length is not used as an explicit feature. Most important contributions come from species conservation (e.g., GERP, phastCons) but also from integrated scores (i.e., CADD or LINSIGHT). Epigenetic features as well as 3D genome architecture features, such as the Directionality Index derived from Hi-C data, also contribute to the most informative features of the models. For a full list of features and explanation of their naming, see Supplemental Table 1.
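For readers unfamiliar with the metric in panel A, the sketch below shows how sensitivity and false positive rate are obtained at a single score threshold. The toy scores, the labels (1 for proxy-deleterious, 0 for proxy-neutral), and the function name are hypothetical and are not taken from the paper.

```python
def sensitivity_and_fpr(scores, labels, threshold):
    """Return (sensitivity, false positive rate) at one score threshold.

    A variant is called deleterious when its score is >= threshold.
    labels: 1 = proxy-deleterious, 0 = proxy-neutral (toy convention).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    sensitivity = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return sensitivity, fpr


# Toy holdout set; sweeping the threshold traces out the curve.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
for t in (0.5, 0.25):
    print(t, sensitivity_and_fpr(scores, labels, t))
```

Because many proxy-deleterious variants are likely neutral, as the caption notes, the curve measured on the holdout understates how well truly deleterious variants are separated.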
Figure 3.
Validation set performance of the random forest models. (A) Summary of the performance of CADD-SV scores compared to SVScore, AnnotSV, and TAD-fusion scores across three validation sets (pathogenic variants, cancer variants, and putative eQTL SVs) for deletions, duplications, and insertions. (B) Rank of ClinVar pathogenic SVs added to SVs of healthy individuals from the 1000 Genomes Project. CADD-SV prioritizes the pathogenic SVs over the other SVs in a single simulated patient, scoring pathogenic variants in the top fifth percentile of deletions, duplications, and insertions for 65.9%, 74.7%, and 100% of simulated variant sets, respectively. (C) CADD-SV score distribution as a function of gnomAD allele frequency. Higher CADD-SV values represent an increased likelihood to be deleterious. In the deleterious tail of the score distribution, there is an excess of singletons (shown in red; bin size 0.025), which hints at negative selection against deleterious deletions. (D–F) CADD-SV performance on various validation sets compared to common gnomAD SVs (AF ≥ 0.05). Performance is measured as sensitivity over false positive rate. CADD-SV is able to identify ClinVar pathogenic SVs (n = 3262 deletions, 82 duplications, and 78 insertions, pale red) as well as SVs reported in the ICGC cancer cohort (n = 52,677 deletions, 42,972 duplications, and 18 insertions, dark red) from common SVs in gnomAD. Further, CADD-SV can identify noncoding SVs that are associated with differences in gene expression (turquoise). CADD-SV scores (solid lines) are compared to SVScore (dashed lines), AnnotSV (dotted lines), and TAD-fusion (dashed and dotted lines) for deletions (D), duplications (E), and insertions (F).
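The spike-in experiment in panel B amounts to asking whether one pathogenic SV ranks within the top 5% of a simulated patient's variants. The sketch below shows one way to make that check; the rank convention, the function name, and the toy scores are assumptions for illustration and may differ from how the authors computed their percentages.

```python
def in_top_percentile(pathogenic_score, background_scores, percent=5.0):
    """Check whether a spiked-in SV ranks in the top `percent` of a patient.

    Assumption: rank 1 is the highest score, and the pathogenic SV is
    counted as one of the patient's variants (n background + 1 total).
    """
    n_higher = sum(1 for s in background_scores if s > pathogenic_score)
    rank = n_higher + 1
    total = len(background_scores) + 1
    return rank / total <= percent / 100.0


# Toy simulated patient: 99 background SVs plus one spiked-in SV.
patient_svs = [float(i) for i in range(99)]
print(in_top_percentile(95.0, patient_svs))  # rank 4 of 100
print(in_top_percentile(90.0, patient_svs))  # rank 9 of 100
```

Repeating this check over many simulated patients and taking the fraction of successes gives a figure comparable in kind to the percentages quoted in the caption.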
Figure 4.
Prioritizing functional variants with CADD-SV. (A) Screenshot of UCSC Genome Browser tracks of a region (Chr 4: 73,004,055–73,231,324) deleted in one individual present in the gnomAD-SV cohort. Two genes are affected, with ANKRD17 variants being reported as causal for the autosomal dominant Chopra-Amiel-Gordon syndrome (CAGS). Various pathogenic SNVs were identified within the gene body of ANKRD17 and are marked in red in the UCSC ClinVar track. CAGS patients are characterized by developmental delay and moderate to severe intellectual disability. Further, various positions of this SV are highly conserved among 100 vertebrate genomes, contributing to CADD-SV's power of ranking it as a putatively deleterious variant. (B) Phred-scaled CADD-SV score distribution as a function of number of genome-wide association study-identified SNVs per deletion from gnomAD-SV. Especially among high scoring SVs, the average number of GWAS-associated SNVs increases drastically, suggesting functional variants in the pathogenic tail of the CADD-SV score distribution. (C) Scoring deletions under natural selection from Ebert et al. (2021). Shown are score distributions for the functional set (blue) against the same number of randomly drawn SVs from the 1000 Genomes Project. Note that we report Phred-scaled CADD-SV scores (log10 scale) with high values corresponding to high deleteriousness.
Figure 5.
The CADD-SV web server can score custom SV sets, but it can also be used for direct lookup of prescored deletions, duplications, and insertions from gnomAD and ClinVar, as well as call-sets from Abel et al. (2020) and Beyter et al. (2021). For a given SV, the website provides the combined model scores as well as annotation values normalized to the range in the healthy gnomAD cohort (Z-score). This enables users to identify interesting variants from color-highlighted extreme feature values and not just by the combined CADD-SV score. Further, the website provides direct links for each SV to external resources like gnomAD, Ensembl, or the UCSC Genome Browser.
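The Z-score normalization mentioned above can be illustrated with a short sketch. It assumes the plain definition (value minus cohort mean, divided by cohort standard deviation); the use of the population standard deviation, the highlight cutoff of 2, and all names and numbers here are hypothetical and are not taken from the website.

```python
from statistics import mean, pstdev


def z_score(value, cohort_values):
    """Z-score of one annotation value relative to a reference cohort.

    Assumption: population standard deviation; returns 0.0 when the
    cohort shows no variation, to avoid dividing by zero.
    """
    sd = pstdev(cohort_values)
    if sd == 0:
        return 0.0
    return (value - mean(cohort_values)) / sd


def is_extreme(value, cohort_values, cutoff=2.0):
    """Flag values far from the cohort mean (hypothetical cutoff)."""
    return abs(z_score(value, cohort_values)) >= cutoff


# Toy annotation values for a reference cohort (placeholder data).
cohort = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(z_score(5.0, cohort), 3))
print(is_extreme(9.0, cohort))
```

Expressing each annotation on this common scale is what lets unusual feature values stand out for a given SV even when its combined score is unremarkable.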

References

    1. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74. doi: 10.1038/nature15393
    2. Abel HJ, Larson DE, Regier AA, Chiang C, Das I, Kanchi KL, Layer RM, Neale BM, Salerno WJ, Reeves C, et al. 2020. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583: 83–89. doi: 10.1038/s41586-020-2371-0
    3. Abugessaisa I, Noguchi S, Hasegawa A, Harshbarger J, Kondo A, Lizio M, Severin J, Carninci P, Kawaji H, Kasukawa T. 2017. FANTOM5 CAGE profiles of human and mouse reprocessed for GRCh38 and GRCm38 genome assemblies. Sci Data 4: 170107. doi: 10.1038/sdata.2017.107
    4. Beyter D, Ingimundardottir H, Oddsson A, Eggertsson HP, Bjornsson E, Jonsson H, Atlason BA, Kristmundsdottir S, Mehringer S, Hardarson MT, et al. 2021. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat Genet 53: 779–786. doi: 10.1038/s41588-021-00865-4
    5. Calandrelli R, Wu Q, Guan J, Zhong S. 2018. GITAR: an open source tool for analysis and visualization of Hi-C data. Genomics Proteomics Bioinformatics 16: 365–372. doi: 10.1016/j.gpb.2018.06.006
