Genome Res. 2022 Apr;32(4):766-777.

doi: 10.1101/gr.275995.121. Epub 2022 Feb 23.

A framework to score the effects of structural variants in health and disease

Philip Kleinert¹, Martin Kircher¹,²

Affiliations

¹ Berlin Institute of Health (BIH) at Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany.
² Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, 23562 Lübeck, Germany.

PMID: 35197310
PMCID: PMC8997355
DOI: 10.1101/gr.275995.121

Philip Kleinert et al. Genome Res. 2022 Apr.

Abstract

Although technological advances improved the identification of structural variants (SVs) in the human genome, their interpretation remains challenging. Several methods utilize individual mechanistic principles like the deletion of coding sequence or 3D genome architecture disruptions. However, a comprehensive tool using the broad spectrum of available annotations is missing. Here, we describe CADD-SV, a method to retrieve and integrate a wide set of annotations to predict the effects of SVs. Previously, supervised learning approaches were limited due to a small number and biased set of annotated pathogenic or benign SVs. We overcome this problem by using a surrogate training objective, the Combined Annotation Dependent Depletion (CADD) of functional variants. We use human- and chimpanzee-derived SVs as proxy-neutral and contrast them with matched simulated variants as proxy-deleterious, an approach that has proven powerful for short sequence variants. Our tool computes summary statistics over diverse variant annotations and uses random forest models to prioritize deleterious structural variants. The resulting CADD-SV scores correlate with known pathogenic and rare population variants. We further show that we can prioritize somatic cancer variants as well as noncoding variants known to affect gene expression. We provide a website and offline-scoring tool for easy application of CADD-SV.

Figures

**Figure 1.**
Workflow and training data sets of the CADD-SV framework. (A) Proxy-neutral training data set of CADD-SV. Human- and chimpanzee-derived structural variants (SVs) are considered to be neutral or beneficial if they reached fixation. Therefore, previously identified human- and chimpanzee-derived SVs (Kronenberg et al. 2018) are used as a proxy-neutral training data set. (B) CADD-SV workflow. Size- and length-matched simulated variants are used as a proxy-deleterious training data set. Next, various informative features are annotated and transformed (see Methods; Supplemental Table 1) across span or flank of the variants to train multiple random forest classifiers. Models are used to score user-provided (novel) SVs. For this purpose, variants are annotated, features transformed, and models applied. The maximum value of the flank and span model scores is used as the raw model score. Further, a Phred transformation of the relative rank of the score among gnomAD-SVs provides an easy interpretation of the CADD-SV score. (C) Depiction of implementation of the four models generated from the proxy-neutral and proxy-deleterious variant sets. Whereas deletion of a novel sequence provides information about the deleted sequence in the human genome build, the insertion model relies on the site of integration. Therefore, flanking regions to the SVs are taken into account.
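
The scoring step in panel B can be sketched as follows. This is a minimal illustration, not the CADD-SV implementation: the function names `raw_score` and `phred_from_rank` are hypothetical, and the rank convention (rank 1 is the highest-scoring reference variant, with the query counted as one extra variant so the fraction is never zero) is an assumption that the caption does not specify.

```python
import math
from bisect import bisect_left


def raw_score(span_score: float, flank_score: float) -> float:
    """Raw model score: the larger of the span- and flank-model outputs."""
    return max(span_score, flank_score)


def phred_from_rank(score: float, sorted_reference: list[float]) -> float:
    """Phred-scale the relative rank of `score` among reference scores.

    `sorted_reference` must be sorted in ascending order. Assumption:
    the relative rank is the fraction of variants (reference plus the
    query itself) scoring at least as high as the query.
    """
    n = len(sorted_reference)
    # Number of reference scores that are >= the query score.
    n_ge = n - bisect_left(sorted_reference, score)
    fraction = (n_ge + 1) / (n + 1)
    return -10.0 * math.log10(fraction)


# Placeholder reference distribution, not real gnomAD-SV scores.
reference = [i / 100 for i in range(99)]
query = raw_score(span_score=0.4, flank_score=1.0)
print(phred_from_rank(query, reference))
```

Under this convention, a variant that outscores all 99 placeholder reference variants falls in the top 1% and receives a Phred score of about 20, while a variant below all of them receives a score of about 0.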

**Figure 2.**
Performance of random forest models trained on proxy-deleterious and proxy-benign SVs. (A) All models show a nonrandom separation of the two classes in a random 10% holdout. Performance is measured as sensitivity over false positive rate (FPR). Note that all training data sets contain a high amount of mislabeled SVs, as a majority of proxy-deleterious SVs is likely to be neutral. (B) Model predictions of the chimpanzee deletion model are shifted toward high-impact SVs in the simulated set of chimpanzee deletions. (C) Representation of feature importance in the chimpanzee deletion random forest model. Note that proxy-pathogenic and proxy-benign sets are length-matched and that length is not used as an explicit feature. Most important contributions come from species conservation (e.g., GERP, phastCons) but also from integrated scores (i.e., CADD or LINSIGHT). Epigenetic features as well as 3D genome architecture features, such as the Directionality Index derived from Hi-C data, also contribute to the most informative features of the models. For a full list of features and explanation of their naming, see Supplemental Table 1.
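
The performance measure in panel A, sensitivity over false positive rate, can be computed from model scores and class labels as sketched below. The helper names `roc_points` and `auc` and the toy scores are illustrative assumptions; the sketch also assumes both classes are present in the holdout.

```python
def roc_points(scores, labels):
    """Return (FPR, sensitivity) pairs from sweeping a score threshold.

    labels: 1 for proxy-deleterious, 0 for proxy-neutral.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Rank variants from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Variants with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy holdout: two proxy-deleterious and two proxy-neutral variants.
curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(curve, auc(curve))
```

Because many proxy-deleterious variants are likely neutral, as the caption notes, an area well below 1 is expected even for a useful model.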

**Figure 3.**
Validation set performance of the random forest models. (A) Summary of the performance of CADD-SV scores compared to SVScore, AnnotSV, and TAD-fusion scores across three validation sets (pathogenic variants, cancer variants, and putative eQTL SVs) for deletions, duplications, and insertions. (B) Rank of ClinVar pathogenic SVs added to SVs of healthy individuals from the 1000 Genomes Project. CADD-SV prioritizes the pathogenic SVs over the other SVs in a single simulated patient, scoring pathogenic variants in the top fifth percentile of deletions, duplications, and insertions for 65.9%, 74.7%, and 100% of simulated variant sets, respectively. (C) CADD-SV score distribution as a function of gnomAD allele frequency. Higher CADD-SV values represent an increased likelihood to be deleterious. In the deleterious tail of the score distribution, there is an excess of singletons (shown in red; bin size 0.025), which hints at negative selection against deleterious deletions. (D–F) CADD-SV performance of various validation sets compared to common gnomAD SVs (AF ≥ 0.05). Performance is measured as sensitivity over false positive rate. CADD-SV is able to identify ClinVar pathogenic SVs (n = 3262 deletions, 82 duplications, and 78 insertions, pale red) as well as SVs reported in the ICGC cancer cohort (n = 52,677 deletions, 42,972 duplications, and 18 insertions, dark red) from common SVs in gnomAD. Further, CADD-SV can identify noncoding SVs that are associated with differences in gene expression (turquoise). CADD-SV scores (solid lines) are compared to SVScore (dashed lines), AnnotSV (dotted lines), and TAD-fusion (dashed and dotted lines) for deletions (D), duplications (E), and insertions (F).
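
The simulated-patient evaluation in panel B can be illustrated with a short sketch: one pathogenic SV is added to the SVs of a healthy individual, all variants are ranked by score, and the set counts as a success if the added variant lands in the top 5%. The function names, the tie handling (ties count against the added variant), and the toy scores are assumptions for illustration only.

```python
def in_top_fraction(spiked_score, background_scores, fraction=0.05):
    """True if the spiked-in variant ranks within the top `fraction`
    of the combined variant set (rank 1 = highest score)."""
    total = len(background_scores) + 1
    # Background variants with equal or higher scores rank ahead.
    rank = 1 + sum(1 for s in background_scores if s >= spiked_score)
    return rank <= fraction * total


def share_prioritized(simulated_sets, fraction=0.05):
    """Fraction of simulated patients whose spiked-in variant is in
    the top `fraction` of their variant set."""
    hits = sum(
        in_top_fraction(spiked, background, fraction)
        for spiked, background in simulated_sets
    )
    return hits / len(simulated_sets)


# Toy data: the same 99 background scores for two simulated patients.
background = [i / 100 for i in range(99)]
sets = [(2.0, background), (0.5, background)]
print(share_prioritized(sets))
```

The percentages reported in the caption correspond to this kind of share, computed separately for deletions, duplications, and insertions.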

**Figure 4.**
Prioritizing functional variants with CADD-SV. (A) Screenshot of UCSC Genome Browser tracks of a region (Chr 4: 73,004,055–73,231,324) deleted in one individual present in the gnomAD-SV cohort. Two genes are affected, with *ANKRD17* variants being reported as causal for the autosomal dominant Chopra-Amiel-Gordon syndrome (CAGS). Various pathogenic SNVs were identified within the gene body of *ANKRD17* and are marked in red in the UCSC ClinVar track. CAGS patients are characterized by developmental delay and moderate to severe intellectual disability. Further, various positions of this SV are highly conserved among 100 vertebrate genomes, contributing to CADD-SV's power of ranking it as a putatively deleterious variant. (B) Phred-scaled CADD-SV score distribution as a function of number of genome-wide association study-identified SNVs per deletion from gnomAD-SV. Especially among high scoring SVs, the average number of GWAS-associated SNVs increases drastically, suggesting functional variants in the pathogenic tail of the CADD-SV score distribution. (C) Scoring deletions under natural selection from Ebert et al. (2021). Shown are score distributions for the functional set (blue) against the same number of randomly drawn SVs from the 1000 Genomes Project. Note that we report Phred-scaled CADD-SV scores (log₁₀ scale) with high values corresponding to high deleteriousness.

**Figure 5.**
The CADD-SV web server can score custom SV sets, but it can also be used for direct lookup of prescored deletions, duplications, and insertions from gnomAD and ClinVar, as well as call-sets from Abel et al. (2020) and Beyter et al. (2021). For a given SV, the website provides the combined model scores as well as annotation values normalized to the range in the healthy gnomAD cohort (Z-score). This enables users to identify interesting variants from color-highlighted extreme feature values and not just by the combined CADD-SV score. Further, the website provides direct links for each SV to external resources like gnomAD, Ensembl, or the UCSC Genome Browser.
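
The Z-score normalization mentioned above can be sketched as follows. This assumes a plain standardization against the cohort mean and population standard deviation; the feature names, the threshold, and the helper names `z_score` and `flag_extreme` are hypothetical and not taken from the website.

```python
from statistics import mean, pstdev


def z_score(value, cohort_values):
    """Standardize an annotation value against a cohort distribution."""
    mu = mean(cohort_values)
    sigma = pstdev(cohort_values)
    if sigma == 0:
        # A constant feature carries no information about extremeness.
        return 0.0
    return (value - mu) / sigma


def flag_extreme(annotations, cohort, threshold=2.0):
    """Names of features whose absolute Z-score reaches the threshold."""
    return [
        name
        for name, value in annotations.items()
        if abs(z_score(value, cohort[name])) >= threshold
    ]


# Placeholder cohort values, not real gnomAD annotations.
cohort = {
    "conservation": [2, 4, 4, 4, 5, 5, 7, 9],
    "integrated_score": [2, 4, 4, 4, 5, 5, 7, 9],
}
sv = {"conservation": 9, "integrated_score": 5}
print(flag_extreme(sv, cohort))
```

Flagging individual features this way mirrors the idea of spotting interesting variants from extreme annotation values rather than from the combined score alone.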

Cited by

DeepSVP: integration of genotype and phenotype for structural variant prioritization using deep learning.
Althagafi A, Alsubaie L, Kathiresan N, Mineta K, Aloraini T, Al Mutairi F, Alfadhel M, Gojobori T, Alfares A, Hoehndorf R. Bioinformatics. 2022 Mar 4;38(6):1677-1684. doi: 10.1093/bioinformatics/btab859. PMID: 34951628.
Rare pathogenic structural variants show potential to enhance prostate cancer germline testing for African men.
Hayes V, Gong T, Jiang J, Bornman R, Gheybi K, Stricker P, Weischenfeldt J, Mutambirwa S. Res Sq [Preprint]. 2024 Jun 13:rs.3.rs-4531885. doi: 10.21203/rs.3.rs-4531885/v1. Update in: Nat Commun. 2025 Mar 10;16(1):2400. doi: 10.1038/s41467-025-57312-9. PMID: 38947031.
Rare pathogenic structural variants show potential to enhance prostate cancer germline testing for African men.
Gong T, Jiang J, Uthayopas K, Bornman MSR, Gheybi K, Stricker PD, Weischenfeldt J, Mutambirwa SBA, Jaratlerdsiri W, Hayes VM. Nat Commun. 2025 Mar 10;16(1):2400. doi: 10.1038/s41467-025-57312-9. PMID: 40064858.
Rare diseases: human genome research is coming home.
Ropers HH, van Karnebeek CD. Cold Spring Harb Mol Case Stud. 2022 Mar 24;8(2):a006210. doi: 10.1101/mcs.a006210. Print 2022 Feb. PMID: 35332074.
TADA-a machine learning tool for functional annotation-based prioritisation of pathogenic CNVs.
Hertzberg J, Mundlos S, Vingron M, Gallone G. Genome Biol. 2022 Mar 1;23(1):67. doi: 10.1186/s13059-022-02631-z. PMID: 35232478.

References

1. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74. doi:10.1038/nature15393
1. Abel HJ, Larson DE, Regier AA, Chiang C, Das I, Kanchi KL, Layer RM, Neale BM, Salerno WJ, Reeves C, et al. 2020. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583: 83–89. doi:10.1038/s41586-020-2371-0
1. Abugessaisa I, Noguchi S, Hasegawa A, Harshbarger J, Kondo A, Lizio M, Severin J, Carninci P, Kawaji H, Kasukawa T. 2017. FANTOM5 CAGE profiles of human and mouse reprocessed for GRCh38 and GRCm38 genome assemblies. Sci Data 4: 170107. doi:10.1038/sdata.2017.107
1. Beyter D, Ingimundardottir H, Oddsson A, Eggertsson HP, Bjornsson E, Jonsson H, Atlason BA, Kristmundsdottir S, Mehringer S, Hardarson MT, et al. 2021. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat Genet 53: 779–786. doi:10.1038/s41588-021-00865-4
1. Calandrelli R, Wu Q, Guan J, Zhong S. 2018. GITAR: an open source tool for analysis and visualization of Hi-C data. Genomics Proteomics Bioinformatics 16: 365–372. doi:10.1016/j.gpb.2018.06.006
