Genome Biol. 2020 Nov 9;21(1):274.
doi: 10.1186/s13059-020-02178-x.

SVFX: a machine learning framework to quantify the pathogenicity of structural variants

Sushant Kumar et al. Genome Biol. 2020.

Abstract

There is a lack of approaches for identifying pathogenic genomic structural variants (SVs) although they play a crucial role in many diseases. We present a mechanism-agnostic machine learning-based workflow, called SVFX, to assign pathogenicity scores to somatic and germline SVs. In particular, we generate somatic and germline training models, which include genomic, epigenomic, and conservation-based features, for SV call sets in diseased and healthy individuals. We then apply SVFX to SVs in cancer and other diseases; SVFX achieves high accuracy in identifying pathogenic SVs. Predicted pathogenic SVs in cancer cohorts are enriched among known cancer genes and many cancer-related pathways.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Machine learning-based workflow of SVFX to identify pathogenic SVs. The original SV dataset consists of disease (case) and control SVs. In the somatic model, disease SVs are somatic SVs found in a cancer cohort, and control SVs are drawn from the 1000 Genomes Project (1KG) SV dataset. We randomly select SVs from the 1KG SV dataset so that the number of control SVs matches the number of somatic SVs. Similarly, for the germline model, we have (1) disease germline SVs identified in a specific disease cohort and (2) control SVs that correspond to common SVs in the 1KG SV dataset. For both germline and somatic models, we generate 1000 random permutations of the original disease and control datasets. These permuted SVs are later used to generate a Z-score-normalized feature matrix.
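The two data-preparation steps described in this caption (count-matched control sampling and Z-score normalization against permuted SVs) can be sketched as follows. This is a minimal illustration, not the SVFX implementation: the function names are hypothetical, and it assumes that each feature value is normalized as (observed value minus the mean over permuted copies) divided by the standard deviation over those copies, which the caption does not spell out.

```python
import random
import statistics


def match_controls(disease_svs, control_pool, seed=0):
    """Randomly draw as many control SVs as there are disease SVs.

    Hypothetical helper: control_pool stands in for the 1KG SV call set.
    """
    rng = random.Random(seed)
    return rng.sample(control_pool, k=len(disease_svs))


def z_score(observed, permuted_values):
    """Z-score of one observed feature value against its permuted background.

    Assumption: the background is the same feature computed on the
    permuted copies of the SV (1000 copies in the caption).
    """
    mu = statistics.fmean(permuted_values)
    sigma = statistics.pstdev(permuted_values)
    if sigma == 0:
        # A constant background carries no information; return a neutral score.
        return 0.0
    return (observed - mu) / sigma


def z_score_matrix(observed_rows, permuted_rows):
    """Build the normalized feature matrix.

    observed_rows[i][j] is feature j of SV i.
    permuted_rows[i][j] is the list of feature-j values from permuted copies of SV i.
    """
    return [
        [z_score(obs, perm) for obs, perm in zip(row, perm_row)]
        for row, perm_row in zip(observed_rows, permuted_rows)
    ]
```

Normalizing each SV against its own permuted background, rather than against the whole cohort, is one plausible way to control for SV length and local genomic context; whether SVFX does exactly this should be checked against the paper's methods section.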
Fig. 2
Performance evaluation of somatic models that predict pathogenic SVs in various cancer types. This figure presents areas under the receiver operating characteristic curve (auROCs) on the validation datasets for large deletions (a) and duplications (b) in six different cancer cohorts: breast adenocarcinoma (BRCA), esophageal carcinoma (ESCA), liver (LIHC), ovary (OV), skin melanoma (SKCM), and stomach (STAD) cancers. Similarly, auROC plots are shown for the test datasets for large deletions (c) and duplications (d) in six different cancer cohorts. Finally, auROC plots are shown for pathogenic SVs in the validation (e) and test (f) datasets of the ClinVar, CVD, and IBD cohorts.
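For readers unfamiliar with the metric in this caption, the auROC equals the probability that a randomly chosen pathogenic SV receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the 1/0 label encoding are illustrative assumptions, not part of SVFX.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted pathogenicity scores (higher means more pathogenic).
    labels: 1 for disease/case SVs, 0 for control SVs (assumed encoding).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("auROC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # correctly ranked pair
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of SVs, which is fine for illustration; a rank-based computation gives the same value more efficiently on large call sets.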
Fig. 3
Orthogonal biological validations of somatic models in cancer. (a) Comparison of mean conservation scores for genomic regions that overlap with predicted highly pathogenic versus benign deletions (left panel) and duplications (right panel), for a model in which conservation was excluded from the original model. (b) Cancer gene enrichment values for coding regions that overlap with predicted highly pathogenic versus benign deletions (left panel) and duplications (right panel), for a model in which overlap fraction with cancer genes was excluded from the original model. (c) Example showing the ubiquitin-mediated proteolysis pathway, which is enriched among genes affected by highly pathogenic deletions at the pan-cancer level. Genes in this pathway that are affected by highly pathogenic deletions are highlighted in red. (d) Example showing the adherens junction pathway, which is enriched among genes affected by highly pathogenic duplications at the pan-cancer level. Genes in this pathway that are affected by highly pathogenic duplications are highlighted in red.
Fig. 4
Example of a highly pathogenic cancer deletion that influences coding and regulatory elements in the genome. This figure presents a highly pathogenic deletion that entirely or partially disrupts the coding and regulatory regions of three distinct genes: RIT1, SYT11, and GON4L. The regulatory elements are marked by peaks in the histone mark (H3K27ac) signals across multiple tissues. The Hi-C matrix plot shows the TAD boundaries disrupted by this deletion.
Fig. 5
Example of a highly pathogenic cancer duplication that influences coding and regulatory elements in the genome. This figure presents a highly pathogenic duplication that influences coding and regulatory elements of multiple genes, including BCL11B, SETD3, CCNK, and HHIPL1. As with the pathogenic deletion in Fig. 4, the Hi-C profile is displayed to highlight the TAD boundaries disrupted by this highly pathogenic duplication.
