Sci Rep. 2022 Jan 11;12(1):555.

doi: 10.1038/s41598-021-04505-z.

Automated prediction of the clinical impact of structural copy number variations

M Gažiová^#^{1,2}, T Sládeček^#¹, O Pös^{1,3,4}, M Števko¹, W Krampl^{1,3,4}, Z Pös^{1,3,5}, R Hekel^{1,4,6}, M Hlavačka¹, M Kucharík^{1,4}, J Radvánszky^{1,5,4}, J Budiš^{7,8,9}, T Szemes^{1,3,4}

Affiliations

¹ Geneton Ltd, 84104, Bratislava, Slovakia.
² Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, 84248, Bratislava, Slovakia.
³ Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 84215, Bratislava, Slovakia.
⁴ Comenius University Science Park, 84104, Bratislava, Slovakia.
⁵ Institute of Clinical and Translational Research, Biomedical Research Center, Slovak Academy of Sciences, 84505, Bratislava, Slovakia.
⁶ Slovak Center of Scientific and Technical Information, 81104, Bratislava, Slovakia.
⁷ Geneton Ltd, 84104, Bratislava, Slovakia. jaroslav.budis@geneton.sk.
⁸ Comenius University Science Park, 84104, Bratislava, Slovakia. jaroslav.budis@geneton.sk.
⁹ Slovak Center of Scientific and Technical Information, 81104, Bratislava, Slovakia. jaroslav.budis@geneton.sk.

^# Contributed equally.

PMID: 35017614
PMCID: PMC8752772
DOI: 10.1038/s41598-021-04505-z

M Gažiová et al. Sci Rep. 2022.

Abstract

Copy number variants (CNVs) play an important role in many biological processes, including the development of genetic diseases, making them attractive targets for genetic analyses. Interpreting the effect of these structural variants is challenging because the number of genes, regulatory elements, and other genomic elements affected by a CNV is highly variable. This has led to a demand for interpretation tools that would relieve researchers, laboratory diagnosticians, genetic counselors, and clinical geneticists of the laborious process of annotating and classifying CNVs. We designed and validated a prediction method (ISV; Interpretation of Structural Variants) that is based on boosted trees and takes into account annotations of CNVs from several publicly available databases. The presented approach achieved more than 98% prediction accuracy on both copy number loss and copy number gain variants, while also allowing CNVs to be assigned "uncertain" significance in predictions. We believe that ISV's prediction capability and explainability have great potential to guide users to more precise interpretations and classifications of CNVs.

Conflict of interest statement

All authors are employees of Geneton Ltd., where they also participate in the development of a commercial application for the annotation and interpretation of CNVs. The presented method was filed as a patent application under the number PCT/EP2020/025292. Apart from the above, the authors have declared no conflicts of interest.

Figures

**Figure 1**
Diagram of the datasets used and the preprocessing steps. In all analyses, we evaluated only CNVs larger than 1 kbp. ClinVar CNVs smaller than 5 Mbp, with a multiplicity of 1 for losses or 3 for gains, were used for training, model validation, and the basic testing set used in the final evaluation of the chosen model. CNVs with other multiplicities formed an additional testing set [Testing (multiple)], as did CNVs larger than 5 Mbp [Testing (> 5 Mbps)]. Likely benign and likely pathogenic CNVs and CNVs of uncertain significance were also evaluated, together with CNVs from the basic Testing set. As additional evaluation sets, potentially benign variants were collected from the gnomAD database and pathogenic CNVs from the DECIPHER and OMIM databases (implemented with app.diagram.net).
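The size and multiplicity rules in this caption can be sketched as a small routing function. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `assign_split`, the split labels, the treatment of lengths exactly at 1 kbp or 5 Mbp, and the precedence of the size rule over the multiplicity rule are all hypothetical choices that the caption does not specify.

```python
MIN_LENGTH_BP = 1_000        # caption: only CNVs larger than 1 kbp are evaluated
MAX_LENGTH_BP = 5_000_000    # caption: the main set uses CNVs smaller than 5 Mbp

# Multiplicity (copy number) expected in the main set, per the caption.
MAIN_MULTIPLICITY = {"loss": 1, "gain": 3}


def assign_split(cnv_type: str, multiplicity: int, length_bp: int) -> str:
    """Route one ClinVar CNV to a dataset (labels are hypothetical)."""
    if cnv_type not in MAIN_MULTIPLICITY:
        raise ValueError(f"unknown CNV type: {cnv_type!r}")
    if length_bp <= MIN_LENGTH_BP:
        # Assumption: a CNV of exactly 1 kbp is not "larger than 1 kbp".
        return "excluded"
    if length_bp >= MAX_LENGTH_BP:
        # Assumption: the size rule is checked before the multiplicity rule,
        # and a CNV of exactly 5 Mbp goes to the large-CNV testing set.
        return "testing_gt_5mbp"
    if multiplicity != MAIN_MULTIPLICITY[cnv_type]:
        return "testing_multiple"
    # Remaining CNVs are shared between training, validation, and basic testing.
    return "train_validation_test"
```

For example, `assign_split("loss", 1, 250_000)` returns `"train_validation_test"`, while a loss with multiplicity 0 and the same length is routed to `"testing_multiple"`.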

**Figure 2**
A 2-dimensional representation of the training datasets. We used the tSNE algorithm implemented in the scikit-learn package with default hyperparameters. Each dot represents a CNV, either benign (green) or pathogenic (red) (implemented with matplotlib package, version 3.3.2).

**Figure 3**
Comparison of the predictive capability of the five studied models at three different probability thresholds (validation dataset). In the top row, the models classify all CNVs as either benign or pathogenic. “Correctly” predicted CNVs (those in line with the ClinVar classification, either benign or pathogenic) are in green, while “incorrectly” predicted ones (those whose prediction does not match the ClinVar classification) are in red. The middle and bottom rows allow for uncertain predictions (shown in gray) when the probability of pathogenicity lies in the interval (1 − P_ct, P_ct), where P_ct is the probability threshold. The x-axis represents individual CNVs and corresponds to the sizes of the validation datasets. “Included” represents the percentage of CNVs evaluated by ISV with a clear outcome, i.e. with a probability either above the probability threshold (P_ct) or below 1 − P_ct (implemented with matplotlib package, version 3.3.2, and pandas package, version 1.1.3).
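The thresholding rule described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the ISV implementation: the function names `classify` and `included_fraction` are hypothetical, and the handling of probabilities exactly equal to P_ct or 1 − P_ct (treated here as clear outcomes) is an assumption, since the caption only defines the open interval of uncertain predictions.

```python
from typing import Sequence


def classify(p_pathogenic: float, p_ct: float) -> str:
    """Three-way call from a predicted probability of pathogenicity."""
    if not 0.5 <= p_ct <= 1.0:
        raise ValueError("p_ct is expected to lie in [0.5, 1.0]")
    if p_pathogenic >= p_ct:
        return "pathogenic"
    if p_pathogenic <= 1.0 - p_ct:
        return "benign"
    # Probability falls inside (1 - p_ct, p_ct): no clear outcome.
    return "uncertain"


def included_fraction(probabilities: Sequence[float], p_ct: float) -> float:
    """Share of CNVs with a clear (non-uncertain) outcome."""
    if not probabilities:
        raise ValueError("at least one probability is required")
    clear = sum(1 for p in probabilities if classify(p, p_ct) != "uncertain")
    return clear / len(probabilities)
```

With P_ct = 0.5 the uncertain interval is empty, so every CNV receives a benign or pathogenic call, which corresponds to the top row of the figure. Raising P_ct widens the uncertain interval and lowers the "Included" share.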

**Figure 4**
Evaluation of ISV on CNVs with the standard five-tier classification generally used for genomic variants in Mendelian diseases. Each CNV is represented by a dot, and the colors reflect only the five-tier ClinVar classification, i.e. neither the ISV prediction nor the “matching” status between ISV and ClinVar. The ISV prediction of pathogenicity is shown on the y-axis, where a value of 1.0 means a pathogenic prediction and 0.0 means a benign prediction. Please note that these classes of variants are recommended by the respective ACMG/AMP guidelines. The sizes of the datasets are provided in parentheses under the classification labels (implemented with seaborn package, version 0.11.0).

**Figure 5**
Numbers of correct (green), incorrect (red), and uncertain (gray) predictions on the test data. For ClassifyCNV and AnnotSV we treated likely benign and likely pathogenic predictions as uncertain significance. If we treated them as benign/pathogenic instead, we observed an increase in false predictions, while the added percentage of CNVs was not enough to categorize this as an improvement in the model’s performance (see Supplementary Fig. S11). The StrVCTVRE algorithm only classifies exonic CNVs, thus the ones shown as uncertain significance correspond to ones outside of exonic regions (implemented with matplotlib package, version 3.3.2 and pandas package, version 1.1.3).

**Figure 6**
Evaluation of the ISV tool on gnomAD data. The x-axis represents the population frequencies of CNVs (black dots), and the y-axis the ISV probability of pathogenicity. The figure shows that the majority of frequently occurring CNVs were classified as benign by ISV, while those with a higher probability of pathogenicity occur rarely (implemented with seaborn package, version 0.11.0).

**Figure 7**
Evaluation of pathogenic microdeletions and pathogenic microduplications, stratified into two classes, showing that inclusion of the critical region/gene may not be sufficient for a correct prediction (implemented with seaborn package, version 0.11.0).

**Figure 8**
Force plot showing the contributions of individual attributes to the final prediction for a CNV causing Prader-Willi and Angelman syndrome (chr15:22760000–28560000). Bars represent the individual attributes contributing to the prediction for this CNV, with bar widths reflecting the strength of each attribute. In this case, all attributes contribute towards pathogenicity; this will not always be the case. The base value is the prior baseline, to which the individual contributions are added or from which they are subtracted. If the values of all attributes were equal to 0, the final prediction would be equal to the base value. Attributes are ordered by their strength in the prediction, with “regulatory elements” being the strongest contributing genomic attribute. Hi-genes = haploinsufficient genes. The plot was constructed using functions from the SHAP package (version 0.37.0).
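The additive reading of a force plot (base value plus signed per-attribute contributions) can be illustrated with plain Python. This sketch does not call the SHAP package, and every number and attribute name other than "regulatory elements" is made up for illustration. It also assumes that the contributions are expressed in log-odds and mapped to a probability with a sigmoid, which is a common setup for boosted-tree classifiers but is not stated in the caption.

```python
import math
from typing import Dict, List


def model_output(base_value: float, contributions: Dict[str, float]) -> float:
    """Base value plus the sum of signed per-attribute contributions."""
    return base_value + sum(contributions.values())


def sigmoid(x: float) -> float:
    """Map a log-odds value to a probability (assumed link function)."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_by_strength(contributions: Dict[str, float]) -> List[str]:
    """Order attributes by absolute contribution, strongest first."""
    return sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
base_value = -0.5
contributions = {
    "regulatory elements": 1.2,
    "protein coding genes": 0.8,
    "hi-genes": 0.5,
}

margin = model_output(base_value, contributions)
probability = sigmoid(margin)
ranking = rank_by_strength(contributions)
```

With no contributions at all, `model_output(base_value, {})` returns the base value itself, which is the additive property the caption refers to.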

**Figure 9**
Circular genome plot with annotations by ISV. We divided the genome into 1 Mbp long non-overlapping CNVs and predicted their impact with ISV. The orange track shows probabilities of pathogenicity for copy number loss variants while the blue track shows this for copy number gain variants. The two inner tracks show the numbers of overlapped protein coding genes (black line) and overlapped curated regulatory elements (green line). The outer track shows the estimated chromosome bands according to the G-banding pattern. The plot was constructed using the R package circlize, version 0.4.2.
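Dividing a chromosome into 1 Mbp non-overlapping windows, as described in this caption, can be sketched as follows. The function name `tile_chromosome`, the 0-based half-open coordinates, and the truncation of the final window at the chromosome end are assumptions made for illustration; the caption does not say how the last, shorter segment was handled.

```python
from typing import Iterator, Tuple

Window = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def tile_chromosome(chrom: str, length_bp: int, window_bp: int = 1_000_000) -> Iterator[Window]:
    """Yield consecutive non-overlapping windows covering one chromosome."""
    if length_bp < 0:
        raise ValueError("length_bp must be non-negative")
    if window_bp <= 0:
        raise ValueError("window_bp must be positive")
    for start in range(0, length_bp, window_bp):
        # Assumption: the last window is truncated at the chromosome end.
        yield chrom, start, min(start + window_bp, length_bp)
```

Each window could then be scored twice, once as a copy number loss and once as a gain, to produce the two probability tracks described in the caption.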

Grants and funding

APVV-18-0319/Agentúra na Podporu Výskumu a Vývoja
