BMC Bioinformatics. 2023 May 1;24(1):179.
doi: 10.1186/s12859-023-05291-3.

Faster and more accurate pathogenic combination predictions with VarCoPP2.0

Nassim Versbraegen et al. BMC Bioinformatics.

Abstract

Background: Predicting potentially pathogenic variant combinations in patients remains a key task in medical genetics for understanding and detecting oligogenic/multilocus diseases. Models tailored to such cases can help reduce the number of missing diagnoses and can aid researchers in dealing with the high complexity of the derived data. The predictor VarCoPP (Variant Combinations Pathogenicity Predictor), published in 2019, identified potentially pathogenic variant combinations in gene pairs (bilocus variant combinations) and was the first important step in this direction. Despite its usefulness and applicability, several issues hindered its performance, such as its False Positive (FP) rate, the quality of its training set and its complex architecture.

Results: We present VarCoPP2.0, the successor of VarCoPP: a simplified, faster and more accurate predictive model that identifies potentially pathogenic bilocus variant combinations. Results from cross-validation and on independent data sets show that VarCoPP2.0 has improved in both sensitivity (95% in cross-validation and 98% during testing) and specificity (5% FP rate). At the same time, its running time shows a 150-fold decrease owing to the selection of a simpler Balanced Random Forest model. Its positive training set now consists of variant combinations that are more confidently linked with evidence of pathogenicity, based on the confidence scores present in OLIDA, the Oligogenic Diseases Database ( https://olida.ibsquare.be ). The improved performance is also attributed to a more careful selection of up-to-date features identified via an original wrapper method. We show that combining different variant and gene pair features is important for predictions, highlighting the usefulness of integrating biological information at different levels.

Conclusions: Through its improved performance and faster execution time, VarCoPP2.0 enables a more accurate analysis of larger data sets linked to oligogenic diseases. Users can access the ORVAL platform ( https://orval.ibsquare.be ) to apply VarCoPP2.0 to their data.

Keywords: Balanced random forest; Oligogenic diseases; Pathogenicity predictor; Variant combinations.
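
The abstract attributes the speed-up to a single Balanced Random Forest, and the Fig. 1 caption mentions a 1:500 imbalance ratio in the training data. As a rough illustration of the general idea behind balanced forests (not the authors' implementation, which the text here does not describe), each tree can be given a bootstrap of the minority class plus an equally sized draw from the majority class. The function name `balanced_bootstrap` and the toy labels are hypothetical.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: a bootstrap of the minority class
    plus an equally sized bootstrap of the majority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    n = len(minority)
    # Sample with replacement from each class so both contribute n rows.
    sample = [rng.choice(minority) for _ in range(n)]
    sample += [rng.choice(majority) for _ in range(n)]
    return sample


# Toy labels (assumption): 4 positives and 2000 negatives, i.e. a 1:500 ratio.
labels = [1] * 4 + [0] * 2000
rng = random.Random(0)
sample = balanced_bootstrap(labels, rng)
n_pos = sum(labels[i] for i in sample)
print(len(sample), n_pos)  # prints "8 4"
```

Because every tree sees a 1:1 class ratio, one forest can be trained on the full imbalanced set without building a separate ensemble of forests on resampled data.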

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
ROC-AUC for various model structure variations. The blue line represents the evolution of APS when varying the number of decision trees in one single RF. The input for the RF was the entire set of positive instances and a balanced random sample (1:1 ratio) of negative instances. The green line represents the evolution of APS for a single balanced RF, where the number of decision trees composing the balanced RF varies. The input for each Balanced RF was the same full training data set (with its 1:500 imbalance ratio). The orange line represents the evolution of APS when varying the number of decision trees present in each of the 500 RFs in an ensemble RF model, similarly to the first version of VarCoPP. The input for each RF was the entire set of positive instances and an equal amount of negative instances, specific to each RF. The red line represents the evolution of APS for different numbers of RFs in an ensemble RF model. Each RF consisted of 100 decision trees and its input was the entire set of positive instances and an equal amount of negative instances, unique for each RF
Fig. 2
ROC and PR curves for LOGO cross-validation and independent validation. (a) ROC curve based on Balanced RF prediction probabilities in both stratified LOGO cross-validation (blue) and validation set (orange) settings. (b) PR curve based on Balanced RF prediction probabilities in both stratified LOGO cross-validation (blue) and validation set (orange) settings
Fig. 3
Density plot of the prediction probabilities for the 10,000 neutral 1KGP variant combinations used as a negative validation set. The X-axis represents the prediction probabilities, while the Y-axis shows the number of instances that were assigned the corresponding probability score. The plot visually presents the thresholds of the Disease-causing (green vertical line), 99% (orange vertical line) and 99.9% (red vertical line) confidence zones, indicating that 1% or less of all samples are to the right of the orange line (i.e. were assigned a higher probability) and similarly 0.1% or less of all samples are to the right of the red line
Fig. 4
Comparison between VarCoPP and VarCoPP2.0 in terms of execution time needed to classify a certain number of variant-combinations (x-axis), shown with logarithmic y-scale
Fig. 5
Boxplot of the Gini importance per feature in the Balanced RF trained on the entire set of training data. A higher Gini importance value indicates a higher contribution for the prediction (regardless of whether the prediction is positive or negative)
Fig. 6
Boxplot of the feature contributions for either the disease-causing or the neutral class, among all positive instances (a) or all negative instances (b) in the validation set inferred using treeinterpreter. A feature contribution value above 0 indicates a vote for a positive prediction (i.e. towards the disease-causing class), while a value below 0 indicates a vote for negative prediction (i.e. towards the neutral class). The more the feature contribution value deviates from 0, the stronger the vote is for either class
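
The Fig. 3 caption describes confidence-zone thresholds as cut-offs on the scores of a neutral validation set: only a fixed fraction of neutral combinations may score above each cut-off. A minimal sketch of that calculation follows, assuming the Disease-causing threshold corresponds to the 5% FP rate quoted in the abstract (the caption itself does not state this). The function name `threshold_for_fp_rate` and the evenly spaced toy scores are hypothetical.

```python
def threshold_for_fp_rate(neg_scores, fp_rate):
    """Smallest observed score t such that at most `fp_rate` of the
    neutral scores are strictly greater than t (requires 0 <= fp_rate < 1)."""
    ranked = sorted(neg_scores)
    # Number of neutral scores allowed above the cut-off; the small
    # epsilon guards against floating-point products like 9.999999.
    allowed = int(fp_rate * len(ranked) + 1e-9)
    return ranked[len(ranked) - allowed - 1]


# Toy neutral scores (assumption): 1000 evenly spaced values in [0, 1).
scores = [i / 1000 for i in range(1000)]

zones = {"disease-causing": 0.05, "99%": 0.01, "99.9%": 0.001}
thresholds = {name: threshold_for_fp_rate(scores, fp) for name, fp in zones.items()}

for name, t in thresholds.items():
    above = sum(s > t for s in scores)
    print(f"{name}: threshold={t}, neutral scores above={above}")
```

With real predictor output the neutral scores would come from the negative validation set rather than from an evenly spaced grid.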
