Am J Hum Genet. 2021 Oct 7;108(10):1946-1963.
doi: 10.1016/j.ajhg.2021.08.010. Epub 2021 Sep 15.

Identifying digenic disease genes via machine learning in the Undiagnosed Diseases Network

Souhrid Mukherjee et al. Am J Hum Genet. 2021.

Abstract

Rare diseases affect millions of people worldwide, and discovering their genetic causes is challenging. More than half of the individuals analyzed by the Undiagnosed Diseases Network (UDN) remain undiagnosed. The central hypothesis of this work is that many of these rare genetic disorders are caused by multiple variants in more than one gene. However, given the large number of variants in each individual genome, experimentally evaluating combinations of variants for potential to cause disease is currently infeasible. To address this challenge, we developed the digenic predictor (DiGePred), a random forest classifier for identifying candidate digenic disease gene pairs by features derived from biological networks, genomics, evolutionary history, and functional annotations. We trained the DiGePred classifier by using DIDA, the largest available database of known digenic-disease-causing gene pairs, and several sets of non-digenic gene pairs, including variant pairs derived from unaffected relatives of UDN individuals. DiGePred achieved high precision and recall in cross-validation and on a held-out test set (PR area under the curve > 77%), and we further demonstrate its utility by using digenic pairs from the recent literature. In contrast to other approaches, DiGePred also appropriately controls the number of false positives when applied in realistic clinical settings. Finally, to enable the rapid screening of variant gene pairs for digenic disease potential, we freely provide the predictions of DiGePred on all human gene pairs. Our work enables the discovery of genetic causes for rare non-monogenic diseases by providing a means to rapidly evaluate variant gene pairs for the potential to cause digenic disease.

Keywords: UDN; Undiagnosed Diseases Network; clinical prediction; digenic disease; machine learning; oligogenic disease; rare disease.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Training sets and features used for machine-learning-based identification of digenic disease gene pairs. (A) Digenic gene pairs (positives) were derived from the Digenic Diseases Database (DIDA). Unique gene pair combinations (n = 140) were used for training and testing. Non-digenic gene pairs (negatives) were derived from unaffected relatives of UDN individuals. Genes with rare variants in the same individual were used as an unaffected non-digenic gene pair. We also considered several other negative training sets, including random gene pairs, permuted pairs of genes involved in digenic pairs, and gene pairs matched to attributes of digenic pairs (Figure S1). (B) We considered six network and functional features (NFFs) for training the digenic disease classifiers: (1) “pathway similarity,” Jaccard similarity of pathway annotations from KEGG and Reactome for both genes; (2) “phenotype similarity,” Jaccard similarity of phenotype annotations from HPO for both genes; (3) “co-expression rank,” co-expression rank of the gene pair compared to all other gene pairs across multiple tissues from COXPRESdb; (4–6) “network distances” between the genes on protein-protein, pathway, and literature-mined interaction networks from the UCSC gene and pathway interaction browser database. We also trained classifiers considering additional evolutionary and functional features (Figure S2).
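The Jaccard similarity used for the pathway and phenotype features is the size of the intersection of two annotation sets divided by the size of their union. The sketch below is a minimal illustration, not the authors' implementation: the function name, the placeholder annotation IDs, and the choice to return 0.0 when both genes lack annotations are all assumptions.

```python
def jaccard_similarity(annotations_a, annotations_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of annotation IDs."""
    set_a, set_b = set(annotations_a), set(annotations_b)
    union = set_a | set_b
    if not union:
        # Assumption: a pair with no annotations at all gets similarity 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical pathway annotations for two genes (placeholder IDs only).
gene_a_pathways = {"pathway_1", "pathway_2", "pathway_3"}
gene_b_pathways = {"pathway_2", "pathway_3", "pathway_4"}

# Two shared annotations out of four distinct ones.
pathway_similarity = jaccard_similarity(gene_a_pathways, gene_b_pathways)
print(pathway_similarity)
```

The same function would apply unchanged to phenotype annotations, since it only depends on set membership.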
Figure 2
Schematic of the protocol for training and evaluating the DiGePred digenic disease pair classifier. Known digenic pairs (positives) and variant gene pairs from healthy individuals (negatives) were combined at a ∼1:75 ratio. The combined pairs were divided into training (64%), validation (16%), and held-out test datasets (20%). The DiGePred random forest classifier was trained and cross-validated with the training and validation sets. The final performance estimate for the trained DiGePred classifier was quantified by the receiver operating characteristic (ROC) area under the curve (AUC) and precision-recall (PR) AUC on the held-out test set. This set was also used for establishing suggested thresholds on the continuous DiGePred score. DiGePred’s potential clinical utility was further demonstrated by applying it to an additional positive set of 13 novel digenic pairs from the recent literature, one novel gene pair in a resolved UDN individual, and an external set of non-digenic gene pairs from 38 unaffected relatives of UDN individuals.
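The 64%/16%/20% division can be sketched as a per-class (stratified) split, which keeps the roughly 1:75 positive-to-negative ratio in every subset. This is an illustrative sketch only: whether the authors stratified the split is an assumption, and the function name, seed, and synthetic pair IDs below are hypothetical.

```python
import random


def stratified_split(pairs, labels, seed=0):
    """Split pairs into train/validation/test (64%/16%/20%) within each class."""
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(set(labels)):
        members = [p for p, y in zip(pairs, labels) if y == label]
        rng.shuffle(members)
        n_test = round(0.20 * len(members))
        n_val = round(0.16 * len(members))
        splits["test"] += [(p, label) for p in members[:n_test]]
        splits["validation"] += [(p, label) for p in members[n_test:n_test + n_val]]
        splits["train"] += [(p, label) for p in members[n_test + n_val:]]
    return splits


# Synthetic example at a 1:75 ratio: 100 positives and 7,500 negatives.
pairs = [f"pos_{i}" for i in range(100)] + [f"neg_{i}" for i in range(7500)]
labels = [1] * 100 + [0] * 7500

splits = stratified_split(pairs, labels)
for name, subset in splits.items():
    n_pos = sum(1 for _, y in subset if y == 1)
    print(name, len(subset), n_pos)
```

Note that 64% and 16% are what results from first holding out 20% and then dividing the remaining 80% in an 80:20 ratio.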
Figure 3
Random forest classifiers can accurately distinguish digenic and non-digenic gene pairs via different feature sets. (A and B) Performance of classifiers at distinguishing between known digenic pairs from DIDA (positives) and gene pairs from 25 healthy individuals (negatives) trained via different feature sets as evaluated by receiver operating characteristic (ROC) curves (A) and precision-recall (PR) curves (B). Classifiers trained on three sets of features are compared: (1) six network and functional features (NFFs) (dotted line); (2) the six NFFs and evolutionary genomics features; and (3) the six NFFs, evolutionary genomics features, and gene-level network and functional features. The mean curves across 10-fold cross-validation on the training and validation sets are plotted with shaded areas representing the standard deviation. Because this analysis is developing and evaluating multiple possible classifiers, we held out the test set for final evaluation (Figure 4).
Figure 4
Classifiers accurately distinguish digenic pairs from non-digenic pairs on held-out test sets. (A and B) ROC (A) and PR (B) curves for random forest classifiers trained with all features on digenic gene pairs and various negative sets (indicated in the legend) and evaluated on the appropriate held-out test sets. These test sets consisted of DIDA held-out pairs as positives and six different held-out negative sets: (1) “unaffected,” derived from healthy relatives of UDN individuals (light blue); (2) “permuted,” derived by generating permutations of known digenic pairs (orange); (3) “random,” derived by randomly selecting pairs of genes (dark green); (4) “matched,” derived by matching the distribution of network and functional features observed among the digenic pairs (gray); (5) “unaffected no gene overlap,” derived from healthy relatives of UDN individuals and no genes in common between the training and test datasets (dark blue); (6) “random no gene overlap,” derived by randomly selecting pairs of genes with no genes in common between the training and test datasets (light green). The ROC AUCs were >0.97 in all cases, while the PR AUCs were >0.6 in all cases. In all subsequent analyses, the “unaffected no gene overlap” classifier will be referred to as “DiGePred.”
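The gap between the ROC and PR figures is typical of heavily imbalanced data, and a small example can show why. The sketch below computes ROC AUC as the fraction of positive-negative pairs ranked correctly, and uses average precision as one common estimate of PR AUC; whether the authors computed PR AUC this way is an assumption, and all scores here are synthetic.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one positive")
    return total / hits


# Synthetic scores: two positives, with one negative ranked between them.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
roc_small, ap_small = roc_auc(scores, labels), average_precision(scores, labels)

# Add 94 easy negatives that score below every positive.
scores_large = scores + [0.05] * 94
labels_large = labels + [0] * 94
roc_large = roc_auc(scores_large, labels_large)
ap_large = average_precision(scores_large, labels_large)

print(roc_small, ap_small)
print(roc_large, ap_large)
```

Adding easy negatives raises the ROC AUC but leaves average precision unchanged, so PR-based metrics are the more demanding summary when negatives vastly outnumber positives.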
Figure 5
DiGePred accurately identifies novel digenic pairs from the recent literature. Geometric shapes in red indicate the DiGePred scores assigned to 13 novel digenic pairs reported in the recent literature. The dashed pink and purple lines represent the DiGePred score thresholds that maximize the F1 (0.156) and the F0.5 (0.496) metrics (Figure S8). Given the importance of precision in clinical applications, we propose the score maximizing the F0.5 metric or higher as a threshold for calling a gene pair digenic. At this threshold, 11 of the 13 novel digenic pairs are predicted to be digenic with a low expected false positive rate (≤0.14%). All digenic pairs score above the F1 threshold. The DiGePred classifier was trained with all features and the unaffected no gene overlap set as negatives.
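The F-beta score is (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall; β = 0.5 weights precision more heavily than β = 1, which is why the F0.5-optimal threshold tends to be the stricter of the two. The following sketch picks the threshold that maximizes F-beta on a toy set of scores. It is an illustration under assumptions: the function names and data are hypothetical, and ties are broken in favor of the lowest threshold.

```python
def f_beta(precision, recall, beta):
    """F-beta score; returns 0.0 when precision and recall are both zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scores, labels, beta):
    """Return (threshold, F-beta) maximizing F-beta when calling score >= threshold positive."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best_f:  # strict comparison keeps the lowest tied threshold
            best_t, best_f = t, score
    return best_t, best_f


# Synthetic classifier scores and true labels (1 = digenic, 0 = non-digenic).
scores = [0.95, 0.7, 0.6, 0.5, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0]

f1_threshold, f1_value = best_threshold(scores, labels, beta=1.0)
f05_threshold, f05_value = best_threshold(scores, labels, beta=0.5)
print(f1_threshold, f05_threshold)
```

On this toy data the F0.5-optimal threshold is higher than the F1-optimal one, mirroring the ordering of the two thresholds reported in the caption.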
Figure 6
DiGePred has a low false positive rate and outperforms a recent digenic gene prediction method. The number of digenic pairs identified for each of 38 healthy relatives of UDN individuals is plotted at a range of DiGePred thresholds (x axis) and for the highest confidence predictions (99% threshold) of the ORVAL/VarCoPP method. The DiGePred score thresholds that maximize the F1 and F0.5 metrics on the held-out data are shown in pink and purple, respectively. Because the individuals considered are healthy, any predicted digenic disease pairs are very likely false positives. DiGePred predicts significantly fewer digenic pairs at each threshold than ORVAL (Mann-Whitney U test, p values above each bar). At the F0.5 threshold, DiGePred predicts an average of under four digenic pairs per healthy individual and none above the 0.9 threshold, while ORVAL predicts an average of 830 digenic pairs per healthy individual at its strictest threshold (Figure S10). Results were similar for classifiers trained on other negative sets (Figures S10–S18).
