NAR Genom Bioinform. 2020 Feb 21;2(1):lqaa011.
doi: 10.1093/nargab/lqaa011. eCollection 2020 Mar.

An interpretable low-complexity machine learning framework for robust exome-based in-silico diagnosis of Crohn's disease patients

Daniele Raimondi et al. NAR Genom Bioinform. 2020.

Abstract

Whole exome sequencing (WES) data are allowing researchers to pinpoint the causes of many Mendelian disorders. In time, sequencing data will be crucial to solving the genome interpretation puzzle, which aims at uncovering the genotype-to-phenotype relationship, but for the moment many conceptual and technical problems need to be addressed. In particular, very few attempts at the in-silico diagnosis of oligo-to-polygenic disorders have been made so far, owing to the complexity of the challenge, the relative scarcity of data, and issues such as batch effects and data heterogeneity, which are confounding factors for machine learning (ML) methods. Here, we propose a method for the exome-based in-silico diagnosis of Crohn's disease (CD) patients that addresses many of the current methodological issues. First, we devise a rational, ML-friendly feature representation for WES data based on the gene mutational burden concept, which is suitable for datasets with small sample sizes. Second, we propose a Neural Network (NN) with parameter tying and heavy regularization, in order to limit its complexity and thus the risk of over-fitting. We trained and tested our NN on three CD case-control datasets, comparing its performance with that of the participants in previous CAGI challenges. We show that, notwithstanding its limited complexity, the NN outperforms the previous approaches. Moreover, we interpret the NN predictions by analyzing the learned patterns at the variant and gene level and by investigating the decision process leading to each prediction.
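
To make the effect of parameter tying concrete, the following sketch counts trainable parameters for a tied and an untied version of the gene-level layer. It is a back-of-the-envelope illustration, not the authors' code: the 11 features per gene come from the Figure 1 caption, while the number of genes (100) and the presence of one bias term per neuron are assumptions made only for this example.

```python
def tied_param_count(n_genes: int, n_features: int) -> int:
    """One shared gene neuron (weights + bias) plus one output neuron.

    Assumes a bias term on each neuron; the source does not state this.
    """
    shared_gene_neuron = n_features + 1   # same weights reused for every gene
    output_neuron = n_genes + 1           # one weight per gene activation
    return shared_gene_neuron + output_neuron


def untied_param_count(n_genes: int, n_features: int) -> int:
    """A separate gene neuron per gene, plus the same output neuron."""
    per_gene_neurons = n_genes * (n_features + 1)
    output_neuron = n_genes + 1
    return per_gene_neurons + output_neuron


# Hypothetical size: 100 genes, 11 features per gene.
print(tied_param_count(100, 11))    # 113
print(untied_param_count(100, 11))  # 1301
```

Under these assumptions, sharing one set of gene-level weights reduces the parameter count by roughly an order of magnitude, which is the kind of complexity reduction the abstract refers to.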


Figures

Figure 1.
Structure of the CDkoma NN. The feature vector describing each exome is a tensor with shape (Fg × Ng), encoding the Ng most relevant CD genes. The G neuron is applied iteratively to each of the 11-dimensional feature vectors encoding the CD genes, each time using the same weights Wg. The Ng LeakyReLU activations produced by the G neuron are filtered by a Dropout layer, and the final neuron P aggregates the resulting activations into a logistic regression that yields the final, probability-like diagnostic score.
Figure 2.
Normalized absolute values of the weights learned by the G neuron on the CAGI3 (blue) and CAGI4 (orange) datasets.
Figure 3.
The 25 most relevant genes in the models trained on the CAGI3 and CAGI4 datasets. An asterisk before the gene name indicates that the gene was selected as highly relevant when training on both datasets.
Figure 4.
NN gene activation patterns when predicting the CAGI4 dataset. Samples are listed on the y-axis, and asterisks indicate positive samples. The genes with the highest activations are shown on the x-axis. Red indicates that the activation pushes towards the positive class (CD case), while blue indicates genes voting for the negative class (controls).
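
The forward pass described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the CDkoma implementation: the function and argument names, the LeakyReLU slope of 0.01, the bias terms, and the inverted-dropout scaling are all choices made for this example. The per-gene contributions returned alongside the score are one plausible way to obtain signed gene-level values like those plotted in Figure 4; the source does not specify how those were computed.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    # Slope value is an assumption; the source only names the activation.
    return np.where(x > 0, x, slope * x)


def forward(exome, w_g, b_g, w_p, b_p, drop_rate=0.0, rng=None):
    """Score one exome.

    exome : array of shape (Fg, Ng), one feature column per gene
    w_g   : array of shape (Fg,), weights shared across all genes (tied)
    w_p   : array of shape (Ng,), weights of the final neuron P
    """
    # Apply the same G neuron to every gene column: result has shape (Ng,).
    gene_act = leaky_relu(w_g @ exome + b_g)

    # Dropout is only active when a rate and a random generator are given
    # (training); inverted scaling keeps the expected activation unchanged.
    if drop_rate > 0.0 and rng is not None:
        keep = rng.random(gene_act.shape) >= drop_rate
        gene_act = gene_act * keep / (1.0 - drop_rate)

    # Signed contribution of each gene to the final logit.
    contributions = w_p * gene_act
    logit = contributions.sum() + b_p
    score = 1.0 / (1.0 + np.exp(-logit))  # logistic output in (0, 1)
    return score, contributions


# Toy usage with random inputs (5 hypothetical genes, 11 features each).
rng = np.random.default_rng(0)
toy_exome = rng.random((11, 5))
toy_score, toy_contrib = forward(
    toy_exome, rng.normal(size=11), 0.0, rng.normal(size=5), 0.0
)
```

Because `w_g` is reused for every gene, the model learns a single weighting of the 11 features, which is what Figure 2 visualizes for the two training datasets.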
