BMC Bioinformatics. 2021 Apr 19;22(1):198.

doi: 10.1186/s12859-021-04077-9.

Eye-color and Type-2 diabetes phenotype prediction from genotype data using deep learning methods

Muhammad Muneeb^#¹, Andreas Henschel^#²

Affiliations

¹ Department of Electrical Engineering and Computer Science, Center for Biotechnology Khalifa University, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.
² Department of Electrical Engineering and Computer Science, Center for Biotechnology Khalifa University, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates. andreas.henschel@ku.ac.ae.

^# Contributed equally.

PMID: 33874881
PMCID: PMC8056510
DOI: 10.1186/s12859-021-04077-9

Muhammad Muneeb et al. BMC Bioinformatics. 2021.

Erratum in

Correction to: Eye-color and Type-2 diabetes phenotype prediction from genotype data using deep learning methods.
Muneeb M, Henschel A. BMC Bioinformatics. 2021 Jun 11;22(1):319. doi: 10.1186/s12859-021-04218-0. PMID: 34116644. Free PMC article.

Abstract

Background: Genotype-phenotype prediction is of great importance in genetics because it can help identify the genetic mutations that cause variation among humans. Approaches for finding genotype-phenotype associations fall broadly into two classes: statistical techniques and machine learning. Statistical techniques are well suited to finding the actual SNPs that cause variation, whereas machine learning techniques are well suited to classifying people into different categories. In this article, we examined the eye-color and Type-2 diabetes phenotypes. The proposed technique is a hybrid approach that combines elements of statistical techniques with machine learning.

Results: The main dataset for the eye-color phenotype consists of 806 people: 404 have Blue-Green eyes and 402 have Brown eyes. After preprocessing, we generated 8 different datasets containing different numbers of SNPs, using the mutation difference and thresholding at each individual SNP. We calculated three types of mutation at each SNP: no mutation, partial mutation, and full mutation. The data were then transformed for the machine learning algorithms. We used about 9 classifiers (Random Forest, Extreme Gradient Boosting, ANN, LSTM, GRU, BiLSTM, 1DCNN, ensembles of ANN, and ensembles of LSTM), which gave best accuracies of 0.91, 0.9286, 0.945, 0.94, 0.94, 0.92, 0.95, and 0.96, respectively. Stacked ensembles of LSTM outperformed the other algorithms for 1560 SNPs, with an overall accuracy of 0.96, AUC = 0.98 for Brown eyes, and AUC = 0.97 for Blue-Green eyes. The main dataset for Type-2 diabetes consists of 107 people, of whom 30 are classified as cases and 74 as controls. We used different linear thresholds to find the optimal number of SNPs for classification. The final model gave an accuracy of 0.97.
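The mutation encoding and threshold-based SNP selection described in the Results can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: genotypes are biallelic two-character strings, a reference allele is known for each SNP, the three mutation types are coded as 0 (none), 1 (partial), and 2 (full), and "mutation difference" is read as the absolute difference in mean mutation level between the two phenotype groups. The function names, toy data, and threshold are hypothetical.

```python
def encode_genotype(genotype, ref_allele):
    """Count non-reference alleles: 0 = no, 1 = partial, 2 = full mutation."""
    return sum(1 for allele in genotype if allele != ref_allele)


def select_snps(samples, labels, ref_alleles, threshold):
    """Keep SNP indices whose mean mutation level differs between the two
    phenotype groups by at least `threshold` (an assumed selection rule)."""
    selected = []
    for j, ref in enumerate(ref_alleles):
        groups = {0: [], 1: []}
        for genotypes, label in zip(samples, labels):
            groups[label].append(encode_genotype(genotypes[j], ref))
        mean0 = sum(groups[0]) / len(groups[0])
        mean1 = sum(groups[1]) / len(groups[1])
        if abs(mean0 - mean1) >= threshold:
            selected.append(j)
    return selected


# Toy data (made up): 4 people, 3 SNPs; label 0 = Brown, 1 = Blue-Green.
samples = [
    ["AA", "CT", "GG"],
    ["AA", "CC", "GG"],
    ["GG", "CT", "GG"],
    ["AG", "CC", "GG"],
]
labels = [0, 0, 1, 1]
ref_alleles = ["A", "C", "G"]

selected = select_snps(samples, labels, ref_alleles, threshold=1.0)
print(selected)
```

Raising or lowering `threshold` changes how many SNPs survive, which is one way several datasets with different SNP counts could be produced from the same samples.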

Conclusion: Genotype-phenotype predictions are very useful, especially in forensics. These predictions can help identify associations between SNP variants and traits or diseases. Given more data, the prediction accuracy of machine learning models can be improved. Moreover, the non-linearity of the machine learning model and the combination of SNP mutations during training improve prediction. We considered binary classification problems, but the proposed approach can be extended to multi-class classification.

Keywords: Bioinformatics; Eye color; Genotype–phenotype; Machine learning; Type-2 diabetes.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Flowchart of the machine learning approach for genotype-phenotype prediction. This flowchart presents an overview of the hybrid approach. After data cleaning, multiple datasets containing different numbers of SNPs were generated using mutation thresholding. Different machine learning algorithms with various hyper-parameters were considered for training the model

**Fig. 2**
Artificial neural network structure. Selected SNPs are passed to a fully connected network. Each connection represents a weight learned by the model. The number of hidden layers and the number of neurons in each layer can be changed. Each circle is a processing unit that applies an activation function to a combination of inputs from the previous layer. Because this is a binary classification problem, the output layer contains 2 processing units [39]
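A forward pass through such a fully connected network can be sketched as below. This is only an illustration under stated assumptions: the caption does not name the activation functions, so ReLU for the hidden layer and softmax for the 2-unit output layer are assumed, the layer sizes are arbitrary, and random weights stand in for weights that training would learn.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus a bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights (assumption: training is not shown here)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
snps = [0, 1, 2, 0, 2]                   # encoded SNPs for one person (toy values)
w1, b1 = init_layer(len(snps), 4, rng)   # hidden layer with 4 units
w2, b2 = init_layer(4, 2, rng)           # output layer with 2 units

hidden = relu(dense(snps, w1, b1))
probs = softmax(dense(hidden, w2, b2))
print(probs)  # two class probabilities
```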

**Fig. 3**
One-dimensional CNN architecture. Selected SNPs are passed to a 1DCNN. N represents the size of the input layer, and X, Y, and Z represent the filter sizes for the first, second, and third layers. A and B represent the number of filters in the first and second layers. Because it is a 1DCNN, the kernel (filter) has one dimension equal to 1 and the other is variable. The number of hidden layers, the number of filters in each layer, and the filter size can all be changed, and choosing them well is important for forming a proper model. Finally, the output of the last 1DCNN layer, after global averaging, is connected to the fully connected network, in which the number of layers and the number of neurons in each layer can also be changed [45]
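The two operations named in this caption, a one-dimensional filter sliding over the SNP vector and global averaging of each resulting feature map, can be sketched in a few lines. The filter values and input are made up for illustration, a single convolution layer is shown rather than the full stack, and no bias or activation is applied.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution (cross-correlation, as in most deep learning libraries)."""
    k = len(kernel)
    return [sum(kernel[i] * sequence[start + i] for i in range(k))
            for start in range(len(sequence) - k + 1)]


def global_average(feature_map):
    """Collapse one feature map to a single number."""
    return sum(feature_map) / len(feature_map)


snps = [0, 1, 2, 2, 0, 1]                        # N = 6 encoded SNPs (toy values)
filters = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]    # 2 filters of size 3 (made up)

feature_maps = [conv1d(snps, f) for f in filters]
pooled = [global_average(fm) for fm in feature_maps]
print(feature_maps)
print(pooled)
```

Each filter yields one pooled value regardless of N, so the vector passed on to the fully connected network has one entry per filter in the last convolution layer.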

**Fig. 4**
LSTM architecture. Selected SNPs are passed to an LSTM cell

**Fig. 5**
Random forest. After preprocessing, the dataset is passed to each decision tree. Each decision tree is trained on the training data, and for each test sample the prediction from every decision tree is considered. The final decision for each sample is based on majority voting. The depth of the tree determines the number of SNPs used for classification. SNPs with high information gain appear at the top
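The majority-voting step can be illustrated on its own. The sketch below assumes the trees have already been trained and simply combines their predictions; the per-tree outputs are hypothetical, and an odd number of trees is used so that ties cannot occur.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of 5 trees for 3 test samples
# (0 = Brown, 1 = Blue-Green).
per_tree = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]

per_sample = list(zip(*per_tree))  # regroup: one tuple of votes per sample
forest_prediction = [majority_vote(votes) for votes in per_sample]
print(forest_prediction)
```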

**Fig. 6**
Extreme gradient boosting. XGBOOST trains models in succession, with each new model being trained to correct the errors made by the previous ones. Models are added sequentially until no further improvements can be made
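The idea of training each new model on the errors of the previous ones can be sketched with plain gradient boosting on squared error, using one-split decision stumps on a single feature. This is a simplified stand-in, not XGBoost itself, which additionally uses regularization and second-order gradient information. The data, learning rate, and fixed number of rounds are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:          # skip the split that puts everything on one side
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate):
    """Add stumps one at a time, each fitted to the current residuals."""
    predictions = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return stumps, losses


xs = [0, 0, 1, 1, 2, 2]   # encoded value of one SNP (toy data)
ys = [0, 0, 0, 1, 1, 1]   # phenotype label (toy data)
stumps, losses = boost(xs, ys, rounds=5, learning_rate=0.5)
print(losses)  # training loss after each added stump
```

A stopping rule closer to the caption would end the loop once the loss stops improving instead of running a fixed number of rounds.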

**Fig. 7**
Ensemble approach. “Find the best models” shows the approach used to find the best models: every combination of the different parameters is executed, which makes it computationally expensive to find the models to include in the ensemble. “Ensemble of best models” shows the final model. All the models used in the ensemble are non-trainable; their outputs are combined and connected to a fully connected network to produce the final model
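The stacking idea, frozen base models whose outputs feed a small trainable combiner, can be sketched as follows. This is an assumption-laden illustration: the base-model outputs are hypothetical probabilities rather than outputs of real LSTMs, and a single logistic unit trained by gradient descent stands in for the fully connected network on top.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_meta(base_outputs, labels, epochs=500, lr=0.5):
    """Fit a logistic combiner on the outputs of frozen base models."""
    n_models = len(base_outputs[0])
    weights, bias = [0.0] * n_models, 0.0
    for _ in range(epochs):
        for outputs, y in zip(base_outputs, labels):
            p = sigmoid(sum(w * o for w, o in zip(weights, outputs)) + bias)
            grad = p - y  # derivative of the log-loss with respect to the logit
            weights = [w - lr * grad * o for w, o in zip(weights, outputs)]
            bias -= lr * grad
    return weights, bias


def predict_meta(weights, bias, outputs):
    p = sigmoid(sum(w * o for w, o in zip(weights, outputs)) + bias)
    return 1 if p >= 0.5 else 0


# Hypothetical class-1 probabilities from 3 frozen base models for 6 samples.
base_outputs = [
    [0.9, 0.4, 0.8],
    [0.8, 0.6, 0.7],
    [0.7, 0.3, 0.9],
    [0.2, 0.6, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.7, 0.3],
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_meta(base_outputs, labels)
stacked = [predict_meta(weights, bias, o) for o in base_outputs]
print(weights, bias)
print(stacked)
```

Only the combiner's weights change during training; the base-model outputs are treated as fixed inputs, which mirrors the "non-trainable" models in the caption.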

**Fig. 8**
Confusion matrices of the 10 LSTM models used for the stacked ensemble model. A few models are good at classifying Brown eyes and others at Blue-Green. For example, Model 3 classifies Brown eyes very well, whereas Model 4 performs well on Blue-Green. When the results of such models are combined, an optimal result is obtained. A few models, such as Model 7, perform equally well for both classes
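How a confusion matrix reveals that one model favors one class and another model favors the other can be shown with a small sketch. The labels and predictions are invented for illustration and do not correspond to the models in the figure.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=2):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


true_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # 0 = Brown, 1 = Blue-Green
model_a = [0, 0, 0, 0, 1, 0, 0, 1]       # hypothetical model, strong on class 0
model_b = [0, 1, 1, 0, 1, 1, 1, 1]       # hypothetical model, strong on class 1

cm_a = confusion_matrix(true_labels, model_a)
cm_b = confusion_matrix(true_labels, model_b)
print(cm_a, per_class_recall(cm_a))
print(cm_b, per_class_recall(cm_b))
```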

**Fig. 9**
Accuracy and loss of the best ensemble of LSTM during training. The final stacked model is trained for 10 epochs to avoid overfitting. The first plot shows the model accuracy on the training data and the second plot shows the model loss on the training data

**Fig. 10**
Confusion matrix of the best ensemble of LSTM

**Fig. 11**
ROC of the best ensemble of LSTM. The AUC for class 0 (Brown eyes) is 0.98 and the AUC for class 1 (Blue-Green eyes) is 0.98
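The area under a ROC curve can be computed without plotting, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition on invented scores; it is meant to show what the reported number measures, not to reproduce the figure.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count one half); equal to the area under the ROC curve."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]   # hypothetical predicted P(class 1)
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))
```

The pairwise loop is quadratic in the number of samples, which is fine for datasets of a few hundred people; rank-based formulas give the same value more efficiently on large data.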

**Fig. 12**
Confusion matrix of the best random forest model

**Fig. 13**
ROC for the best random forest model. The AUC for class 0 (controls) is 0.95 and the AUC for class 1 (cases) is 0.95
