Review
. 2022 Sep;141(9):1515-1528.
doi: 10.1007/s00439-021-02402-z. Epub 2021 Dec 4.

What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics


Anthony M Musolf et al. Hum Genet. 2022 Sep.

Abstract

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods also evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k-nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
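The feature-importance idea the abstract describes can be sketched with permutation importance (one common model-agnostic metric): shuffle one feature's values and measure how much prediction accuracy drops. A minimal, self-contained sketch with a hypothetical toy model and made-up genotype data (all names and values here are illustrative, not from the review):

```python
import random

# Toy model: predicts case/control from two "SNP" features coded 0/1/2.
# Feature 0 is informative (>= 1 risk allele -> case); feature 1 is noise.
def model(x):
    return 1 if x[0] >= 1 else 0

def accuracy(X, y):
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)

def permutation_importance(X, y, feature, seed=0):
    """Drop in accuracy after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(X, y)
    column = [x[feature] for x in X]
    rng.shuffle(column)
    X_perm = [list(x) for x in X]
    for row, value in zip(X_perm, column):
        row[feature] = value
    return baseline - accuracy(X_perm, y)

# Hypothetical genotypes and phenotypes; labels follow feature 0 exactly.
X = [[0, 2], [1, 0], [2, 1], [0, 1], [2, 2], [1, 1]]
y = [0, 1, 1, 0, 1, 1]
imp0 = permutation_importance(X, y, 0)  # informative feature
imp1 = permutation_importance(X, y, 1)  # noise feature: shuffling changes nothing
```

Because the toy model ignores feature 1, its permutation importance is exactly zero, while shuffling feature 0 degrades accuracy; this is the sense in which importance scores point at the features responsible for a good prediction.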


Conflict of interest statement

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Figures

Fig. 1
k-nearest neighbors. A diagram showing an example of the k-nearest neighbor machine. Subjects are plotted based on feature values, and an individual’s classification is determined by a majority vote in the subject’s neighborhood (k). The choice of k is crucial to classification. For instance, if we wished to classify the green individual based on k = 4, the individual would be classified as blue. If we extended this to k = 9, the individual would be classified as red
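The k-dependence in this caption is easy to reproduce in a few lines. A minimal sketch with hypothetical 2-D feature values mirroring the figure (the coordinates and the "blue"/"red" labels are invented for illustration): a small blue cluster sits closest to the query point, while red points dominate the wider neighborhood, so the vote flips as k grows.

```python
import math
from collections import Counter

def knn_classify(point, data, k):
    """Majority vote among the k nearest labelled neighbours."""
    by_distance = sorted(data, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical subjects plotted by two feature values.
data = [((1, 0), "blue"), ((0, 1), "blue"), ((1, 1), "blue"),
        ((2, 0), "red"), ((0, 2), "red"), ((2, 2), "red"),
        ((3, 0), "red"), ((0, 3), "red"), ((3, 3), "red")]
query = (0.5, 0.5)  # the "green individual" to classify

small_k = knn_classify(query, data, 4)  # 3 blue + 1 red -> "blue"
large_k = knn_classify(query, data, 9)  # 3 blue + 6 red -> "red"
```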
Fig. 2
Classification and Regression Trees (CART) and Random Forest. a Diagram showing a single CART. CARTs take a heterogeneous group of data and repeatedly split on feature values to create more homogeneous groups. b Diagram showing a random forest. A random forest is a collection of CARTs, each running on a slightly different subset of the same data set
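Both halves of this figure can be sketched together: a single split (the simplest possible CART, a depth-1 "stump") chooses the feature and threshold that best separate the labels, and a random forest bags many such trees over bootstrap resamples and takes a majority vote. The data and every name below are illustrative, assuming genotypes coded as 0/1/2 allele counts:

```python
import random
from collections import Counter

def best_stump(X, y):
    """Fit one CART split: the (feature, threshold) pair that minimises
    misclassifications when each side predicts its majority label."""
    best, best_err = None, None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [lab for x, lab in zip(X, y) if x[f] <= t]
            right = [lab for x, lab in zip(X, y) if x[f] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            err = sum(l != lmaj for l in left) + sum(r != rmaj for r in right)
            if best is None or err < best_err:
                best, best_err = (f, t, lmaj, rmaj), err
    if best is None:  # no valid split (e.g. all rows identical in a resample)
        maj = Counter(y).most_common(1)[0][0]
        best = (0, float("inf"), maj, maj)
    return best

def stump_predict(stump, x):
    f, t, lmaj, rmaj = stump
    return lmaj if x[f] <= t else rmaj

def random_forest(X, y, n_trees=25, seed=0):
    """Bagging: each tree sees a slightly different bootstrap resample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, x):
    """Majority vote across all trees in the forest."""
    return Counter(stump_predict(s, x) for s in forest).most_common(1)[0][0]

# Toy genotype data; feature 0 drives the label.
X = [[0, 1], [1, 0], [2, 1], [0, 0], [2, 2], [1, 1]]
y = [0, 1, 1, 0, 1, 1]
forest = random_forest(X, y)
```

A full CART would recurse on each side of the split until the groups are homogeneous; the stump above keeps the sketch short while showing the split criterion and the bagging that distinguishes a forest from a single tree.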
Fig. 3
Artificial neural networks. A schematic of an artificial neural network. Data are analyzed by different models, the results of which are passed onto a new set of models. In this example, data are first analyzed in the input layer (blue). The results are then passed onto an intermediate layer, called a hidden layer (green). Finally, the results of the hidden layer are passed onto and analyzed by the models of the output layer (red)
Fig. 4
Deep learning. A schematic of a deep learning machine. Deep learning is a specialized form of artificial neural network that contains many additional hidden layers
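Structurally, the only change from the previous figure is depth: the same layer operation is applied many times before the output layer. A sketch with fixed illustrative weights (shared across layers purely to keep the example short; real networks learn distinct weights per layer):

```python
import math

def relu(z):
    return max(0.0, z)

def layer(inputs, weights, biases, act):
    """One layer: each neuron takes a weighted sum of all inputs."""
    return [act(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def deep_forward(x, n_hidden=8, width=3):
    """Many hidden layers applied in sequence, then a sigmoid output."""
    w = [[0.3] * width for _ in range(width)]  # toy shared weights
    b = [0.1] * width
    h = x
    for _ in range(n_hidden):  # the "many additional hidden layers"
        h = layer(h, w, b, relu)
    return layer(h, [[0.5] * width], [0.0],
                 lambda z: 1 / (1 + math.exp(-z)))

out = deep_forward([0.2, 0.1, 0.4])
```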
Fig. 5
Support vector machines. A diagram showing an example of a support vector machine. Subjects are plotted based on feature values, and a special boundary called the hyperplane is formed to classify individuals. The hyperplane is oriented as far as possible from the two closest individuals in each class (in this example, the orange and purple individuals)
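Once a linear SVM is trained, classification reduces to asking which side of the hyperplane w·x + b = 0 a subject falls on, and the margin is the distance from the hyperplane to the closest subjects (the support vectors). A sketch with an illustrative, already-"trained" hyperplane and the figure's orange/purple class labels (the weights are invented, not fitted):

```python
import math

w = [1.0, 1.0]  # hyperplane normal vector
b = -3.0        # offset: decision boundary is x1 + x2 = 3

def classify(x):
    """Sign of the decision function picks the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "orange" if score >= 0 else "purple"

def distance_to_hyperplane(x):
    """Geometric distance; the minimum over the training set is the margin."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.hypot(*w)

side = classify([2, 2])              # above the boundary -> "orange"
margin_like = distance_to_hyperplane([2, 2])
```

Training chooses w and b to maximize that minimum distance, which is why the fitted hyperplane lies as far as possible from the closest individual in each class.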
