Review
. 2022 Sep;141(9):1515-1528.
doi: 10.1007/s00439-021-02402-z. Epub 2021 Dec 4.

What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics


Anthony M Musolf et al. Hum Genet. 2022 Sep.

Abstract

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods also evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k-nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
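The feature-importance idea the abstract describes can be sketched with permutation importance (one common model-agnostic metric): shuffle one feature's values and measure how much prediction accuracy drops. A minimal, self-contained sketch with a hypothetical toy model and made-up genotype data (all names and values here are illustrative, not from the review):

```python
import random

# Toy model: predicts case/control from two "SNP" features coded 0/1/2.
# Feature 0 is informative (>= 1 risk allele -> case); feature 1 is noise.
def model(x):
    return 1 if x[0] >= 1 else 0

def accuracy(X, y):
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)

def permutation_importance(X, y, feature, seed=0):
    """Drop in accuracy after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(X, y)
    column = [x[feature] for x in X]
    rng.shuffle(column)
    X_perm = [list(x) for x in X]
    for row, value in zip(X_perm, column):
        row[feature] = value
    return baseline - accuracy(X_perm, y)

# Hypothetical genotypes and phenotypes; labels follow feature 0 exactly.
X = [[0, 2], [1, 0], [2, 1], [0, 1], [2, 2], [1, 1]]
y = [0, 1, 1, 0, 1, 1]
imp0 = permutation_importance(X, y, 0)  # informative feature
imp1 = permutation_importance(X, y, 1)  # noise feature: shuffling changes nothing
```

Because the toy model ignores feature 1, its permutation importance is exactly zero, while shuffling feature 0 degrades accuracy; this is the sense in which importance scores point at the features responsible for a good prediction.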


Conflict of interest statement

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Figures

Fig. 1
k-nearest neighbors. A diagram showing an example of the k-nearest neighbor machine. Subjects are plotted based on feature values, and an individual’s classification is determined by a majority vote in the subject’s neighborhood (k). The choice of k is crucial to classification. For instance, if we wished to classify the green individual based on k = 4, the individual would be classified as blue. If we extended this to k = 9, the individual would be classified as red
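The k-dependence in this caption is easy to reproduce in a few lines. A minimal sketch with hypothetical 2-D feature values mirroring the figure (the coordinates and the "blue"/"red" labels are invented for illustration): a small blue cluster sits closest to the query point, while red points dominate the wider neighborhood, so the vote flips as k grows.

```python
import math
from collections import Counter

def knn_classify(point, data, k):
    """Majority vote among the k nearest labelled neighbours."""
    by_distance = sorted(data, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical subjects plotted by two feature values.
data = [((1, 0), "blue"), ((0, 1), "blue"), ((1, 1), "blue"),
        ((2, 0), "red"), ((0, 2), "red"), ((2, 2), "red"),
        ((3, 0), "red"), ((0, 3), "red"), ((3, 3), "red")]
query = (0.5, 0.5)  # the "green individual" to classify

small_k = knn_classify(query, data, 4)  # 3 blue + 1 red -> "blue"
large_k = knn_classify(query, data, 9)  # 3 blue + 6 red -> "red"
```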
Fig. 2
Classification and Regression Trees (CART) and Random Forest. a Diagram showing a single CART. CARTs take a heterogeneous group of data and repeatedly split on feature values to create more homogeneous groups. b Diagram showing a random forest. A random forest is a collection of CARTs, each running on a slightly different subset of the same data set
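Both halves of this figure can be sketched together: a single split (the simplest possible CART, a depth-1 "stump") chooses the feature and threshold that best separate the labels, and a random forest bags many such trees over bootstrap resamples and takes a majority vote. The data and every name below are illustrative, assuming genotypes coded as 0/1/2 allele counts:

```python
import random
from collections import Counter

def best_stump(X, y):
    """Fit one CART split: the (feature, threshold) pair that minimises
    misclassifications when each side predicts its majority label."""
    best, best_err = None, None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [lab for x, lab in zip(X, y) if x[f] <= t]
            right = [lab for x, lab in zip(X, y) if x[f] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            err = sum(l != lmaj for l in left) + sum(r != rmaj for r in right)
            if best is None or err < best_err:
                best, best_err = (f, t, lmaj, rmaj), err
    if best is None:  # no valid split (e.g. all rows identical in a resample)
        maj = Counter(y).most_common(1)[0][0]
        best = (0, float("inf"), maj, maj)
    return best

def stump_predict(stump, x):
    f, t, lmaj, rmaj = stump
    return lmaj if x[f] <= t else rmaj

def random_forest(X, y, n_trees=25, seed=0):
    """Bagging: each tree sees a slightly different bootstrap resample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, x):
    """Majority vote across all trees in the forest."""
    return Counter(stump_predict(s, x) for s in forest).most_common(1)[0][0]

# Toy genotype data; feature 0 drives the label.
X = [[0, 1], [1, 0], [2, 1], [0, 0], [2, 2], [1, 1]]
y = [0, 1, 1, 0, 1, 1]
forest = random_forest(X, y)
```

A full CART would recurse on each side of the split until the groups are homogeneous; the stump above keeps the sketch short while showing the split criterion and the bagging that distinguishes a forest from a single tree.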
Fig. 3
Artificial neural networks. A schematic of an artificial neural network. Data are analyzed by different models, the results of which are passed onto a new set of models. In this example, data are first analyzed in the input layer (blue). The results are then passed onto an intermediate layer, called a hidden layer (green). Finally, the results of the hidden layer are passed onto and analyzed by the models of the output layer (red)
Fig. 4
Deep learning. A schematic of a deep learning machine. Deep learning is a specialized form of artificial neural network that contains many additional hidden layers
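Structurally, the only change from the previous figure is depth: the same layer operation is applied many times before the output layer. A sketch with fixed illustrative weights (shared across layers purely to keep the example short; real networks learn distinct weights per layer):

```python
import math

def relu(z):
    return max(0.0, z)

def layer(inputs, weights, biases, act):
    """One layer: each neuron takes a weighted sum of all inputs."""
    return [act(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def deep_forward(x, n_hidden=8, width=3):
    """Many hidden layers applied in sequence, then a sigmoid output."""
    w = [[0.3] * width for _ in range(width)]  # toy shared weights
    b = [0.1] * width
    h = x
    for _ in range(n_hidden):  # the "many additional hidden layers"
        h = layer(h, w, b, relu)
    return layer(h, [[0.5] * width], [0.0],
                 lambda z: 1 / (1 + math.exp(-z)))

out = deep_forward([0.2, 0.1, 0.4])
```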
Fig. 5
Support vector machines. A diagram showing an example of a support vector machine. Subjects are plotted based on feature values, and a special boundary called the hyperplane is formed to classify individuals. The hyperplane is oriented as far as possible from the two closest individuals in each class (in this example, the orange and purple individuals)
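Once a linear SVM is trained, classification reduces to asking which side of the hyperplane w·x + b = 0 a subject falls on, and the margin is the distance from the hyperplane to the closest subjects (the support vectors). A sketch with an illustrative, already-"trained" hyperplane and the figure's orange/purple class labels (the weights are invented, not fitted):

```python
import math

w = [1.0, 1.0]  # hyperplane normal vector
b = -3.0        # offset: decision boundary is x1 + x2 = 3

def classify(x):
    """Sign of the decision function picks the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "orange" if score >= 0 else "purple"

def distance_to_hyperplane(x):
    """Geometric distance; the minimum over the training set is the margin."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.hypot(*w)

side = classify([2, 2])              # above the boundary -> "orange"
margin_like = distance_to_hyperplane([2, 2])
```

Training chooses w and b to maximize that minimum distance, which is why the fitted hyperplane lies as far as possible from the closest individual in each class.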
