Hum Genet. 2022 Sep;141(9):1499-1513. doi: 10.1007/s00439-021-02387-9. Epub 2021 Oct 20.

Interpretable machine learning for genomics


David S Watson. Hum Genet. 2022 Sep.

Abstract

High-throughput technologies such as next-generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state-of-the-art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines.


Conflict of interest statement

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Figures

Fig. 1
The classic bioinformatics workflow spans data collection, model training, and deployment. iML augments this pipeline with an extra interpretation step, which can be used during training and throughout deployment (incoming solid edges). Algorithmic explanations (outgoing dashed edges) can be used to guide new data collection, refine training, and monitor models during deployment.
Fig. 2
A saliency map visually explains a cancer diagnosis based on whole-slide pathology data. The highlighted regions on the right pick out the elements of the image that the algorithm deemed most strongly associated with malignancy. From (Zhang et al., p. 237).
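Saliency maps of this kind attribute a prediction to input features via the gradient of the model's output with respect to its inputs. The following is a minimal, model-agnostic sketch using central finite differences; the toy function `f` and its inputs are illustrative assumptions, not the pathology model from the figure:

```python
import numpy as np

def saliency(f, x, eps=1e-4):
    """Finite-difference saliency sketch: the magnitude of the model
    output's partial derivative with respect to each input feature."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = eps
        # Central difference approximates the partial derivative at x.
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return np.abs(g)

# Toy "model": linear in the first input, quadratic in the second.
f = lambda x: 3 * x[0] - x[1] ** 2
s = saliency(f, np.array([1.0, 2.0]))  # ≈ [3, 4]
```

For image data the same idea is applied pixel-wise (usually with automatic differentiation rather than finite differences), and the resulting gradient magnitudes are rendered as the highlighted overlay shown in the figure.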
Fig. 3
A complex decision boundary (the pink region against the blue background) separates red crosses from blue circles. This function cannot be well approximated by a linear model, but the boundary near the large red cross is roughly linear, as indicated by the dashed line. From (Ribeiro et al., p. 1138).
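The local-approximation idea in Fig. 3 (as in LIME) can be sketched by sampling perturbations around an instance, weighting them by proximity, and fitting a weighted linear surrogate to the black-box model. Everything below — the function `f`, the query point, the kernel width — is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def local_linear_surrogate(f, x, n=500, sigma=0.5, seed=0):
    """LIME-style sketch: explain a black-box f near the point x
    with a proximity-weighted linear fit."""
    rng = np.random.default_rng(seed)
    # Sample perturbed instances around x.
    X = x + rng.normal(scale=sigma, size=(n, x.size))
    y = f(X)
    # Gaussian proximity kernel: nearby samples count more.
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
    # Weighted least squares for a linear model with intercept.
    A = np.hstack([X, np.ones((n, 1))])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
    return coef[:-1]  # local feature weights (intercept dropped)

# A globally nonlinear model that is roughly linear near x = (2, 0):
f = lambda X: X[:, 0] ** 2 - X[:, 1]
weights = local_linear_surrogate(f, np.array([2.0, 0.0]))
```

Near (2, 0) the surrogate's weights approximate the local gradient of `f`, roughly (4, -1): a faithful local explanation even though no single linear model fits `f` globally, which is exactly the situation the figure depicts.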
Fig. 4
Shapley values show that high white blood cell counts increase the negative risk conferred by high blood urea nitrogen for progression to end-stage renal disease (ESRD). From (Lundberg et al., p. 61).
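Shapley values like those in Fig. 4 average a feature's marginal contribution to the prediction over all possible orderings of the features. An exact brute-force sketch for a tiny payoff function follows; the feature names and contribution sizes are made up for illustration (practical tools such as SHAP use fast approximations rather than this exponential enumeration):

```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley values for a set-valued payoff function.

    `value` maps a frozenset of features to the model's payoff
    (e.g. predicted risk) when only those features are present.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

# Toy additive payoff with hypothetical clinical features.
contrib = {"wbc": 0.3, "bun": 0.5, "age": 0.2}
v = lambda s: sum(contrib[f] for f in s)
phi = shapley_values(v, list(contrib))  # additive game: phi == contrib
```

A useful sanity check is the efficiency property: the values sum to `v(all features) - v(empty set)`, so every explanation fully accounts for the prediction it decomposes.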
Fig. 5
Example rule lists for antimicrobial resistance (AMR) prediction from genotype data. Each rule detects the presence/absence of a k-mer and is colored according to the genomic locus at which it was found. From (Drouin et al., p. 4).
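A rule list like those in Fig. 5 is an ordered sequence of if-then clauses: the first rule whose condition fires determines the prediction, and a default label applies when none do. A minimal sketch of the evaluation logic follows; the k-mers and labels are hypothetical, not taken from Drouin et al.:

```python
def predict_rule_list(rules, default, kmers):
    """Apply an ordered rule list to a genome's k-mer set.

    Each rule is (kmer, present, label): it fires when the k-mer's
    presence in `kmers` matches `present`. The first firing rule
    decides; otherwise return the default label.
    """
    for kmer, present, label in rules:
        if (kmer in kmers) == present:
            return label
    return default

# Hypothetical resistance rules, for illustration only.
rules = [
    ("ATGCCGTA", True, "resistant"),      # k-mer present => resistant
    ("GGTACCAT", False, "susceptible"),   # k-mer absent  => susceptible
]
label = predict_rule_list(rules, "susceptible", {"ATGCCGTA"})  # → resistant
```

Because each prediction traces to a single short rule, the model is interpretable by construction — the explanation is the model itself, in contrast to the post hoc methods of Figs. 2-4.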
