2021 Dec 16:8:771607.
doi: 10.3389/fmed.2021.771607. eCollection 2021.

An Introduction to Machine Learning Approaches for Biomedical Research

Juan Jovel et al. Front Med (Lausanne).

Abstract

Machine learning (ML) approaches are a collection of algorithms that attempt to extract patterns from data and to associate such patterns with discrete classes of samples in the data. For example, given a series of features describing persons, an ML model predicts whether a person is diseased or healthy; given features of animals, it predicts whether an animal is treated or control; or it predicts whether molecules have the potential to interact. ML approaches can also find such patterns in an agnostic manner, i.e., without having information about the classes. These two families of methods are referred to as supervised and unsupervised ML, respectively. A third type of ML is reinforcement learning, which attempts to find a sequence of actions that contribute to achieving a specific goal. All of these methods are becoming increasingly popular in biomedical research in quite diverse areas, including drug design, stratification of patients, medical image analysis, molecular interactions, prediction of therapy outcomes and many more. We describe several supervised and unsupervised ML techniques, and illustrate a series of prototypical examples using state-of-the-art computational approaches. Given the complexity of reinforcement learning, it is not discussed in detail here; instead, interested readers are referred to excellent reviews on that topic. We focus on concepts rather than procedures, as our goal is to attract the attention of researchers in biomedicine toward the plethora of powerful ML methods and their potential to leverage basic and applied research programs.

Keywords: biomedical research; machine learning; reinforcement learning; supervised learning; unsupervised learning.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Representative machine learning algorithms. Machine learning is a subfield of artificial intelligence and can be divided into supervised, unsupervised and reinforcement learning. The list of algorithms in each subfield is not exhaustive; it includes only the most popular algorithms in each subfield. k-NN, k nearest neighbors; PCA, Principal components analysis; NMF, Non-negative matrix factorization; t-SNE, T-distributed stochastic neighbor embedding; DQNs, Deep Q networks; SARSA, State-action-reward-state-action; DDPG, Deep deterministic policy gradient.
Figure 2
Illustration of supervised learning algorithms. (A) Relationship between number of neighbors (k) and accuracy in the k-NN algorithm when applied to the hepatitis dataset. (B) Feature importance when the random forest algorithm was applied to the hepatitis dataset. (C) Three-dimensional scatter plot of values of albumin, bilirubin and protime in patients included in the hepatitis dataset. (D) Decision surface of the logistic regression model applied to the hepatitis dataset, illustrated in a two-dimensional plot including only albumin and bilirubin. (E) Comparison of the theoretical probability distribution of a logistic regression model with the probability distribution of survival of patients in the hepatitis dataset when only albumin is considered as a regressor. (F) Lollipop plot of accuracy achieved during classification of survival in the hepatitis dataset. k-NN, k nearest neighbors; SVC, Support vector classifier; LogReg, Logistic regression (R squared); SGDC, Stochastic gradient descent classifier; DTC, Decision tree classifier; RFC, Random forest classifier; GBC, Gradient boosting classifier; MLPC, Multilayer perceptron classifier.
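The relationship between k and accuracy in panel (A) can be sketched as follows. This is only an illustration under stated assumptions: the data are a small synthetic two-feature set (not the hepatitis dataset), the class names and the helper functions `knn_predict` and `accuracy` are hypothetical, and one training label is deliberately noisy so that the choice of k matters.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x and keep the k closest.
    neighbors = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(knn_predict(train_X, train_y, x, k) == y
                  for x, y in zip(test_X, test_y))
    return correct / len(test_y)

# Synthetic toy data (an assumption for illustration, NOT the hepatitis dataset).
# The fourth training point carries a deliberately noisy label.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.0),
           (3.0, 3.0), (3.2, 2.9), (2.9, 3.3)]
train_y = ["class_a", "class_a", "class_a", "class_b",
           "class_b", "class_b", "class_b"]
test_X = [(1.1, 0.95), (0.9, 1.0), (3.1, 3.1), (2.8, 3.0)]
test_y = ["class_a", "class_a", "class_b", "class_b"]

# Accuracy as a function of the number of neighbors k (odd values avoid ties).
accuracy_by_k = {k: accuracy(train_X, train_y, test_X, test_y, k)
                 for k in (1, 3, 5, 7)}
print(accuracy_by_k)
```

On this toy set, k = 1 is misled by the noisy label, intermediate values of k recover it, and k = 7 (all training points) always predicts the majority class, which is the kind of trade-off a plot of accuracy against k is meant to expose.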
Figure 3
Illustration of artificial neural networks (ANNs). (A) Perceptron, a neural network with a single neuron. ∑ represents the weighted sum, f represents the activation function, and b represents the bias term. (B) Deep neural network with three hidden layers. Each interconnected node represents a neuron. (C) Neural network with a single hidden layer. (D) Heat map representation of weights in the multilayer perceptron model applied to the hepatitis dataset.
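The single neuron in panel (A) computes f(∑ w_i x_i + b). A minimal sketch of that forward pass is shown below; the sigmoid activation, the function name `perceptron`, and the input, weight and bias values are assumptions chosen for illustration and are not taken from the article.

```python
import math

def sigmoid(z):
    """Logistic activation, one possible choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias, activation=sigmoid):
    """Single neuron: activation applied to the weighted sum of inputs plus bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))  # the ∑ term
    return activation(weighted_sum + bias)                      # f(∑ + b)

# Hypothetical values, chosen only to make the arithmetic easy to follow:
# 0.4*0.5 + 0.3*(-1.0) + (-0.2)*2.0 = -0.5, and -0.5 + 0.1 = -0.4
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

output = perceptron(x, w, b)
step_output = perceptron(x, w, b, activation=lambda z: 1 if z >= 0 else 0)
print(round(output, 4), step_output)
```

Swapping the `activation` argument (here a sigmoid versus a step function) changes only f; the weighted sum and bias are computed the same way.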
Figure 4
Application of a convolutional neural network (CNN) to classify cancer tissue. (A) CNNs are emulations of sets of neurons that detect individual overlapping visual sections called receptive fields. Such neurons detect simple features of the object, such as lines and arcs. Deeper neuronal layers detect more complex shapes derived from the initial elements and progressively the whole object is resolved. (B) Representative patches from the invasive ductal carcinoma (IDC) tissue sections described in Cruz-Roa et al. (46) and classified as non-cancerous tissue by a pathologist. (C) Representative patches from the invasive ductal carcinoma (IDC) tissue sections classified as cancerous tissue by a pathologist. In (B,C), darker regions correspond to nuclei stained with hematoxylin, which appear denser in (C) probably due to increased cell proliferation in cancerous tissues. (D) Digital reconstruction of tissue sections from individual patches in six different patients. Red regions represent non-cancerous tissue, while blue regions represent cancerous tissue. (E) Accuracy for training and test data sets obtained when a CNN was applied to the IDC dataset.
Figure 5
Clustering of microglia single cell transcriptomes using tSNE, PCA or MDS. (A) Determination of optimal number of clusters through K-means clustering. Validation of cluster number (n) with the Silhouette method when two (B) or three (C) clusters were considered. Transcriptome samples clustered with tSNE (n = 2) and colored with a single color (D) or with two colors (E). Such clusters correspond to naive and lesion cells. (F) Transcriptome samples clustered with tSNE (n = 3) and colored with three colors. Such clusters correspond to two naive groups and one lesion group of cells. (G) Representative differentially expressed genes between naive and lesion cells (see E), in agreement with (8). Transcriptome samples clustered with PCA (H) or with MDS (I).
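The Silhouette validation mentioned for panels (B,C) compares, for each sample, its mean distance to its own cluster with its mean distance to the nearest other cluster. The sketch below implements the standard silhouette coefficient on synthetic points; the data, the hand-assigned cluster labels (standing in for K-means output) and the function name `silhouette_score` are assumptions for illustration, not the microglia transcriptomes or the authors' code.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Synthetic points forming two well-separated groups (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels_two = [0, 0, 0, 1, 1, 1]    # hypothetical 2-cluster assignment
labels_three = [0, 0, 2, 1, 1, 1]  # hypothetical 3-cluster assignment

score_two = silhouette_score(points, labels_two)
score_three = silhouette_score(points, labels_three)
print(score_two, score_three)
```

A higher mean silhouette indicates a better-supported number of clusters; on these toy points the two-cluster assignment scores higher than the three-cluster one, as expected for data with two obvious groups.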
Figure 6
Clustering of microglia single cell transcriptomes using autoencoders. (A) Cartoon depicting an autoencoder neural network. When aiming at discriminating between two clusters, we used Tanh (B,C) and GELU (D) as activation functions for hidden layers, and either Poisson NLL (pnl) (B) or Kullback-Leibler divergence (kv_div) (C,D) loss functions. When aiming at discriminating between three clusters, either ReLU (E) or sigmoid (F,G) was used as the activation function. Loss functions are also indicated (E–G).

References

    1. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. New York, NY: Springer Science & Business Media (2009).
    2. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. (2015) 349:255–60. doi: 10.1126/science.aaa8415
    3. Müller AC, Guido S. Introduction to Machine Learning with Python: A Guide for Data Scientists. Sebastopol, CA: O'Reilly Media, Inc. (2016).
    4. Ayodele TO. Types of machine learning algorithms. New Adv Mach Learn. (2010) 3:19–48. doi: 10.5772/9385
    5. Berry MW, Mohamed A, Yap BW. Supervised and Unsupervised Learning for Data Science. Cham: Springer Nature (2019). doi: 10.1007/978-3-030-22475-2
