Review
2021 Jul 31;19:4345–4359. doi: 10.1016/j.csbj.2021.07.021. eCollection 2021.

A primer on machine learning techniques for genomic applications


Alfonso Monaco et al. Comput Struct Biotechnol J.

Abstract

High-throughput sequencing technologies have enabled the study of complex biological questions at single-nucleotide resolution, opening the big-data era. The analysis of large volumes of heterogeneous "omic" data, however, requires novel and efficient computational algorithms based on the paradigm of Artificial Intelligence. In the present review, we introduce and describe the most common machine learning methodologies, and more recently deep learning, applied to a variety of genomics tasks, emphasizing their capabilities, strengths and limitations in simple and intuitive language. We highlight the power of the machine learning approach in handling big data by means of a real-life example, and underline how the described methods are relevant whenever large amounts of multimodal genomic data are available.

Keywords: Deep learning; Genomics; Machine learning.


Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1
Supervised versus unsupervised learning – a pictorial representation. Supervised learning involves a training phase in which a labeled dataset is used to train a model that can subsequently recognize unseen data. Unsupervised learning identifies latent factors in unlabeled data and groups observations by similarity.
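The contrast in the caption can be made concrete in a few lines of NumPy. The sketch below (not the authors' Github code; all data and values are toy assumptions) trains a nearest-centroid classifier on labeled points, then runs k-means on the same points with the labels withheld:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian clouds in a 2-D feature space (hypothetical toy data).
a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
b = rng.normal(loc=[4, 4], scale=0.5, size=(50, 2))
X = np.vstack([a, b])
y = np.array([0] * 50 + [1] * 50)   # labels: seen only by the supervised learner

# --- Supervised: learn one centroid per class from the labeled training set ---
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])

def predict(points):
    """Assign each point to the nearest learned class centroid."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

unseen = np.array([[0.2, -0.1], [3.8, 4.3]])
print(predict(unseen))              # the trained model recognizes unseen data

# --- Unsupervised: k-means sees X only, never y ---
centers = np.vstack([X[0], X[50]])  # deterministic initialization for the sketch
for _ in range(10):                 # Lloyd's iterations
    assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
print(np.bincount(assign))          # two groups recovered from similarity alone
```

Note that the unsupervised branch recovers the two groups without ever touching `y`; it only knows that nearby points belong together.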
Fig. 2
Examples of under-, appropriate and over-fitting. The input dataset consists of two classes (blue and red points) distributed on an inner and outer circle, respectively (with Gaussian noise), in a two-dimensional feature space. Three different models are used to fit the input training set, or in other words, to find the boundary (in black) that best separates the two classes under the model hypotheses. The training set is represented by circles, the test set by triangles. The chosen methods, Logistic Regression, Support Vector Machine with Gaussian kernel, and K-NN, lead to typical under-, appropriate, and over-fitting scenarios, respectively. The code used to generate this figure is available on Github.
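The over-fitting end of this spectrum is easy to reproduce numerically. As a minimal sketch (toy data, not the figure's Github code), the snippet below builds the two noisy circles and runs a NumPy k-NN classifier: with k=1 the model memorizes the training set (training accuracy is always 1.0) while a larger k smooths the boundary:

```python
import numpy as np

rng = np.random.default_rng(1)

def circles(n, r, noise):
    """n points on a circle of radius r, perturbed by Gaussian noise."""
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[r * np.cos(t), r * np.sin(t)] + rng.normal(0, noise, (n, 2))

# Inner (class 0) and outer (class 1) circles, as in the figure.
X_train = np.vstack([circles(60, 1.0, 0.25), circles(60, 2.0, 0.25)])
y_train = np.array([0] * 60 + [1] * 60)
X_test = np.vstack([circles(40, 1.0, 0.25), circles(40, 2.0, 0.25)])
y_test = np.array([0] * 40 + [1] * 40)

def knn_predict(X, k):
    """Majority vote among the k nearest training points."""
    d = np.linalg.norm(X[:, None] - X_train[None], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

for k in (1, 15):
    tr = (knn_predict(X_train, k) == y_train).mean()
    te = (knn_predict(X_test, k) == y_test).mean()
    print(f"k={k:2d}  train acc={tr:.2f}  test acc={te:.2f}")
# k=1 attains perfect training accuracy by construction (each point is its
# own nearest neighbor): the classic over-fitting signature.
```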
Fig. 3
Comparison of different classifiers on a two-class (red and blue) classification problem with two features. (top) Linearly separable synthetic dataset (with noise). (bottom) Circularly separable synthetic dataset (with noise). Circles represent the training set, triangles the test set. The code used to generate this figure is available on Github (code adapted from [34]).
Fig. 4
SVM optimal boundary for linearly separable data: SVM uses margins, whose distance from the decision boundary can be tuned by changing the amount of penalization. Only observations inside the margin contribute to the loss, and therefore to the definition of the optimal hyperplane.
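The claim that only observations inside the margin contribute to the loss follows directly from the hinge loss used by soft-margin SVMs. A short sketch (the hyperplane weights below are assumed for illustration, not fitted by an SVM solver):

```python
import numpy as np

# Fixed separating hyperplane w·x + b = 0 (assumed toy weights).
w, b = np.array([1.0, 1.0]), -3.0

X = np.array([[0.0, 1.0],   # far on the negative side
              [2.5, 1.0],   # inside the margin
              [1.0, 3.0],   # exactly on the positive margin
              [1.5, 1.2]])  # inside the margin
y = np.array([-1, -1, 1, 1])

# Hinge loss: max(0, 1 - y (w·x + b)). Points with functional margin
# y (w·x + b) >= 1 lie on or outside the margin and contribute exactly
# zero, so only margin violators shape the optimal hyperplane.
margins = y * (X @ w + b)
loss = np.maximum(0.0, 1.0 - margins)
print(margins)   # the two "far" points have margin >= 1
print(loss)      # ...and therefore zero loss
```

The penalization strength mentioned in the caption (the usual `C` hyperparameter) scales how heavily these nonzero terms are weighted against the margin width.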
Fig. 5
Schematic view of an artificial neuron: x1,…, xM are the input nodes, y is the output node, F is the activation function.
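The neuron in the schematic computes y = F(w1·x1 + … + wM·xM + b). A minimal sketch, with toy input and weight values chosen only for illustration:

```python
import numpy as np

def neuron(x, w, b, F):
    """y = F(w·x + b): weighted sum of the inputs passed through activation F."""
    return F(np.dot(w, x) + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # input nodes x1..xM (toy values)
w = np.array([0.8, 0.2, -0.5])   # weights (assumed, normally learned by training)
b = 0.8                          # bias term

y = neuron(x, w, b, sigmoid)
print(y)   # here w·x + b = 0 by construction, so y is ~0.5
```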
Fig. 6
Comparison of different activation functions. The binary step or 0–1 activation function (AF) activates the neuron when the input is greater than a threshold. While useful in binary classification tasks, it is of limited use in multiclass classification problems. The sigmoid and tanh AFs are similar in shape to the 0–1 AF but are continuously differentiable, which is essential for gradient-based optimisation. The ReLU AF is computationally more efficient than the sigmoid and tanh functions because it only activates neurons with positive input values – a subset of the nodes in the neural network (in the same way as the SVM algorithm is efficient because only support vectors contribute to the loss function). A differentiable version of the ReLU activation function (the softplus) exists that is smooth at 0.
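The functions compared in the figure are one-liners; the sketch below evaluates each on a few sample inputs (a toy illustration, not the figure's code):

```python
import numpy as np

def step(z):     return (z > 0).astype(float)      # binary 0-1 activation
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))   # smooth, range (0, 1)
def tanh(z):     return np.tanh(z)                 # smooth, range (-1, 1)
def relu(z):     return np.maximum(0.0, z)         # zero for negative input
def softplus(z): return np.log1p(np.exp(z))        # differentiable version of ReLU

z = np.array([-2.0, 0.0, 2.0])
for f in (step, sigmoid, tanh, relu, softplus):
    print(f.__name__, f(z))
# relu deactivates negative inputs entirely; softplus is smooth at 0
# (softplus(0) = ln 2) while tracking ReLU for large |z|.
```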
Fig. 7
Random Forest architecture for classification and regression problems.
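The architecture in the figure, many trees fit on bootstrap resamples, combined by majority vote (classification) or averaging (regression), can be sketched from scratch. The toy version below uses depth-1 decision stumps instead of full trees, and hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Best depth-1 tree: the (feature, threshold) split with fewest errors."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            pred_l = int(round(left.mean())) if left.size else 0
            pred_r = int(round(right.mean())) if right.size else 0
            err = (left != pred_l).sum() + (right != pred_r).sum()
            if best is None or err < best[0]:
                best = (err, j, t, pred_l, pred_r)
    return best[1:]

def stump_predict(stump, X):
    j, t, pl, pr = stump
    return np.where(X[:, j] <= t, pl, pr)

def forest_fit(X, y, n_trees=25):
    """Bagging: each stump is trained on a bootstrap resample of (X, y)."""
    boot = lambda: rng.integers(0, len(X), len(X))
    return [fit_stump(X[i], y[i]) for i in (boot() for _ in range(n_trees))]

def forest_predict(forest, X):
    """Majority vote across trees (a regression forest would average instead)."""
    votes = np.mean([stump_predict(s, X) for s in forest], axis=0)
    return (votes > 0.5).astype(int)

# Toy two-class data, separable on feature 0 (hypothetical values).
X = np.array([[0.1, 5.0], [0.3, 1.0], [0.2, 3.0],
              [0.9, 4.0], [0.8, 0.0], [0.7, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
forest = forest_fit(X, y)

X_new = np.array([[0.0, 9.0], [1.0, 9.0]])   # unseen points, far from the boundary
print(forest_predict(forest, X_new))
```

A production forest (e.g. the CART trees used in typical implementations) would also subsample features at each split and grow deeper trees, but the bagging-plus-voting structure shown is the same.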

References

    1. McCarthy J. What is Artificial Intelligence? Basic questions. http://www-formal.stanford.edu/jmc/whatisai.html.
    2. Reuter J.A., Spacek D.V., Snyder M.P. High-throughput sequencing technologies. Mol Cell. 2015;58(4):586–597. doi: 10.1016/j.molcel.2015.05.004.
    3. Horner D.S., Pavesi G., Castrignanò T., De Meo P.D., Liuni S., Sammeth M., Picardi E., Pesole G. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinf. 2010;11(2):181–197. doi: 10.1093/bib/bbp046. Epub 2009 Oct 27.
    4. Mardis E.R. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
    5. Tattini L., D'Aurizio R., Magi A. Detection of genomic structural variants from next-generation sequencing data. Front Bioeng Biotechnol. 2015;3:92. doi: 10.3389/fbioe.2015.00092.
