Review
2021 Jul 31;19:4345–4359. doi: 10.1016/j.csbj.2021.07.021. eCollection 2021.

A primer on machine learning techniques for genomic applications


Alfonso Monaco et al. Comput Struct Biotechnol J.

Abstract

High-throughput sequencing technologies have enabled the study of complex biological questions at single-nucleotide resolution, opening the big-data era. The analysis of large volumes of heterogeneous "omic" data, however, requires novel and efficient computational algorithms based on the paradigm of Artificial Intelligence. In the present review, we introduce and describe the most common machine learning methodologies, and more recently deep learning, applied to a variety of genomics tasks, emphasizing their capabilities, strengths and limitations in simple and intuitive language. We highlight the power of the machine learning approach in handling big data by means of a real-life example, and underline how the described methods are relevant whenever large amounts of multimodal genomic data are available.

Keywords: Deep learning; Genomics; Machine learning.


Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1
Supervised versus unsupervised learning – a pictorial representation. Supervised learning involves a training phase in which a labeled dataset is used to train a model that can subsequently recognize unseen data. Unsupervised learning identifies latent factors in unlabeled data and groups observations by similarity.
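The contrast in the caption can be made concrete in a few lines of NumPy. The sketch below (not the authors' Github code; all data and values are toy assumptions) trains a nearest-centroid classifier on labeled points, then runs k-means on the same points with the labels withheld:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian clouds in a 2-D feature space (hypothetical toy data).
a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
b = rng.normal(loc=[4, 4], scale=0.5, size=(50, 2))
X = np.vstack([a, b])
y = np.array([0] * 50 + [1] * 50)   # labels: seen only by the supervised learner

# --- Supervised: learn one centroid per class from the labeled training set ---
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])

def predict(points):
    """Assign each point to the nearest learned class centroid."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

unseen = np.array([[0.2, -0.1], [3.8, 4.3]])
print(predict(unseen))              # the trained model recognizes unseen data

# --- Unsupervised: k-means sees X only, never y ---
centers = np.vstack([X[0], X[50]])  # deterministic initialization for the sketch
for _ in range(10):                 # Lloyd's iterations
    assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
print(np.bincount(assign))          # two groups recovered from similarity alone
```

Note that the unsupervised branch recovers the two groups without ever touching `y`; it only knows that nearby points belong together.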
Fig. 2
Examples of under-, appropriate and over-fitting. The input dataset consists of two classes (blue and red points) distributed on an inner and outer circle, respectively (with Gaussian noise), in a two-dimensional feature space. Three different models are used to fit the input training set, or in other words, to find the boundary (in black) that best separates the two classes under the model hypotheses. The training set is represented by circles, the test set by triangles. The chosen methods, Logistic Regression, Support Vector Machine with Gaussian kernel, and K-NN, lead to typical under-, appropriate, and over-fitting scenarios, respectively. The code used to generate this figure is available on Github.
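The over-fitting end of this spectrum is easy to reproduce numerically. As a minimal sketch (toy data, not the figure's Github code), the snippet below builds the two noisy circles and runs a NumPy k-NN classifier: with k=1 the model memorizes the training set (training accuracy is always 1.0) while a larger k smooths the boundary:

```python
import numpy as np

rng = np.random.default_rng(1)

def circles(n, r, noise):
    """n points on a circle of radius r, perturbed by Gaussian noise."""
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[r * np.cos(t), r * np.sin(t)] + rng.normal(0, noise, (n, 2))

# Inner (class 0) and outer (class 1) circles, as in the figure.
X_train = np.vstack([circles(60, 1.0, 0.25), circles(60, 2.0, 0.25)])
y_train = np.array([0] * 60 + [1] * 60)
X_test = np.vstack([circles(40, 1.0, 0.25), circles(40, 2.0, 0.25)])
y_test = np.array([0] * 40 + [1] * 40)

def knn_predict(X, k):
    """Majority vote among the k nearest training points."""
    d = np.linalg.norm(X[:, None] - X_train[None], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

for k in (1, 15):
    tr = (knn_predict(X_train, k) == y_train).mean()
    te = (knn_predict(X_test, k) == y_test).mean()
    print(f"k={k:2d}  train acc={tr:.2f}  test acc={te:.2f}")
# k=1 attains perfect training accuracy by construction (each point is its
# own nearest neighbor): the classic over-fitting signature.
```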
Fig. 3
Comparison of different classifiers on a two-class (red and blue) classification problem with two features. (top) Linearly separable synthetic dataset (with noise). (bottom) Circularly separable synthetic dataset (with noise). Circles represent the training set, triangles the test set. The code used to generate this figure is available on Github (code adapted from [34]).
Fig. 4
SVM optimal boundary for linearly separable data: SVM uses margins, whose distance from the decision boundary can be tuned by changing the amount of penalization. Only observations inside the margin contribute to the loss, and therefore to the definition of the optimal hyperplane.
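The claim that only observations inside the margin contribute to the loss follows directly from the hinge loss used by soft-margin SVMs. A short sketch (the hyperplane weights below are assumed for illustration, not fitted by an SVM solver):

```python
import numpy as np

# Fixed separating hyperplane w·x + b = 0 (assumed toy weights).
w, b = np.array([1.0, 1.0]), -3.0

X = np.array([[0.0, 1.0],   # far on the negative side
              [2.5, 1.0],   # inside the margin
              [1.0, 3.0],   # exactly on the positive margin
              [1.5, 1.2]])  # inside the margin
y = np.array([-1, -1, 1, 1])

# Hinge loss: max(0, 1 - y (w·x + b)). Points with functional margin
# y (w·x + b) >= 1 lie on or outside the margin and contribute exactly
# zero, so only margin violators shape the optimal hyperplane.
margins = y * (X @ w + b)
loss = np.maximum(0.0, 1.0 - margins)
print(margins)   # the two "far" points have margin >= 1
print(loss)      # ...and therefore zero loss
```

The penalization strength mentioned in the caption (the usual `C` hyperparameter) scales how heavily these nonzero terms are weighted against the margin width.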
Fig. 5
Schematic view of an artificial neuron: x1,…, xM are the input nodes, y is the output node, F is the activation function.
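The neuron in the schematic computes y = F(w1·x1 + … + wM·xM + b). A minimal sketch, with toy input and weight values chosen only for illustration:

```python
import numpy as np

def neuron(x, w, b, F):
    """y = F(w·x + b): weighted sum of the inputs passed through activation F."""
    return F(np.dot(w, x) + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # input nodes x1..xM (toy values)
w = np.array([0.8, 0.2, -0.5])   # weights (assumed, normally learned by training)
b = 0.8                          # bias term

y = neuron(x, w, b, sigmoid)
print(y)   # here w·x + b = 0 by construction, so y is ~0.5
```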
Fig. 6
Comparison of different activation functions. The binary step or 0–1 activation function (AF) activates the neuron when the input is greater than a threshold. While useful in binary classification tasks, it is of limited use in multiclass classification problems. The sigmoid and tanh AFs are similar in shape to the 0–1 AF but are continuously differentiable, which is essential for gradient-based optimisation. The ReLU AF is computationally more efficient than the sigmoid and tanh functions because it only activates neurons with positive input values – a subset of the nodes in the neural network (in the same way as the SVM algorithm is efficient because only support vectors contribute to the loss function). A differentiable version of the ReLU activation function (the softplus) exists that is smooth at 0.
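The functions compared in the figure are one-liners; the sketch below evaluates each on a few sample inputs (a toy illustration, not the figure's code):

```python
import numpy as np

def step(z):     return (z > 0).astype(float)      # binary 0-1 activation
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))   # smooth, range (0, 1)
def tanh(z):     return np.tanh(z)                 # smooth, range (-1, 1)
def relu(z):     return np.maximum(0.0, z)         # zero for negative input
def softplus(z): return np.log1p(np.exp(z))        # differentiable version of ReLU

z = np.array([-2.0, 0.0, 2.0])
for f in (step, sigmoid, tanh, relu, softplus):
    print(f.__name__, f(z))
# relu deactivates negative inputs entirely; softplus is smooth at 0
# (softplus(0) = ln 2) while tracking ReLU for large |z|.
```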
Fig. 7
Random Forest architecture for classification and regression problems.
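The architecture in the figure, many trees fit on bootstrap resamples, combined by majority vote (classification) or averaging (regression), can be sketched from scratch. The toy version below uses depth-1 decision stumps instead of full trees, and hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Best depth-1 tree: the (feature, threshold) split with fewest errors."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            pred_l = int(round(left.mean())) if left.size else 0
            pred_r = int(round(right.mean())) if right.size else 0
            err = (left != pred_l).sum() + (right != pred_r).sum()
            if best is None or err < best[0]:
                best = (err, j, t, pred_l, pred_r)
    return best[1:]

def stump_predict(stump, X):
    j, t, pl, pr = stump
    return np.where(X[:, j] <= t, pl, pr)

def forest_fit(X, y, n_trees=25):
    """Bagging: each stump is trained on a bootstrap resample of (X, y)."""
    boot = lambda: rng.integers(0, len(X), len(X))
    return [fit_stump(X[i], y[i]) for i in (boot() for _ in range(n_trees))]

def forest_predict(forest, X):
    """Majority vote across trees (a regression forest would average instead)."""
    votes = np.mean([stump_predict(s, X) for s in forest], axis=0)
    return (votes > 0.5).astype(int)

# Toy two-class data, separable on feature 0 (hypothetical values).
X = np.array([[0.1, 5.0], [0.3, 1.0], [0.2, 3.0],
              [0.9, 4.0], [0.8, 0.0], [0.7, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
forest = forest_fit(X, y)

X_new = np.array([[0.0, 9.0], [1.0, 9.0]])   # unseen points, far from the boundary
print(forest_predict(forest, X_new))
```

A production forest (e.g. the CART trees used in typical implementations) would also subsample features at each split and grow deeper trees, but the bagging-plus-voting structure shown is the same.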

References

    1. McCarthy J. What is Artificial Intelligence? Basic questions. http://www-formal.stanford.edu/jmc/whatisai.html.
    2. Reuter J.A., Spacek D.V., Snyder M.P. High-throughput sequencing technologies. Mol Cell. 2015;58(4):586–597. doi: 10.1016/j.molcel.2015.05.004.
    3. Horner D.S., Pavesi G., Castrignanò T., De Meo P.D., Liuni S., Sammeth M., Picardi E., Pesole G. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinf. 2010;11(2):181–197. doi: 10.1093/bib/bbp046. Epub 2009 Oct 27.
    4. Mardis E.R. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
    5. Tattini L., D'Aurizio R., Magi A. Detection of genomic structural variants from next-generation sequencing data. Front Bioeng Biotechnol. 2015;3:92. doi: 10.3389/fbioe.2015.00092.
