Review. 2015 Jun;16(6):321-32. doi: 10.1038/nrg3920. Epub 2015 May 7.

Machine learning applications in genetics and genomics

Maxwell W Libbrecht et al. Nat Rev Genet. 2015 Jun.

Abstract

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.


Figures

Figure 1. Machine learning
A canonical example of a machine learning application. A training set of DNA sequences is provided as input to a learning procedure, along with binary labels indicating whether or not each sequence is centered on a transcription start site (TSS). The learning algorithm produces a model that can subsequently be used, in conjunction with a prediction algorithm, to assign predicted labels to unlabeled test sequences.
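To make the train/predict pipeline of Figure 1 concrete, the following sketch trains a deliberately minimal classifier. The sequences, labels, and the single GC-content feature are all invented for illustration; real TSS predictors use far richer features and models.

```python
# Minimal sketch of the Fig. 1 pipeline (hypothetical data; GC content is a
# stand-in for the much richer features real TSS predictors use).

def gc_content(seq):
    """Fraction of G/C bases: the single feature used by this toy model."""
    return sum(base in "GC" for base in seq) / len(seq)

def train(sequences, labels):
    """Learning procedure: record one mean GC content per class label."""
    model = {}
    for cls in set(labels):
        feats = [gc_content(s) for s, l in zip(sequences, labels) if l == cls]
        model[cls] = sum(feats) / len(feats)
    return model

def predict(model, seq):
    """Prediction algorithm: assign the class with the nearest centroid."""
    f = gc_content(seq)
    return min(model, key=lambda cls: abs(model[cls] - f))

# Labeled training set, as in the left half of Fig. 1.
train_seqs = ["GCGCGCGC", "GGCCGGCC", "ATATATAT", "AATTAATT"]
train_labels = ["TSS", "TSS", "non-TSS", "non-TSS"]
model = train(train_seqs, train_labels)

# Unlabeled test sequence, as in the right half of Fig. 1.
print(predict(model, "GCGCATGC"))
```

The "model" here is just two class centroids, but the separation into a learning procedure (which sees labels) and a prediction algorithm (which does not) mirrors the figure.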
Figure 2. A gene-finding model
A simplified gene-finding model that captures the basic properties of a protein-coding gene. The model takes as input the DNA sequence of a chromosome, or a portion thereof, and produces detailed gene annotations as output. Note that this simplified model cannot identify overlapping genes or multiple isoforms of the same gene.
Figure 3. Two models of TF binding
(A) A position-specific frequency matrix (PSFM) model, in which the entry in row i and column j gives the frequency of the ith base at position j in the training set. (B) A linear support vector machine (SVM) model of TF binding. Labeled positive and negative training examples are provided as input, and a learning procedure adjusts the weights on the edges to predict the given labels. (C) Mean accuracy (± 95% confidence intervals) of predicting TF binding on a set of 500 simulated test sets, as a function of the number of training examples; the two series correspond to the PSFM model and the SVM. (D) Schematic of generative versus discriminative modeling: the generative model characterizes both classes completely, whereas the discriminative model focuses on the boundary between the classes.
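A PSFM like the one in panel (A) is typically used to score candidate binding sites by comparing their likelihood under the matrix to a background model. The matrix entries below are invented for illustration; only the scoring arithmetic reflects the standard log-odds formulation.

```python
import math

# Hypothetical 4-column PSFM over A/C/G/T (frequencies are illustrative only;
# each column sums to 1, as in Fig. 3A).
psfm = {
    "A": [0.80, 0.10, 0.10, 0.70],
    "C": [0.10, 0.10, 0.70, 0.10],
    "G": [0.05, 0.70, 0.10, 0.10],
    "T": [0.05, 0.10, 0.10, 0.10],
}

def log_odds(seq, matrix, background=0.25):
    """Log-likelihood ratio of a candidate site versus a uniform background."""
    return sum(math.log(matrix[base][j] / background)
               for j, base in enumerate(seq))

# A sequence matching the preferred bases scores positively; a poor match
# scores negatively.
print(log_odds("AGCA", psfm), log_odds("TTTT", psfm))
```

This generative scoring rule uses only the per-class frequencies, in contrast to the SVM of panel (B), which learns weights directly from the boundary between positive and negative examples.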
Figure 4. Incorporating a prior into a PSFM
A simple, principled way to place a probabilistic prior on a PSFM is to augment the observed nucleotide counts with "pseudocounts" and then compute frequencies with respect to the augmented total. The magnitude of the pseudocount corresponds to the weight assigned to the prior.
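The pseudocount scheme of Figure 4 amounts to a few lines of arithmetic. The observed counts and the pseudocount weight below are made up; the point is that every estimated frequency stays strictly positive and the column still sums to one.

```python
# Pseudocount prior for one PSFM column (Fig. 4). Counts are hypothetical:
# 10 training sites observed at this position.
observed = {"A": 6, "C": 1, "G": 2, "T": 1}
pseudocount = 1.0  # weight of the uniform prior; larger = stronger prior

# Augment each count, then normalise by the augmented total.
total = sum(observed.values()) + 4 * pseudocount
frequencies = {base: (count + pseudocount) / total
               for base, count in observed.items()}

print(frequencies)
```

With a pseudocount of 1, a base never observed in training would still receive a non-zero frequency, so a single unseen base cannot drive a site's likelihood to zero.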
Figure 5. Three ways to accommodate heterogeneous data in machine learning
The task of predicting gene function labels requires methods that take as input gene expression data, protein sequences, protein-protein interaction networks and other data types. These diverse data types can be encoded as fixed-length feature vectors, represented via pairwise similarities (kernels) or accommodated directly by a probability model.
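The kernel route in Figure 5 requires only a pairwise similarity function between objects. As one hypothetical example, a simple k-mer spectrum kernel compares two sequences through the inner product of their k-mer count vectors; the sequences below are invented for illustration.

```python
# Sketch of the "pairwise similarities (kernels)" route in Fig. 5, using a
# toy k-mer spectrum kernel (the choice of kernel is illustrative).

def kmer_counts(seq, k=2):
    """Count vector of overlapping k-mers, stored sparsely as a dict."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def spectrum_kernel(x, y, k=2):
    """K(x, y) = inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[m] * cy.get(m, 0) for m in cx)

# A kernel method never sees the raw sequences, only this Gram matrix.
seqs = ["ACGT", "ACGA", "TTTT"]
gram = [[spectrum_kernel(a, b) for b in seqs] for a in seqs]
print(gram)
```

Because a kernel method interacts with the data only through such a Gram matrix, expression profiles, sequences and networks can each contribute their own kernel and then be combined.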
Figure 6. Inferring network structure
Methods that infer each relationship in a network separately, such as computing the correlation between each pair of nodes, can be confounded by indirect relationships. Methods that infer the network as a whole can instead identify only the direct relationships. Inferring the direction of causality in a network is generally harder than inferring the network structure [68], so many network inference methods, such as Gaussian graphical model learning, infer only the undirected network.
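The distinction drawn in Figure 6 can be illustrated with simulated data (the chain X → Y → Z below is invented): marginal correlation reports a spurious X-Z edge, while the partial correlation derived from the inverse covariance matrix, as used in Gaussian graphical models, suppresses it.

```python
import numpy as np

# Simulate a chain X -> Y -> Z: X and Z are associated only through Y.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)
z = y + 0.5 * rng.normal(size=n)
data = np.stack([x, y, z])  # rows = variables, columns = samples

# Pairwise route: marginal correlations flag the indirect X-Z relationship.
corr = np.corrcoef(data)

# Whole-network route: the precision (inverse covariance) matrix yields
# partial correlations, which vanish for conditionally independent pairs.
prec = np.linalg.inv(np.cov(data))
pcorr_xz = -prec[0, 2] / np.sqrt(prec[0, 0] * prec[2, 2])

print(corr[0, 2], pcorr_xz)  # large marginal correlation, near-zero partial
```

The marginal X-Z correlation is far from zero even though X and Z share no direct edge, whereas the partial correlation given Y is approximately zero, which is why joint inference recovers only the direct relationships.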
