Review. 2015 Jun;16(6):321-32. doi: 10.1038/nrg3920. Epub 2015 May 7.

Machine learning applications in genetics and genomics

Maxwell W Libbrecht et al. Nat Rev Genet. 2015 Jun.

Abstract

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.


Figures

Figure 1. Machine learning
A canonical example of a machine learning application. A training set of DNA sequences is provided as input to a learning procedure, along with binary labels indicating whether or not each sequence is centered on a transcription start site (TSS). The learning algorithm produces a model that can subsequently be used, in conjunction with a prediction algorithm, to assign predicted labels to unlabeled test sequences.
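To make the train/predict pipeline of Figure 1 concrete, the following sketch trains a deliberately minimal classifier. The sequences, labels, and the single GC-content feature are all invented for illustration; real TSS predictors use far richer features and models.

```python
# Minimal sketch of the Fig. 1 pipeline (hypothetical data; GC content is a
# stand-in for the much richer features real TSS predictors use).

def gc_content(seq):
    """Fraction of G/C bases: the single feature used by this toy model."""
    return sum(base in "GC" for base in seq) / len(seq)

def train(sequences, labels):
    """Learning procedure: record one mean GC content per class label."""
    model = {}
    for cls in set(labels):
        feats = [gc_content(s) for s, l in zip(sequences, labels) if l == cls]
        model[cls] = sum(feats) / len(feats)
    return model

def predict(model, seq):
    """Prediction algorithm: assign the class with the nearest centroid."""
    f = gc_content(seq)
    return min(model, key=lambda cls: abs(model[cls] - f))

# Labeled training set, as in the left half of Fig. 1.
train_seqs = ["GCGCGCGC", "GGCCGGCC", "ATATATAT", "AATTAATT"]
train_labels = ["TSS", "TSS", "non-TSS", "non-TSS"]
model = train(train_seqs, train_labels)

# Unlabeled test sequence, as in the right half of Fig. 1.
print(predict(model, "GCGCATGC"))
```

The "model" here is just two class centroids, but the separation into a learning procedure (which sees labels) and a prediction algorithm (which does not) mirrors the figure.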
Figure 2. A gene-finding model
A simplified gene-finding model that captures the basic properties of a protein-coding gene. The model takes as input the DNA sequence of a chromosome, or a portion thereof, and produces detailed gene annotations as output. Note that this simplified model cannot identify overlapping genes or multiple isoforms of the same gene.
Figure 3. Two models of TF binding
(A) A position-specific frequency matrix (PSFM) model, in which the entry in row i and column j gives the frequency of the ith base at position j in the training set. (B) A linear support vector machine (SVM) model of TF binding. Labeled positive and negative training examples are provided as input, and a learning procedure adjusts the weights on the edges to predict the given labels. (C) Mean accuracy (± 95% confidence intervals) of predicting TF binding on a set of 500 simulated test sets, as a function of the number of training examples; the two series correspond to the PSFM model and the SVM. (D) Schematic of generative versus discriminative modeling: the generative model characterizes both classes completely, whereas the discriminative model focuses on the boundary between the classes.
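A PSFM like the one in panel (A) is typically used to score candidate binding sites by comparing their likelihood under the matrix to a background model. The matrix entries below are invented for illustration; only the scoring arithmetic reflects the standard log-odds formulation.

```python
import math

# Hypothetical 4-column PSFM over A/C/G/T (frequencies are illustrative only;
# each column sums to 1, as in Fig. 3A).
psfm = {
    "A": [0.80, 0.10, 0.10, 0.70],
    "C": [0.10, 0.10, 0.70, 0.10],
    "G": [0.05, 0.70, 0.10, 0.10],
    "T": [0.05, 0.10, 0.10, 0.10],
}

def log_odds(seq, matrix, background=0.25):
    """Log-likelihood ratio of a candidate site versus a uniform background."""
    return sum(math.log(matrix[base][j] / background)
               for j, base in enumerate(seq))

# A sequence matching the preferred bases scores positively; a poor match
# scores negatively.
print(log_odds("AGCA", psfm), log_odds("TTTT", psfm))
```

This generative scoring rule uses only the per-class frequencies, in contrast to the SVM of panel (B), which learns weights directly from the boundary between positive and negative examples.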
Figure 4. Incorporating a prior into a PSFM
A simple, principled way to place a probabilistic prior on a PSFM is to augment the observed nucleotide counts with "pseudocounts" and then compute frequencies with respect to the augmented total. The magnitude of the pseudocount corresponds to the weight assigned to the prior.
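The pseudocount scheme of Figure 4 amounts to a few lines of arithmetic. The observed counts and the pseudocount weight below are made up; the point is that every estimated frequency stays strictly positive and the column still sums to one.

```python
# Pseudocount prior for one PSFM column (Fig. 4). Counts are hypothetical:
# 10 training sites observed at this position.
observed = {"A": 6, "C": 1, "G": 2, "T": 1}
pseudocount = 1.0  # weight of the uniform prior; larger = stronger prior

# Augment each count, then normalise by the augmented total.
total = sum(observed.values()) + 4 * pseudocount
frequencies = {base: (count + pseudocount) / total
               for base, count in observed.items()}

print(frequencies)
```

With a pseudocount of 1, a base never observed in training would still receive a non-zero frequency, so a single unseen base cannot drive a site's likelihood to zero.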
Figure 5. Three ways to accommodate heterogeneous data in machine learning
The task of predicting gene function labels requires methods that take as input gene expression data, protein sequences, protein-protein interaction networks and other data types. These diverse data types can be encoded as fixed-length feature vectors, represented via pairwise similarities (kernels) or accommodated directly by a probability model.
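The kernel route in Figure 5 requires only a pairwise similarity function between objects. As one hypothetical example, a simple k-mer spectrum kernel compares two sequences through the inner product of their k-mer count vectors; the sequences below are invented for illustration.

```python
# Sketch of the "pairwise similarities (kernels)" route in Fig. 5, using a
# toy k-mer spectrum kernel (the choice of kernel is illustrative).

def kmer_counts(seq, k=2):
    """Count vector of overlapping k-mers, stored sparsely as a dict."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def spectrum_kernel(x, y, k=2):
    """K(x, y) = inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[m] * cy.get(m, 0) for m in cx)

# A kernel method never sees the raw sequences, only this Gram matrix.
seqs = ["ACGT", "ACGA", "TTTT"]
gram = [[spectrum_kernel(a, b) for b in seqs] for a in seqs]
print(gram)
```

Because a kernel method interacts with the data only through such a Gram matrix, expression profiles, sequences and networks can each contribute their own kernel and then be combined.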
Figure 6. Inferring network structure
Methods that infer each relationship in a network separately, such as computing the correlation between each pair of nodes, can be confounded by indirect relationships. Methods that infer the network as a whole can instead identify only the direct relationships. Inferring the direction of causality in a network is generally harder than inferring the network structure [68], so many network inference methods, such as Gaussian graphical model learning, infer only the undirected network.
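The distinction drawn in Figure 6 can be illustrated with simulated data (the chain X → Y → Z below is invented): marginal correlation reports a spurious X-Z edge, while the partial correlation derived from the inverse covariance matrix, as used in Gaussian graphical models, suppresses it.

```python
import numpy as np

# Simulate a chain X -> Y -> Z: X and Z are associated only through Y.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)
z = y + 0.5 * rng.normal(size=n)
data = np.stack([x, y, z])  # rows = variables, columns = samples

# Pairwise route: marginal correlations flag the indirect X-Z relationship.
corr = np.corrcoef(data)

# Whole-network route: the precision (inverse covariance) matrix yields
# partial correlations, which vanish for conditionally independent pairs.
prec = np.linalg.inv(np.cov(data))
pcorr_xz = -prec[0, 2] / np.sqrt(prec[0, 0] * prec[2, 2])

print(corr[0, 2], pcorr_xz)  # large marginal correlation, near-zero partial
```

The marginal X-Z correlation is far from zero even though X and Z share no direct edge, whereas the partial correlation given Y is approximately zero, which is why joint inference recovers only the direct relationships.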
