Bioinformatics. 2017 Nov 15;33(22):3685-3690. doi: 10.1093/bioinformatics/btx531.

An introduction to deep learning on biological sequence data: examples and solutions

Vanessa Isabell Jurtz et al. Bioinformatics, 2017.

Abstract

Motivation: Deep neural network architectures such as convolutional and long short-term memory networks have become increasingly popular machine learning tools in recent years. The availability of greater computational resources, more data, new algorithms for training deep models and easy-to-use libraries for implementing and training neural networks are the drivers of this development. The use of deep learning has been especially successful in image recognition, and the development of tools, applications and code examples is, in most cases, centered on this field rather than on biology.

Results: Here, we aim to further the development of deep learning methods within biology by providing application examples and ready-to-apply, adaptable code templates. Using these examples, we illustrate how architectures consisting of convolutional and long short-term memory neural networks can relatively easily be designed and trained to state-of-the-art performance on three biological sequence problems: prediction of subcellular localization, protein secondary structure and the binding of peptides to MHC Class II molecules.

Availability and implementation: All implementations and datasets are available online to the scientific community at https://github.com/vanessajurtz/lasagne4bio.

Contact: skaaesonderby@gmail.com.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
(A) Feed-forward network. Amino acids C, A, D, A, D are encoded as ‘one-hot’ vectors with a 1 at the position corresponding to the amino acid type (A, C or D) and zeros otherwise. (B) Convolutional neural network. A filter (blue) is slid over the input sequence. The filter here has a length of three amino acids. At each position the filter has a preference for different amino acid types. The filter output is calculated as the sum of the element-wise product of the input and the filter's position-specific weights. Each time the filter is moved, it feeds into a different hidden neuron in the hidden layer, here visualized in the f1 row. Multiple filters give multiple inputs to the next layer {f1, f2, f3, …}. (C) A filter can be visualized as a sequence motif. This helps to understand which amino acids the filter prefers at each sequence position. When the filter is slid over the input sequence, it functions as a motif detector and becomes activated when the input matches its preference. For example, this filter has a negative output for the sub-sequence ADC and a positive output for DCD.
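To make the one-hot encoding and filter arithmetic of panels (A) and (B) concrete, the following is a minimal NumPy sketch using the figure's toy three-letter alphabet; the filter weights are invented for illustration and are not taken from the paper's models:

```python
import numpy as np

# Toy alphabet matching the figure: A, C, D
ALPHABET = {'A': 0, 'C': 1, 'D': 2}

def one_hot(seq):
    """Encode a sequence as a (len(seq), alphabet size) one-hot matrix."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        x[i, ALPHABET[aa]] = 1.0
    return x

x = one_hot("CADAD")                # shape (5, 3), as in panel (A)

# One convolutional filter spanning three sequence positions: a weight per
# position per amino acid type (invented values, for illustration only).
w = np.array([[ 0.5, -1.0,  1.0],   # position 1 preferences for A, C, D
              [-0.5,  1.0, -1.0],   # position 2
              [ 1.0, -1.0,  0.5]])  # position 3

# Slide the filter along the sequence; each placement takes the sum of the
# element-wise product of input and weights, giving one hidden unit f1[t].
f1 = np.array([np.sum(x[t:t + 3] * w) for t in range(len(x) - 3 + 1)])
print(f1)  # one activation per placement; more filters give f2, f3, ...
```

Because the filter weights are shared across all placements, the same motif is detected wherever it occurs in the sequence, which is what makes the filter-as-motif reading in panel (C) possible.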
Fig. 2.
(A) Schematic illustration of subcellular localization classification. (B) The neural network architecture used to predict the subcellular localization of proteins. (C) Visualization of the positions within the protein amino acid sequence that have high importance for the prediction of subcellular localization. Sequence position importance is determined by an attention function, and the middle part of the protein sequences has been cut out in order to align the N- and C-termini. The different subcellular localization classes are shown on the y-axis. (D) Table of the A-LSTM performance compared to the state-of-the-art sequence-driven SVM prediction method MultiLoc. (E) Visualization of a convolutional filter. For this filter, charged amino acids suppress the output (blue, red) while hydrophobic amino acids increase the output (black). (C) and (D) are adapted from (Sønderby et al., 2015).
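As a rough sketch of how an attention function can assign per-position importances like those in panel (C): score each LSTM hidden state, softmax the scores into weights, and form a weighted sum. The dimensions, the random hidden states and the scoring vector v below are illustrative assumptions, not the paper's A-LSTM parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in LSTM hidden states for a protein of length T: one d-dimensional
# vector per residue (random here; in the real model they come from the LSTM).
T, d = 50, 16
H = rng.standard_normal((T, d))

# Attention: score each position with a learned vector v (random stand-in),
# then softmax the scores into per-position importances a_t.
v = rng.standard_normal(d)
scores = H @ v                       # (T,)
a = np.exp(scores - scores.max())
a /= a.sum()                         # attention weights summing to 1

context = a @ H                      # weighted sum over positions, shape (d,)
# The weights a_t are what panel (C) visualizes: how strongly each residue
# contributes to the localization prediction.
```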
Fig. 3.
(A) Visualization of the task of secondary structure prediction based on the protein amino acid sequence. (B) A flowchart showing the succession of different layers in our neural network model for predicting protein secondary structure. The skip connection is implemented by concatenating the output of the CNN layer with the amino acid input. (C) Performance of our model compared to the state-of-the-art DeepCNF method (Wang et al., 2016).
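A minimal sketch of the concatenation skip connection described in panel (B), assuming per-residue input features and a toy 'same'-padded 1D convolution; all sizes and weights below are illustrative, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)

T, n_feat, n_filters, k = 100, 20, 8, 3        # illustrative sizes
x = rng.standard_normal((T, n_feat))           # per-residue input features

# Toy 1D convolution with 'same' padding so the output keeps length T.
w = rng.standard_normal((n_filters, k, n_feat))
pad = k // 2
xp = np.pad(x, ((pad, pad), (0, 0)))
conv = np.array([[np.sum(xp[t:t + k] * w[f]) for f in range(n_filters)]
                 for t in range(T)])           # (T, n_filters)

# Skip connection: concatenate the CNN output with the raw amino acid input
# per position before feeding the next (here omitted) layer.
h = np.concatenate([conv, x], axis=1)          # (T, n_filters + n_feat)
```

The concatenation lets downstream layers see both the learned local-context features and the unmodified amino acid identities.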
Fig. 4.
(A) MHCII molecules present peptides derived from the extracellular environment to T-helper cells. Here we predict which peptides are able to bind a given MHCII molecule, an important step towards identifying T-cell epitopes. (B) The CNN (left) and LSTM (right) architectures used to predict peptide binding to MHCII molecules. (C) Performance per MHCII allele of NetMHCIIpan-3.0, CNN + LSTM and the consensus method (NetMHCIIpan-3.0 and CNN + LSTM) on the evaluation set.
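The paper's exact consensus procedure is not spelled out in this caption; as a simple sketch, one common way to combine two predictors is to average their rank-normalized scores. All numbers below are invented for illustration:

```python
import numpy as np

def rank_normalize(scores):
    """Map raw scores to percentile ranks in [0, 1] (higher = stronger binder)."""
    ranks = scores.argsort().argsort()
    return ranks / (len(scores) - 1)

# Made-up binding predictions for the same four peptides from the two methods.
netmhciipan = np.array([0.72, 0.10, 0.55, 0.91])
cnn_lstm    = np.array([0.65, 0.20, 0.40, 0.88])

# Consensus: average the rank-normalized scores of the two methods.
consensus = (rank_normalize(netmhciipan) + rank_normalize(cnn_lstm)) / 2.0
print(consensus)
```

Rank normalization puts the two methods' outputs on a common scale, so neither dominates the average simply because its raw scores span a wider range.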

References

    1. Alipanahi B. et al. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33, 831–838.
    2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
    3. Andreatta M. et al. (2011) NNAlign: a web-based prediction method allowing non-expert end-user discovery of sequence motifs in quantitative peptide data. PLoS One, 6, e26781.
    4. Bahdanau D. et al. (2015) Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations (ICLR), arXiv:1409.0473.
    5. Bastien F. et al. (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv e-prints.