Learned protein embeddings for machine learning

Kevin K Yang et al. Bioinformatics. 2018 Aug 1;34(15):2642-2648. doi: 10.1093/bioinformatics/bty178.
Erratum in

  • Yang KK, Wu Z, Bedbrook CN, Arnold FH. Learned protein embeddings for machine learning. Bioinformatics. 2018 Dec 1;34(23):4138. doi: 10.1093/bioinformatics/bty455. PMID: 29933431. Free PMC article. No abstract available.

Abstract

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.

Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.
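As a rough illustration of the regression step described above, the sketch below fits a Gaussian process to embedding vectors. It assumes scikit-learn's GaussianProcessRegressor with an RBF-plus-noise kernel as a stand-in for the authors' GP code; the data are placeholders, not the paper's datasets.

    # Minimal sketch: GP regression on sequence embeddings (placeholder data).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 64))  # 100 sequences, 64-dim embeddings
    y_train = rng.normal(size=100)        # measured property, e.g. T50
    X_test = rng.normal(size=(20, 64))

    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_train, y_train)
    y_mean, y_std = gp.predict(X_test, return_std=True)  # predictions + uncertainty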

Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1. The modeling scheme. First, an unsupervised embedding model is trained on 524 529 unlabeled sequences pulled from the UniProt database. The UniProt sequences are broken into k lists of non-overlapping k-mers (Step 1), and then the lists are used to train the embedding model (Step 2). The doc2vec embedding model learns to predict the vectors for center k-mers from the vectors for their surrounding context k-mers and the sequence vectors. These sequence vectors are then the embedded representations of the sequences. Next, information learned during the unsupervised phase is applied during supervised learning with labeled sequences. The labeled sequences for each task (localization, T50, absorption and enantioselectivity) are first broken into k lists of non-overlapping k-mers (Step 3). An embedding is then inferred for each sequence using the trained embedding model (Step 4). n is the number of labeled sequences. Finally, during GP regression (Step 5), the inferred training embeddings X’ and the training labels y are used to train a GP regression model, which can then be used to make predictions.
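A minimal sketch of Steps 1-4, assuming gensim's Doc2Vec as the embedding model; the toy sequences, k and all hyperparameters are illustrative, not the paper's settings.

    # Break sequences into k offset lists of non-overlapping k-mers, train
    # a doc2vec model on the lists, then infer embeddings for new sequences.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    def kmer_lists(seq, k):
        """Return k lists of non-overlapping k-mers, one per starting offset."""
        return [[seq[i:i + k] for i in range(off, len(seq) - k + 1, k)]
                for off in range(k)]

    unlabeled = ["MKTAYIAKQRQISFVKSHFSRQ", "MSDNELKQLGV"]  # stand-ins for UniProt
    k = 3
    docs = [TaggedDocument(words=kmers, tags=[f"{i}_{off}"])
            for i, seq in enumerate(unlabeled)
            for off, kmers in enumerate(kmer_lists(seq, k))]

    # dm=1 is the distributed-memory model: center k-mers are predicted from
    # context k-mers together with the sequence (document) vector.
    model = Doc2Vec(docs, vector_size=64, window=5, min_count=1, dm=1, epochs=10)

    # Step 4: infer an embedding for a labeled sequence (one offset list shown).
    embedding = model.infer_vector(kmer_lists("MKTAYIAKQR", k)[0])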
Fig. 2. Effect of embedding dimension on predictive accuracy. For each task, embeddings of varying dimensions were trained and then used for GP regression. The resulting model quality was then evaluated using the Kendall τ and MAE.
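For concreteness, the two metrics used in Figures 2 and 3 can be computed as in the sketch below, assuming SciPy and scikit-learn; the arrays are placeholders, not measured values.

    # Kendall tau measures how well predictions rank the sequences;
    # MAE measures the average absolute error of the predicted values.
    import numpy as np
    from scipy.stats import kendalltau
    from sklearn.metrics import mean_absolute_error

    y_true = np.array([55.2, 61.0, 48.7, 66.3, 59.1])  # e.g. measured T50 values
    y_pred = np.array([54.0, 62.5, 50.1, 64.9, 58.0])  # GP predictive means

    tau, _ = kendalltau(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"Kendall tau = {tau:.2f}, MAE = {mae:.2f}")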
Fig. 3. Effect of number of unlabeled sequences on predictive accuracy. For each task, embeddings were trained on subsets of the UniProt sequences and then used for GP regression. The resulting model quality was then evaluated using the Kendall τ and MAE.
Fig. 4. Visualization of learned vector representations of protein sequences. Vector representations were projected onto two dimensions using t-SNE with perplexity 50 (embeddings, AAIndex, sequence) or 10 (ProFET). The sequences for the localization, T50 and enantioselectivity tasks are colored by the number of mutations from the nearest parent. The sequences for the absorption task are colored by peak absorption wavelength. Parents for the localization, T50 and enantioselectivity tasks are indicated by red triangles.
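A minimal sketch of the projection in this figure, assuming scikit-learn's TSNE with one of the caption's perplexity values; the embedding matrix is a placeholder.

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))  # 200 sequence embeddings, 64-dim

    coords = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X)
    # coords[:, 0], coords[:, 1] give the scatter positions; color the points
    # by mutation count or peak absorption wavelength as in the caption.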
Fig. 5. Combined visualization of vector representations for each of the four tasks. Sequences are colored to show separation between the embeddings for each task. (A color version of this figure is available at Bioinformatics online.)


