Learned protein embeddings for machine learning

Kevin K Yang et al. Bioinformatics. 2018 Aug 1;34(15):2642-2648. doi: 10.1093/bioinformatics/bty178.
Erratum in

  • Yang KK, Wu Z, Bedbrook CN, Arnold FH. Learned protein embeddings for machine learning. Bioinformatics. 2018 Dec 1;34(23):4138. doi: 10.1093/bioinformatics/bty455. PMID: 29933431. Free PMC article. No abstract available.

Abstract

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.

Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.
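As a rough illustration of the regression step described above, the sketch below fits a Gaussian process to embedding vectors. It assumes scikit-learn's GaussianProcessRegressor with an RBF-plus-noise kernel as a stand-in for the authors' GP code; the data are placeholders, not the paper's datasets.

    # Minimal sketch: GP regression on sequence embeddings (placeholder data).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 64))  # 100 sequences, 64-dim embeddings
    y_train = rng.normal(size=100)        # measured property, e.g. T50
    X_test = rng.normal(size=(20, 64))

    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_train, y_train)
    y_mean, y_std = gp.predict(X_test, return_std=True)  # predictions + uncertainty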

Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1. The modeling scheme. First, an unsupervised embedding model is trained on 524 529 unlabeled sequences pulled from the UniProt database. The UniProt sequences are broken into k lists of non-overlapping k-mers (Step 1), and then the lists are used to train the embedding model (Step 2). The doc2vec embedding model learns to predict the vectors for center k-mers from the vectors for their surrounding context k-mers and the sequence vectors. These sequence vectors are then the embedded representations of the sequences. Next, information learned during the unsupervised phase is applied during supervised learning with labeled sequences. The labeled sequences for each task (localization, T50, absorption and enantioselectivity) are first broken into k lists of non-overlapping k-mers (Step 3). An embedding is then inferred for each sequence using the trained embedding model (Step 4). n is the number of labeled sequences. Finally, during GP regression (Step 5), the inferred training embeddings X’ and the training labels y are used to train a GP regression model, which can then be used to make predictions.
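A minimal sketch of Steps 1-4, assuming gensim's Doc2Vec as the embedding model; the toy sequences, k and all hyperparameters are illustrative, not the paper's settings.

    # Break sequences into k offset lists of non-overlapping k-mers, train
    # a doc2vec model on the lists, then infer embeddings for new sequences.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    def kmer_lists(seq, k):
        """Return k lists of non-overlapping k-mers, one per starting offset."""
        return [[seq[i:i + k] for i in range(off, len(seq) - k + 1, k)]
                for off in range(k)]

    unlabeled = ["MKTAYIAKQRQISFVKSHFSRQ", "MSDNELKQLGV"]  # stand-ins for UniProt
    k = 3
    docs = [TaggedDocument(words=kmers, tags=[f"{i}_{off}"])
            for i, seq in enumerate(unlabeled)
            for off, kmers in enumerate(kmer_lists(seq, k))]

    # dm=1 is the distributed-memory model: center k-mers are predicted from
    # context k-mers together with the sequence (document) vector.
    model = Doc2Vec(docs, vector_size=64, window=5, min_count=1, dm=1, epochs=10)

    # Step 4: infer an embedding for a labeled sequence (one offset list shown).
    embedding = model.infer_vector(kmer_lists("MKTAYIAKQR", k)[0])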
Fig. 2. Effect of embedding dimension on predictive accuracy. For each task, embeddings of varying dimensions were trained and then used for GP regression. The resulting model quality was then evaluated using the Kendall τ and MAE.
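For concreteness, the two metrics used in Figures 2 and 3 can be computed as in the sketch below, assuming SciPy and scikit-learn; the arrays are placeholders, not measured values.

    # Kendall tau measures how well predictions rank the sequences;
    # MAE measures the average absolute error of the predicted values.
    import numpy as np
    from scipy.stats import kendalltau
    from sklearn.metrics import mean_absolute_error

    y_true = np.array([55.2, 61.0, 48.7, 66.3, 59.1])  # e.g. measured T50 values
    y_pred = np.array([54.0, 62.5, 50.1, 64.9, 58.0])  # GP predictive means

    tau, _ = kendalltau(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"Kendall tau = {tau:.2f}, MAE = {mae:.2f}")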
Fig. 3. Effect of number of unlabeled sequences on predictive accuracy. For each task, embeddings were trained on subsets of the UniProt sequences and then used for GP regression. The resulting model quality was then evaluated using the Kendall τ and MAE.
Fig. 4. Visualization of learned vector representations of protein sequences. Vector representations were projected onto two dimensions using t-SNE with perplexity 50 (embeddings, AAIndex, sequence) or 10 (ProFET). The sequences for the localization, T50 and enantioselectivity tasks are colored by the number of mutations from the nearest parent. The sequences for the absorption task are colored by peak absorption wavelength. Parents for the localization, T50 and enantioselectivity tasks are indicated by red triangles.
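A minimal sketch of the projection in this figure, assuming scikit-learn's TSNE with one of the caption's perplexity values; the embedding matrix is a placeholder.

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))  # 200 sequence embeddings, 64-dim

    coords = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X)
    # coords[:, 0], coords[:, 1] give the scatter positions; color the points
    # by mutation count or peak absorption wavelength as in the caption.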
Fig. 5. Combined visualization of vector representations for each of the four tasks. Sequences are colored to show separation between the embeddings for each task. (A color version of this figure is available at Bioinformatics online.)


