PLoS One. 2015 Nov 10;10(11):e0141287.

doi: 10.1371/journal.pone.0141287. eCollection 2015.

Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

Ehsaneddin Asgari¹, Mohammad R K Mofrad¹ ²

Affiliations

¹ Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America.
² Physical Biosciences Division, Lawrence Berkeley National Lab, Berkeley, California 94720, United States of America.

PMID: 26555596
PMCID: PMC4640716
DOI: 10.1371/journal.pone.0141287

Ehsaneddin Asgari et al. PLoS One. 2015.


Abstract

We introduce a new representation and feature extraction method for biological sequences. We call the general representation bio-vectors (BioVec), with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences. The representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot and belonging to 7,027 protein families, obtaining an average family classification accuracy of 93% ± 0.06%, which outperforms existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that, given only sequence data for various proteins, this model can determine accurate information about protein structure. Importantly, the model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website (http://llp.berkeley.edu) and Harvard Dataverse (http://dx.doi.org/10.7910/DVN/JMFHTN).
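The abstract states that a protein is represented by a single dense n-dimensional vector but does not say how per-word embeddings are combined. The following minimal sketch assumes that each overlapping 3-gram has a pre-trained embedding and that the protein vector is their element-wise sum. The function names, the toy lookup table, and the choice to skip unknown 3-grams are illustrative assumptions, not details taken from the abstract.

```python
from typing import Dict, List


def overlapping_ngrams(sequence: str, n: int = 3) -> List[str]:
    """Return every overlapping n-gram of the sequence, left to right."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def embed_protein(sequence: str, table: Dict[str, List[float]], dim: int) -> List[float]:
    """Sum the embeddings of all 3-grams found in the lookup table.

    Assumption: summation is the aggregation rule, and 3-grams missing
    from the table are skipped rather than mapped to an "unknown" vector.
    """
    total = [0.0] * dim
    for gram in overlapping_ngrams(sequence):
        vector = table.get(gram)
        if vector is None:
            continue
        total = [t + v for t, v in zip(total, vector)]
    return total


# Hypothetical 2-dimensional embeddings; a trained model would supply
# vectors for all 3-grams in a higher-dimensional space.
toy_table = {
    "MAF": [1.0, 0.0],
    "AFS": [0.0, 1.0],
    "FSA": [0.5, 0.5],
}

protein_vector = embed_protein("MAFSA", toy_table, dim=2)
print(protein_vector)
```

A fixed-length vector of this kind can then be passed to any standard classifier, such as the support vector machines mentioned in the abstract.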

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Fig 1. Protein sequence splitting.**
In order to prepare the training data, each protein sequence will be represented as three sequences (1, 2, 3) of 3-grams.
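The splitting described in this caption can be sketched as follows. This is a minimal illustration that assumes the three sequences are obtained by reading non-overlapping 3-grams from start offsets 0, 1, and 2, and that trailing residues too short to form a full 3-gram are dropped. The function name and the example sequence are hypothetical.

```python
from typing import List


def split_into_3gram_sequences(sequence: str, n: int = 3) -> List[List[str]]:
    """Return n lists of non-overlapping n-grams, one per start offset.

    Assumption: leftover residues at the end that do not fill a complete
    n-gram are discarded.
    """
    splits = []
    for offset in range(n):
        grams = [
            sequence[i:i + n]
            for i in range(offset, len(sequence) - n + 1, n)
        ]
        splits.append(grams)
    return splits


# Hypothetical toy sequence of ten residues.
example = "MAFSAEDVLK"
for index, grams in enumerate(split_into_3gram_sequences(example), start=1):
    print(index, grams)
# 1 ['MAF', 'SAE', 'DVL']
# 2 ['AFS', 'AED', 'VLK']
# 3 ['FSA', 'EDV']
```

Each of the three lists can be treated as one "sentence" of 3-gram "words" when preparing a training corpus for a word-embedding model.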

**Fig 2. Normalized distributions of biochemical and biophysical properties in protein-space.**
In these plots, each point represents a 3-gram (a word of three residues) and the colors indicate the scale for each property. Data points in these plots are projected from a 100-dimensional space to a 2D space using t-SNE. As shown, words with similar properties are automatically clustered together, meaning that the properties are smoothly distributed in this space.

**Fig 3. Visualization of protein sequences using ProtVec can characterize FG-Nups versus DisProt disordered sequences and structured sequences.**
Column (a) compares the 2D histogram of FG-Nup sequences (bottom) with the 2D histogram of FG-Nup disordered regions (top). Column (b) compares the 2D histograms of two random sets of structured sequences with the same average length as the FG-Nups. Column (c) compares the 2D histogram of DisProt sequences (bottom) with the 2D histogram of DisProt disordered regions (top).

**Fig 4. Classification of FG-Nups versus PDB structured sequences.**
In this figure, each point represents a protein projected into a 2D space.


Cited by

Learning the molecular grammar of protein condensates from sequence determinants and embeddings.
Saar KL, Morgunov AS, Qi R, Arter WE, Krainer G, Lee AA, Knowles TPJ. Proc Natl Acad Sci U S A. 2021 Apr 13;118(15):e2019053118. doi: 10.1073/pnas.2019053118. PMID: 33827920. Free PMC article.
Incorporating Machine Learning into Established Bioinformatics Frameworks.
Auslander N, Gussow AB, Koonin EV. Int J Mol Sci. 2021 Mar 12;22(6):2903. doi: 10.3390/ijms22062903. PMID: 33809353. Free PMC article. Review.
PITHIA: Protein Interaction Site Prediction Using Multiple Sequence Alignments and Attention.
Hosseini S, Ilie L. Int J Mol Sci. 2022 Oct 24;23(21):12814. doi: 10.3390/ijms232112814. PMID: 36361606. Free PMC article.
A deep learning genome-mining strategy for biosynthetic gene cluster prediction.
Hannigan GD, Prihoda D, Palicka A, Soukup J, Klempir O, Rampula L, Durcak J, Wurst M, Kotowski J, Chang D, Wang R, Piizzi G, Temesi G, Hazuda DJ, Woelk CH, Bitton DA. Nucleic Acids Res. 2019 Oct 10;47(18):e110. doi: 10.1093/nar/gkz654. PMID: 31400112. Free PMC article.
ProtPlat: an efficient pre-training platform for protein classification based on FastText.
Jin Y, Yang Y. BMC Bioinformatics. 2022 Feb 11;23(1):66. doi: 10.1186/s12859-022-04604-2. PMID: 35148686. Free PMC article.


