Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics
- PMID: 26555596
- PMCID: PMC4640716
- DOI: 10.1371/journal.pone.0141287
Abstract
We introduce a new representation and feature extraction method for biological sequences. We call the representation bio-vectors (BioVec) for biological sequences in general, protein-vectors (ProtVec) for proteins (amino-acid sequences), and gene-vectors (GeneVec) for gene sequences; it can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be used in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93% ± 0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that, given only sequence data for various proteins, this model can recover accurate information about protein structure. Importantly, the model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
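As a rough illustration of what "a single dense n-dimensional vector" per sequence can look like, the following minimal Python sketch splits a sequence into overlapping k-mers and sums their embedding vectors. The abstract does not specify these details, so everything here is an assumption made for illustration: the choice of k = 3, the summation step, the function names (`kmers`, `toy_embedding_table`, `sequence_vector`), and above all the embedding table, which is filled with random numbers in place of vectors learned by a neural network.

```python
import random


def kmers(sequence, k=3):
    """Return the overlapping k-mers of a sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding_table(vocabulary, dim, seed=0):
    """Hypothetical stand-in for a trained embedding: one random vector per k-mer."""
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for word in sorted(vocabulary)
    }


def sequence_vector(sequence, table, dim, k=3):
    """Sum the vectors of all k-mers present in the table; unknown k-mers are skipped."""
    total = [0.0] * dim
    for word in kmers(sequence, k):
        vec = table.get(word)
        if vec is None:
            continue
        for j, value in enumerate(vec):
            total[j] += value
    return total


# Illustrative input only: an arbitrary amino-acid string, not taken from the paper.
seq = "MKTAYIAKQR"
dim = 5
table = toy_embedding_table(set(kmers(seq)), dim)
vec = sequence_vector(seq, table, dim)
```

With a trained table in place of the random one, fixed-length vectors of this kind could be passed to a standard classifier such as a support vector machine, which is the classifier type the abstract reports using for the disorder experiments.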
Similar articles
- Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8. PMID: 31847804. Free PMC article.
- Prevalence and functionality of intrinsic disorder in human FG-nucleoporins. Int J Biol Macromol. 2021 Apr 1;175:156-170. doi: 10.1016/j.ijbiomac.2021.01.218. Epub 2021 Feb 3. PMID: 33548309.
- Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model. Genes (Basel). 2019 Nov 12;10(11):924. doi: 10.3390/genes10110924. PMID: 31726752. Free PMC article.
- A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction. Brief Bioinform. 2019 Jan 18;20(1):330-346. doi: 10.1093/bib/bbx126. PMID: 30657889. Review.
- Deep Learning and Its Applications in Biomedicine. Genomics Proteomics Bioinformatics. 2018 Feb;16(1):17-32. doi: 10.1016/j.gpb.2017.07.003. Epub 2018 Mar 6. PMID: 29522900. Free PMC article. Review.
Cited by
- Learning the molecular grammar of protein condensates from sequence determinants and embeddings. Proc Natl Acad Sci U S A. 2021 Apr 13;118(15):e2019053118. doi: 10.1073/pnas.2019053118. PMID: 33827920. Free PMC article.
- Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci. 2021 Mar 12;22(6):2903. doi: 10.3390/ijms22062903. PMID: 33809353. Free PMC article. Review.
- PITHIA: Protein Interaction Site Prediction Using Multiple Sequence Alignments and Attention. Int J Mol Sci. 2022 Oct 24;23(21):12814. doi: 10.3390/ijms232112814. PMID: 36361606. Free PMC article.
- A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 2019 Oct 10;47(18):e110. doi: 10.1093/nar/gkz654. PMID: 31400112. Free PMC article.
- ProtPlat: an efficient pre-training platform for protein classification based on FastText. BMC Bioinformatics. 2022 Feb 11;23(1):66. doi: 10.1186/s12859-022-04604-2. PMID: 35148686. Free PMC article.