Using deep learning to annotate the protein universe
- PMID: 35190689
- DOI: 10.1038/s41587-021-01179-w
Abstract
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.
© 2022. The Author(s), under exclusive licence to Springer Nature America, Inc.
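The abstract's central idea, assigning a family label to an unaligned sequence by comparing fixed-length representations, can be illustrated with a deliberately small sketch. This is not the authors' model: the learned deep representation is swapped for a plain amino acid composition vector, the classifier is a nearest-centroid rule, and the family names and sequences are hypothetical toy data invented for illustration.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed(sequence):
    """Map an unaligned sequence of any length to a 20-dim composition vector.

    Assumption: this stands in for a learned representation; it is not
    the representation used in the paper.
    """
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_family(sequence, centroids):
    """Return the family whose centroid is closest to the query embedding."""
    query = embed(sequence)
    return min(centroids, key=lambda family: distance(query, centroids[family]))


# Hypothetical toy training set: two made-up "families".
training = {
    "family_A": ["KKKRKKRRKK", "RKKKRKRK"],
    "family_B": ["LLVVIILLAV", "VLLIAVLLI"],
}
centroids = {
    family: centroid([embed(s) for s in seqs])
    for family, seqs in training.items()
}

prediction = predict_family("KRKKRRKKKRK", centroids)
print(prediction)
```

No alignment is needed because every sequence, whatever its length, is reduced to a vector of the same size before comparison. A learned representation would take the place of `embed` in a realistic setting.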