Front Bioeng Biotechnol. 2022 Jul 7;10:788300. doi: 10.3389/fbioe.2022.788300. eCollection 2022.

Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field

Affiliations

¹ Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico.
² Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico.
³ Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico.
⁴ Instituto Nacional de Pediatría, Mexico City, Mexico.

PMID: 35875501
PMCID: PMC9301016
DOI: 10.3389/fbioe.2022.788300

Jalil Villalobos-Alva et al. Front Bioeng Biotechnol. 2022.

Abstract

Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a major challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and in proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space on the order of gigabytes of life data. Many tasks remain unexplored or unrevealed, such as novel protein design, protein folding pathways, synthetic metabolic routes, protein-aggregation mechanisms, the pathogenesis of protein misfolding and disease, and proteostasis networks.

In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between artificial intelligence (AI) and protein science (PS), which we call the AI-PS binomial, a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including: What kinds of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnosis of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions, such as: Is it possible to discover the rules of protein evolution with the AI-PS binomial? How do protein folding pathways evolve? What are the rules that dictate the folds? What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins?

We are a cross-functional group of scientists from several academic disciplines, and we conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were selected for further analysis. A summary of our findings is as follows. AI applications fall mainly into four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of ML algorithms, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniProtKB/Swiss-Prot were the most common (21% and 8%, respectively). Moreover, approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect and integrate multidimensional data features, and then to analyze and validate them, are so far uncoordinated and scattered throughout the scientific literature, without a clear epistemic goal or connection between the studies.

Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design and in the prediction of protein structure, design, and function, while also presenting the "state of the art" of research on the AI-PS binomial up to February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins, protein networks, and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of these novel proteins and metabolic pathways do not currently exist in nature, nor are they used in the chemical industry or the biomedical field.

Keywords: artificial intelligence; deep learning; drug design; machine learning; protein classification; protein design and engineering; protein prediction; proteins.

Copyright © 2022 Villalobos-Alva, Ochoa-Toledo, Villalobos-Alva, Aliseda, Pérez-Escamirosa, Altamirano-Bustamante, Ochoa-Fernández, Zamora-Solís, Villalobos-Alva, Revilla-Monsalve, Kemper-Valverde and Altamirano-Bustamante.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
A representative decision diagram showing the articles retrieved using the PIO strategy in the PubMed database. P (participants): Protein, Protein Design, Scaffold, Rational protein design, Biocatalysts. I (intervention): Networks: Neural networks, Recurrent neural networks, Networks LSTM/GRU, Convolutional neural network, Deep belief networks, Deep stacking networks C5.0; Genetic algorithms; Artificial intelligence; Decision trees; Classification; Prediction C&A; Software: Weka, RapidMiner, IBM Modeler; Programming Languages: Python, Java, OpenGL, C++ Shell; Development platform: Caffe Deep Learning, TensorFlow, IBM Distributed Deep Learning (DDL); Paradigm: Supervised Learning, Unsupervised learning, Reinforced learning, new function.

**FIGURE 2**
Flowchart of the article scaffold. Representation of the process followed throughout the entire article. The biochemical meta-analysis consists of three main steps: the systematic review, the road map design, and the road map alignment. In the systematic review, the research question is formulated in order to set the basis and objectives of the project. This step also includes the observation and synthesis of information obtained from a variety of articles and the correlations made between them, followed by the quality evaluation of the collected information. The road map design consists of analyzing and classifying the outcomes of the studies, so that the collected information can be interpreted and represented through figures and tables. This aims to cover a wide range of the state of the art of artificial intelligence. Finally, the road map alignment includes the final discussion and further changes for our understanding of protein science using AI and the resolution of possible protein science application targets.

**FIGURE 3**
Flowchart of the review process. A PRISMA flowchart of the systematic review on AI for protein sciences.

**FIGURE 4**
Machine learning paradigms: supervised learning, unsupervised learning, and reinforcement learning.

**FIGURE 5**
Machine learning and artificial intelligence applications to protein sciences. Information includes the number of studies, applications, databases, methods, and validation used.

**FIGURE 6**
Representation of the specific case of protein structure prediction in the supervised learning framework, showing the most common workflow followed by the studies analyzed: extraction, training data, feature extraction procedures, and data continuity. It includes the PDB database and the most common supervised algorithms (SVM, SVR, and 3DCNN).

**FIGURE 7**
Representation of the whole AI process based on the selected protein application. The process combines several steps: the protein application (protein design, protein classification, protein prediction, *etc*.); extraction (selection of database), transform (code development and filtering), and load (input of the training data) (ETL) for the training data; the feature extraction procedure, which is the building of the machine learning network; the outcome step; and a proposed server application.

Cited by

Artificial intelligence driven innovations in biochemistry: A review of emerging research frontiers.
Lateef Junaid MA. Biomol Biomed. 2025 Mar 7;25(4):739-750. doi: 10.17305/bb.2024.11537. PMID: 39819459. Free PMC article. Review.
Int&in: A machine learning-based web server for active split site identification in inteins.
Schmitz M, Ballestin JB, Liang J, Tomas F, Freist L, Voigt K, Di Ventura B, Öztürk MA. Protein Sci. 2024 Jun;33(6):e4985. doi: 10.1002/pro.4985. PMID: 38717278. Free PMC article.
