2021 Nov;2021:46-57.
doi: 10.1109/mlhpc54614.2021.00010. Epub 2021 Dec 27.

High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function


Mu Gao et al. Workshop Mach Learn HPC Environ. 2021 Nov.

Abstract

Computational biology is one of many scientific disciplines ripe for innovation and acceleration with the advent of high-performance computing (HPC). In recent years, the field of machine learning has also seen significant benefits from adopting HPC practices. In this work, we present a novel HPC pipeline that incorporates various machine-learning approaches for structure-based functional annotation of proteins on the scale of whole genomes. Our pipeline makes extensive use of deep learning and provides computational insights into best practices for training advanced deep-learning models on high-throughput data such as proteomics data. We showcase the methodologies our pipeline currently supports and outline future capabilities we plan to incorporate, including large-scale sequence comparison using SAdLSA and prediction of protein tertiary structures using AlphaFold2.
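The scale-out strategy the abstract describes, parallelizing each query's inference work over the entries of a large target database, can be sketched minimally. The `shard` helper below is a hypothetical illustration (not code from the paper) of how database entries might be assigned to ranks:

```python
def shard(database, rank, world_size):
    """Assign rank `rank` a contiguous slice of the target database,
    mirroring a data-parallel layout that spreads one query's
    inference work (e.g., against Pfam) across many nodes."""
    n = len(database)
    per_rank = (n + world_size - 1) // world_size  # ceiling division
    return database[rank * per_rank:(rank + 1) * per_rank]

# Every entry lands on exactly one rank, so per-rank results can be
# concatenated into the full set of alignment scores for the query.
```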

Keywords: computational biology; deep learning; high-performance computing; machine learning; protein sequence alignment; protein structure prediction.


Figures

Fig. 1.
Overview of SAdLSA, a deep-learning algorithm for protein sequence alignment. The symbol ⊗ denotes outer concatenation of two embedding vectors.
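The outer concatenation ⊗ in the caption can be illustrated with a small NumPy sketch (an assumption about how the operation is realized; the paper's implementation details may differ): for per-residue embeddings of shapes (L1, d) and (L2, d), it builds an (L1, L2, 2d) pairwise feature tensor in which entry (i, j) is the concatenation of embedding i from one sequence and embedding j from the other.

```python
import numpy as np

def outer_concat(a, b):
    """Outer concatenation of two embedding matrices:
    a has shape (L1, d), b has shape (L2, d); the result has shape
    (L1, L2, 2*d), with result[i, j] == [a[i]; b[j]]."""
    L1, d = a.shape
    L2, _ = b.shape
    left = np.broadcast_to(a[:, None, :], (L1, L2, d))
    right = np.broadcast_to(b[None, :, :], (L1, L2, d))
    return np.concatenate([left, right], axis=-1)
```

A tensor of this shape is the natural input for a 2D convolutional network that scores residue pairs between the two sequences.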
Fig. 2.
Scheme for SAdLSA deployment at scale on Summit.
Fig. 3.
Performance of SAdLSA on large sets of proteins and scaling across Summit. A: Runtime, in seconds, against the Pfam database for each sequence using one node. B: Runtime, in seconds, against the Pfam database for each sequence using one thousand nodes. Parallelization was over the Pfam database for each input sequence for the inference calculation.
Fig. 4.
Performance of SAdLSA on the PDB70 database on Summit. Top: Weak scaling: total time, in minutes, for a given number of sequences to be aligned to PDB70, using 12 sequences on 96 nodes and 120 sequences on 1,000 nodes. The set of 12 was chosen to have a length distribution similar to that of the set of 120. Bottom: Runtime, in seconds, against the PDB70 database for each sequence in the 120-sequence set used in Figure 3 above, using one thousand nodes. Parallelization was over the PDB70 database for each input sequence for the inference calculation.
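Weak scaling like that in the top panel is commonly summarized as a single efficiency number. The helper below is a generic sketch of that calculation (the figure's actual timings are not reproduced here); in both runs the work per node is held roughly constant, so ideal scaling would keep the total runtime flat:

```python
def weak_scaling_efficiency(t_base, t_scaled):
    """Weak-scaling efficiency: with the workload per node held
    roughly fixed as node count grows, ideal scaling keeps runtime
    constant, giving an efficiency of 1.0; values below 1.0 indicate
    parallel overhead."""
    return t_base / t_scaled
```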
Fig. 5.
Scheme of the model-parallelization method on a Summit node for domain boundary prediction.
Fig. 6.
Simplified scheme for annotation pipeline with consensus using our toolbox.
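The consensus step in the caption can be viewed as a reduction over per-method predictions for each protein. A minimal majority-vote sketch (hypothetical; the toolbox's actual consensus rule is not specified here):

```python
from collections import Counter

def consensus(annotations):
    """Return the annotation agreed on by a strict majority of the
    contributing methods, or None when no majority exists."""
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count * 2 > len(annotations) else None
```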

