2021 Nov;2021:46-57.
doi: 10.1109/mlhpc54614.2021.00010. Epub 2021 Dec 27.

High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function


Mu Gao et al. Workshop Mach Learn HPC Environ. 2021 Nov.

Abstract

Computational biology is one of many scientific disciplines ripe for innovation and acceleration with the advent of high-performance computing (HPC). In recent years, the field of machine learning has also seen significant benefits from adopting HPC practices. In this work, we present a novel HPC pipeline that incorporates various machine-learning approaches for structure-based functional annotation of proteins on the scale of whole genomes. Our pipeline makes extensive use of deep learning and provides computational insights into best practices for training advanced deep-learning models on high-throughput data such as proteomics data. We showcase the methodologies our pipeline currently supports and outline future capabilities we plan to incorporate, including large-scale sequence comparison using SAdLSA and prediction of protein tertiary structures using AlphaFold2.
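The scale-out strategy the abstract describes, parallelizing each query's inference work over the entries of a large target database, can be sketched minimally. The `shard` helper below is a hypothetical illustration (not code from the paper) of how database entries might be assigned to ranks:

```python
def shard(database, rank, world_size):
    """Assign rank `rank` a contiguous slice of the target database,
    mirroring a data-parallel layout that spreads one query's
    inference work (e.g., against Pfam) across many nodes."""
    n = len(database)
    per_rank = (n + world_size - 1) // world_size  # ceiling division
    return database[rank * per_rank:(rank + 1) * per_rank]

# Every entry lands on exactly one rank, so per-rank results can be
# concatenated into the full set of alignment scores for the query.
```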

Keywords: computational biology; deep learning; high-performance computing; machine learning; protein sequence alignment; protein structure prediction.


Figures

Fig. 1.
Overview of SAdLSA, a deep-learning algorithm for protein sequence alignment. The symbol ⊗ denotes outer concatenation of two embedding vectors.
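The outer concatenation ⊗ in the caption can be illustrated with a small NumPy sketch (an assumption about how the operation is realized; the paper's implementation details may differ): for per-residue embeddings of shapes (L1, d) and (L2, d), it builds an (L1, L2, 2d) pairwise feature tensor in which entry (i, j) is the concatenation of embedding i from one sequence and embedding j from the other.

```python
import numpy as np

def outer_concat(a, b):
    """Outer concatenation of two embedding matrices:
    a has shape (L1, d), b has shape (L2, d); the result has shape
    (L1, L2, 2*d), with result[i, j] == [a[i]; b[j]]."""
    L1, d = a.shape
    L2, _ = b.shape
    left = np.broadcast_to(a[:, None, :], (L1, L2, d))
    right = np.broadcast_to(b[None, :, :], (L1, L2, d))
    return np.concatenate([left, right], axis=-1)
```

A tensor of this shape is the natural input for a 2D convolutional network that scores residue pairs between the two sequences.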
Fig. 2.
Scheme for SAdLSA deployment at scale on Summit.
Fig. 3.
Performance of SAdLSA on large sets of proteins and scaling across Summit. A: Runtime, in seconds, against the Pfam database for each sequence using one node. B: Runtime, in seconds, against the Pfam database for each sequence using one thousand nodes. Parallelization was over the Pfam database for each input sequence for the inference calculation.
Fig. 4.
Performance of SAdLSA on the PDB70 database on Summit. Top: Weak scaling: total time, in minutes, for a given number of sequences to be aligned to PDB70, using 12 sequences on 96 nodes and 120 sequences on 1,000 nodes. The set of 12 was chosen to have a length distribution similar to that of the set of 120. Bottom: Runtime, in seconds, against the PDB70 database for each sequence in the 120-sequence set used in Figure 3 above, using one thousand nodes. Parallelization was over the PDB70 database for each input sequence for the inference calculation.
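Weak scaling like that in the top panel is commonly summarized as a single efficiency number. The helper below is a generic sketch of that calculation (the figure's actual timings are not reproduced here); in both runs the work per node is held roughly constant, so ideal scaling would keep the total runtime flat:

```python
def weak_scaling_efficiency(t_base, t_scaled):
    """Weak-scaling efficiency: with the workload per node held
    roughly fixed as node count grows, ideal scaling keeps runtime
    constant, giving an efficiency of 1.0; values below 1.0 indicate
    parallel overhead."""
    return t_base / t_scaled
```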
Fig. 5.
Scheme of the model-parallelization method on a Summit node for domain boundary prediction.
Fig. 6.
Simplified scheme for annotation pipeline with consensus using our toolbox.
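The consensus step in the caption can be viewed as a reduction over per-method predictions for each protein. A minimal majority-vote sketch (hypothetical; the toolbox's actual consensus rule is not specified here):

```python
from collections import Counter

def consensus(annotations):
    """Return the annotation agreed on by a strict majority of the
    contributing methods, or None when no majority exists."""
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count * 2 > len(annotations) else None
```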

