Sci Rep. 2024 Aug 21;14(1):19372.
doi: 10.1038/s41598-024-69021-2.

Generative artificial intelligence performs rudimentary structural biology modeling

Alexander M Ille et al. Sci Rep. 2024.

Abstract

Natural language-based generative artificial intelligence (AI) has become increasingly prevalent in scientific research. Intriguingly, capabilities of generative pre-trained transformer (GPT) language models beyond the scope of natural language tasks have recently been identified. Here we explored how GPT-4 might be able to perform rudimentary structural biology modeling. We prompted GPT-4 to model 3D structures for the 20 standard amino acids and an α-helical polypeptide chain, with the latter incorporating Wolfram mathematical computation. We also used GPT-4 to perform structural interaction analysis between the anti-viral nirmatrelvir and its target, the SARS-CoV-2 main protease. Geometric parameters of the generated structures typically fell close to experimental reference values. However, modeling was sporadically error-prone and molecular complexity was not well tolerated. Interaction analysis further revealed the ability of GPT-4 to identify specific amino acid residues involved in ligand binding along with corresponding bond distances. Despite these limitations, our results show the capacity of natural language generative AI to perform basic structural biology modeling and interaction analysis with atomic-scale accuracy.

Keywords: Artificial intelligence; GPT; Language model; Machine learning; Protein modeling; Structural biology.

Conflict of interest statement

A.M.I. is a founder and partner of North Horizon, which is engaged in the development of artificial intelligence-based software. C.M. declares no competing interests. S.K.B. declares no competing interests. M.B.M. declares no competing interests. R.P. is a founder and equity shareholder of PhageNova Bio. R.P. is Chief Scientific Officer and a paid consultant of PhageNova Bio. R.P. is a founder and equity shareholder of MBrace Therapeutics. R.P. serves as a paid consultant for MBrace Therapeutics. R.P. has Sponsored Research Agreements (SRAs) in place with PhageNova Bio and with MBrace Therapeutics. These arrangements are managed in accordance with the established institutional conflict-of-interest policies of Rutgers, The State University of New Jersey. This study falls outside of the scope of these SRAs. W.A. is a founder and equity shareholder of PhageNova Bio. W.A. is a founder and equity shareholder of MBrace Therapeutics. W.A. serves as a paid consultant for MBrace Therapeutics. W.A. has Sponsored Research Agreements (SRAs) in place with PhageNova Bio and with MBrace Therapeutics. These arrangements are managed in accordance with the established institutional conflict-of-interest policies of Rutgers, The State University of New Jersey. This study falls outside of the scope of these SRAs.

Figures

Figure 1
Modeling the 3D structures of the 20 standard amino acids with GPT-4. (a) Procedure for structure modeling and analysis. (b) Exemplary 3D structures of each of the 20 amino acids modeled by GPT-4. (c) Cα stereochemistry of modeled amino acids including L and D configurations as well as nonconforming planar; n = 5 per amino acid excluding achiral glycine and one GPT-4 iteration of cysteine (see Methods). (d,e) Backbone bond lengths and angles of amino acids modeled by GPT-4 (blue) relative to experimentally determined reference values (red); n = 5 per amino acid, excluding one iteration of cysteine (see Methods). Corresponding values of amino acids modeled by GPT-3.5 are shown adjacent (grey); n = 5 per amino acid. Data shown as means ± SD. (f) Sidechain accuracy of modeled amino acid structures in terms of bond lengths (within 0.1 Å) and bond angles (within 10°) relative to experimentally determined reference values; n = 5 per amino acid. See Methods for experimentally determined references. (g,h) Distributions of sidechain bond length and angle variation relative to experimentally determined reference values for each amino acid generated by GPT-4, excluding glycine. Bars represent the mean bond length or angle variation for each of the five iterations per amino acid. One of the methionine iterations was excluded (see Methods).
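The tolerance check described for panel (f), bond lengths within 0.1 Å and bond angles within 10° of a reference, can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code. The coordinates are hypothetical, and the reference values (1.46 Å, 1.52 Å, 111.0°) are approximate, commonly quoted backbone values used here as placeholders rather than the reference set cited in the Methods.

```python
import math

def bond_length(a, b):
    """Euclidean distance between two atoms given as (x, y, z) in angstroms."""
    return math.dist(a, b)

def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

def within_tolerance(value, reference, tolerance):
    """True if value deviates from reference by at most tolerance."""
    return abs(value - reference) <= tolerance

# Hypothetical backbone coordinates (angstroms) for N, C-alpha, and C.
n = (0.000, 0.000, 0.000)
ca = (1.460, 0.000, 0.000)
c = (2.010, 1.420, 0.000)

# Placeholder reference values (assumed, not taken from the paper).
REF_N_CA = 1.46        # angstroms
REF_CA_C = 1.52        # angstroms
REF_N_CA_C = 111.0     # degrees

checks = {
    "N-CA length": within_tolerance(bond_length(n, ca), REF_N_CA, 0.1),
    "CA-C length": within_tolerance(bond_length(ca, c), REF_CA_C, 0.1),
    "N-CA-C angle": within_tolerance(bond_angle(n, ca, c), REF_N_CA_C, 10.0),
}
for name, ok in checks.items():
    print(f"{name}: {'within tolerance' if ok else 'outside tolerance'}")
```

The same three helpers extend to sidechain atoms by supplying the relevant coordinate triples and reference values.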
Figure 2
Modeling the 3D structure of an α-helical polypeptide structure with GPT-4. (a) Procedure for structure modeling and analysis. (b) Request made from GPT-4 to Wolfram and subsequent response from Wolfram to GPT-4 from an exemplary α-helix modeling iteration (also see Supplementary Table S2). (c) Exemplary 3D structure of a modeled α-helix (beige), an experimentally determined α-helix reference structure (PDB ID 1L64) (teal), and their alignment (RMSD = 0.147 Å). (d) Top-down view of modeled and experimental α-helices from panel c. (e) Accuracy of α-helix modeling as measured by number of attempts (including up to two refinements following the first attempt) required to generate a structure with RMSD < 0.5 Å relative to the experimentally determined reference structure; n = 5 rounds of 10 consecutive iterations (total n = 50 models). (f) Comparison of RMSDs between GPT-4 α-helix structures and the experimentally determined α-helix structure, the AlphaFold2 α-helix structure, the ChimeraX α-helix structure, and the PyMOL α-helix structure. Only structures with RMSD < 0.5 Å (dashed grey line) relative to each reference structure are included (88% included in reference to PDB ID 1L64; 90% to AlphaFold; 90% to ChimeraX; 88% to PyMOL). Data shown as means ± range.
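The RMSD < 0.5 Å criterion used in panels (e) and (f) can be illustrated with a short sketch. It assumes the two coordinate sets are already superposed and list equivalent atoms in the same order; the superposition step itself (done in the paper with structure-alignment software) is not reproduced here. The helix generator uses idealized textbook parameters for a Cα trace (about 2.3 Å radius, 1.5 Å rise, and 100° rotation per residue); these values and the function names are assumptions for illustration, not taken from the paper.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def helix_trace(n_residues, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized C-alpha trace of an alpha-helix along the z axis (assumed parameters)."""
    trace = []
    for i in range(n_residues):
        theta = math.radians(turn_deg * i)
        trace.append((radius * math.cos(theta), radius * math.sin(theta), rise * i))
    return trace

reference = helix_trace(10)
# Hypothetical "model": the reference displaced by 0.1 angstrom along x.
model = [(x + 0.1, y, z) for x, y, z in reference]

value = rmsd(model, reference)
passes_cutoff = value < 0.5  # cutoff stated in the figure caption
print(f"RMSD = {value:.3f} A, passes cutoff: {passes_cutoff}")
```

Because every atom in the hypothetical model is shifted by the same 0.1 Å, the RMSD equals that shift, which makes the function easy to sanity-check before applying it to real structures.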
Figure 3
Structural analysis of interaction between nirmatrelvir and the SARS-CoV-2 main protease. (a) Procedure for performing ligand interaction analysis. (b) Crystal structure of nirmatrelvir bound to the SARS-CoV-2 main protease (PDB ID: 7VH8) with bond-forming residues detected by GPT-4, and their bonds depicted with ChimeraX (inset). Distances between interacting atom pairs were 1.81 Å (Cys145 Sγ–C3), 2.68 Å (His163 Nε2–O1), 2.77 Å (Glu166 O–N4), 3.02 Å (His164 O–N1), as determined by GPT-4 and 1.814 Å (Cys145 Sγ–C3), 2.676 Å (His163 Nε2–O1), 2.767 Å (Glu166 O–N4), 2.851 Å (Glu166 N–O3), 3.019 Å (Glu166 Oε1–N2), 3.017 Å (His164 O–N1), as determined with ChimeraX. Note that distance values corresponding to the Glu166 N–O3 and Glu166 Oε1–N2 atom pair interactions were not provided by GPT-4.
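The comparison in panel (b) can be made explicit with a small tabulation. The sketch below uses only the distance values quoted in the caption; the dictionary layout and variable names are illustrative choices, not part of the original analysis.

```python
# Distances in angstroms, copied from the Figure 3 caption.
gpt4 = {
    "Cys145 Sγ–C3": 1.81,
    "His163 Nε2–O1": 2.68,
    "Glu166 O–N4": 2.77,
    "His164 O–N1": 3.02,
}
chimerax = {
    "Cys145 Sγ–C3": 1.814,
    "His163 Nε2–O1": 2.676,
    "Glu166 O–N4": 2.767,
    "Glu166 N–O3": 2.851,
    "Glu166 Oε1–N2": 3.019,
    "His164 O–N1": 3.017,
}

# Absolute difference for every atom pair reported by both tools.
differences = {
    pair: abs(gpt4[pair] - chimerax[pair]) for pair in gpt4 if pair in chimerax
}
# Atom pairs measured with ChimeraX that GPT-4 did not report.
missing_from_gpt4 = sorted(set(chimerax) - set(gpt4))

for pair, diff in differences.items():
    print(f"{pair}: |difference| = {diff:.3f} A")
print("Not reported by GPT-4:", ", ".join(missing_from_gpt4))
```

For the four shared atom pairs the values differ by less than 0.005 Å, which is consistent with rounding the ChimeraX values to two decimal places.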
