This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Nov 12:2023.11.09.566468.

doi: 10.1101/2023.11.09.566468.

Genetic Discovery Enabled by A Large Language Model

Tao Tu¹, Zhouqing Fang², Zhuanfen Cheng², Svetolik Spasic³, Anil Palepu¹, Konstantina M Stankovic³, Vivek Natarajan¹, Gary Peltz²

Affiliations

¹ Google Research, Mountain View, CA, USA.
² Department of Anesthesiology, Pain and Perioperative Medicine.
³ Department of Otolaryngology - Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA 94305, USA.

PMID: 37986848
PMCID: PMC10659415
DOI: 10.1101/2023.11.09.566468

Tao Tu et al. bioRxiv. 2023.

Abstract

Artificial intelligence (AI) has been used in many areas of medicine, and large language models (LLMs) have recently shown potential utility for clinical applications. However, it is not known whether LLMs can accelerate the pace of genetic discovery, so we used data generated from mouse genetic models to investigate this possibility. We examined whether a recently developed specialized LLM (Med-PaLM 2) could analyze sets of candidate genes generated from analysis of murine models of biomedical traits. In response to free-text input, Med-PaLM 2 correctly identified the murine genes that contained experimentally verified causative genetic factors for six biomedical traits, which included susceptibility to diabetes and cataracts. Med-PaLM 2 was also able to analyze a list of genes with high-impact alleles, which were identified by comparative analysis of murine genomic sequence data, and it identified a causative murine genetic factor for spontaneous hearing loss. Based upon this Med-PaLM 2 finding, a novel bigenic model for susceptibility to spontaneous hearing loss was developed. These results demonstrate that Med-PaLM 2 can analyze gene-phenotype relationships and generate novel hypotheses, which can facilitate genetic discovery.

Figures

**Figure 1 ∣. The Med-PaLM 2 pipeline for genetic discovery.**
**(A)** Overview of the genetic analysis pipeline. A set of candidate genes is identified through analysis of the results obtained from either a genome-wide association study (GWAS) or a genomic sequence comparison. Med-PaLM 2 evaluates the gene candidates (represented by their gene symbols) and generates a genetic hypothesis by identifying those with the strongest association with a queried phenotype. **(B)** Overview of how Med-PaLM 2 generated a bigenic model for spontaneous hearing loss in a mouse strain. The genomic sequence of a mouse strain (NOD/LtJ) that spontaneously develops hearing loss by 7 weeks of age was compared with that of 10 strains that maintain normal hearing during their lifetime. Fourteen genes with NOD/LtJ-specific high-impact alleles were identified by this analysis. Med-PaLM 2 identified one gene as the most likely to contain the causative genetic factor for hearing loss. However, NOD/LtJ mice have another genetic factor, shared among multiple inbred strains with early-onset hearing loss, that is necessary but not sufficient on its own to cause their severe hearing loss. Therefore, based upon the Med-PaLM 2 results, the genetic hypothesis developed was that two genetic factors (i.e., a bigenic model) jointly contribute to the hearing loss of NOD/LtJ mice.
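
The sequence-comparison step in panel (B) can be illustrated with a minimal Python sketch. It assumes a hypothetical input format (a mapping from gene symbol to the set of strains carrying a high-impact allele in that gene), and the gene and control-strain names are made-up placeholders, not data from the study. The function `strain_specific_genes` is illustrative only and is not the authors' pipeline.

```python
from typing import Dict, List, Set


def strain_specific_genes(
    carriers: Dict[str, Set[str]],
    target: str,
    controls: List[str],
) -> List[str]:
    """Return genes whose high-impact allele is carried by `target`
    and by none of the `controls` strains (sorted by gene symbol)."""
    control_set = set(controls)
    return sorted(
        gene
        for gene, strains in carriers.items()
        if target in strains and not (strains & control_set)
    )


# Toy input: placeholder gene symbols and control-strain names (assumptions).
carriers = {
    "GeneA": {"NOD/LtJ"},                  # target only: kept
    "GeneB": {"NOD/LtJ", "control_3"},     # shared with a control: dropped
    "GeneC": {"control_1"},                # absent from target: dropped
    "GeneD": {"NOD/LtJ", "other_strain"},  # shared only with a non-control: kept
}
controls = [f"control_{i}" for i in range(1, 11)]  # 10 normal-hearing strains

candidates = strain_specific_genes(carriers, "NOD/LtJ", controls)
print(candidates)
```

In this sketch the resulting list of gene symbols is what would then be passed, as free text, to the language model together with the queried phenotype.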

**Figure 2 ∣. NOD/LtJ mice have a severe hearing loss.**
Auditory brainstem responses (ABR) and distortion product otoacoustic emissions (DPOAE) were measured in 7-week-old CBA/J (n=10, red), NOD/LtJ (n=7, light gray) and C57BL/6J (n=14, dark gray) mice. Each bar represents the mean ± standard error of the mean, and each dot represents the sound pressure level (SPL) in decibels measured for one mouse. The ABR threshold levels for NOD/LtJ mice demonstrate that they have a profound hearing loss compared to CBA/J and C57BL/6J mice of the same age. The DPOAE thresholds in NOD/LtJ mice are substantially elevated across all frequencies tested in comparison to those of CBA/J and C57BL/6J mice. Interestingly, C57BL/6J mice have a hearing defect that is less severe than that of NOD/LtJ mice, and they show significantly higher DPOAE thresholds in the mid-to-high frequency range than CBA/J mice. The p-values for the CBA/J vs. NOD/LtJ comparisons are represented by asterisks: *p<0.05, **p<0.01, ***p<0.001, and ****p<0.0001. Similarly, the p-values for the C57BL/6J vs. NOD/LtJ comparisons are represented by #, and the p-values for the CBA/J vs. C57BL/6J comparisons are represented by +.

**Figure 3 ∣**
**(A)** NOD/LtJ mice have a 2 bp frameshift deletion in exon 4 of *Crym*, which is not present in 42 other strains. **(B)** The full-length Crym protein has 313 amino acids, but the NOD/LtJ frameshift deletion occurs within the codon for amino acid 220 and generates a premature termination codon at amino acid 230. **(C)** The Crym protein structure (PDB: 4BVA) is shown; the position of the NOD/LtJ-unique frameshift mutation that disrupts the COOH-terminal region of Crym is highlighted, along with the locations of the NH2- and COOH-terminal amino acids. **(D)** Crym protein is not present in NOD/LtJ mice. The proteins in lysates prepared from brain tissue obtained from 5-week-old male A/J, CBA/J, or NOD/LtJ mice were separated by SDS-PAGE and immunoblotted with mouse monoclonal anti-Crym or anti-tubulin antibodies. The blots were scanned after incubation with dye-labelled goat anti-mouse IgG. While the lysates have comparable amounts of tubulin, Crym is not present in the NOD/LtJ lysates.
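
How a 2 bp deletion shifts the reading frame and introduces an early stop codon, as described in panel (B), can be shown with a small Python sketch. The DNA sequence below is an invented toy example, not the *Crym* sequence, and the deletion position is chosen only for illustration. Translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy coding sequence (an assumption for illustration only).
wild_type = "ATGGCTGAACTGACCGGTAAATAA"

# Delete 2 bp inside the third codon; every downstream codon is now read
# in a shifted frame, which here exposes an earlier stop codon.
mutant = wild_type[:6] + wild_type[8:]

wt_protein = translate(wild_type)
mut_protein = translate(mutant)
print(wt_protein, len(wt_protein))
print(mut_protein, len(mut_protein))
```

In the toy example the mutant protein matches the wild type only up to the deletion site, then diverges and ends early, which mirrors the truncation mechanism described in the caption.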
