[Preprint]. 2023 Sep 18:rs.3.rs-3270331.
doi: 10.21203/rs.3.rs-3270331/v1.

Evaluation of large language models for discovery of gene set function

Mengzhou Hu et al. Res Sq.

Abstract

Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI's GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in 'omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.

Figures

Extended Data Fig. 1. Schematic of the citation module.
a, GPT-4 is asked to provide gene symbol keywords and functional keywords separately. b, Multiple gene keywords are combined with ‘OR’ and then with one function using ‘AND’. c, PubMed is searched for relevant titles and/or abstracts in the scientific literature. d, GPT-4 is queried to evaluate the abstract, saving supporting references. e, Prompts used for each query of the GPT-4 model.
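The query construction in panel b can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_pubmed_query`, the use of the PubMed `[Title/Abstract]` field tag, and the example keywords are all assumptions made for demonstration.

```python
def build_pubmed_query(gene_keywords, function_keyword, field="Title/Abstract"):
    """Join gene keywords with OR, then AND the group with one functional keyword.

    The field tag is an assumption; the caption only says that titles
    and/or abstracts are searched.
    """
    if not gene_keywords:
        raise ValueError("at least one gene keyword is required")
    genes = " OR ".join(f'"{gene}"[{field}]' for gene in gene_keywords)
    return f'({genes}) AND "{function_keyword}"[{field}]'


# Hypothetical keywords, chosen only to show the shape of the query string.
query = build_pubmed_query(["BRCA1", "RAD51"], "DNA repair")
print(query)
```

One query is built per functional keyword, so a gene set with several functional keywords would yield several such strings, each reusing the same OR-joined gene group.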
Extended Data Fig. 2. Distribution of GO term gene sizes.
a, Distribution of the number of genes per GO term among all terms in GO-BP (n = 12,214; y axis). b, Distribution of the number of genes per GO term among the 1,000 selected GO terms (y axis).
Extended Data Fig. 3. Distribution of ‘omics gene set sizes.
Number of genes per gene set among all ‘omics gene sets considered in this study (n = 100).
Fig. 1. Use and evaluation of GPT-4 for functional analysis of gene sets.
a, The prompt provided to GPT-4 for its input fields “System” and “User.” The specific list of genes is inserted into the {gene_set} field at the end of the prompt template. b, Generation of a proposed name for a gene set, with an accompanying analysis, using GPT-4. c, Benchmarking against GO gene sets: Calculation of semantic similarity between the GPT-4 proposed name and the name assigned by the GO curators (Left yellow star). Calibration of this similarity against the distribution of semantic similarities between the GPT-4 name and every term name in GO-BP. d, Benchmarking against ‘omics gene sets (MSigDB hallmark gene sets, NeST cancer systems): Computation of semantic similarity between the GPT-4 proposed name and the name curated by human experts (Right yellow star). Additional comparison to the name assigned by classical gene set enrichment analysis (GSEA).
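The calibration step in panel c can be read as a percentile rank: the share of GO-BP term names that are less similar to the GPT-4 name than the curator-assigned name is. The sketch below assumes the similarity scores have already been computed by some semantic similarity method (not shown here); the function name and the toy scores are hypothetical.

```python
def similarity_percentile(target_similarity, background_similarities):
    """Percentage of background scores strictly lower than the target score."""
    if not background_similarities:
        raise ValueError("background_similarities must not be empty")
    lower = sum(1 for score in background_similarities if score < target_similarity)
    return 100.0 * lower / len(background_similarities)


# Made-up similarity scores between one GPT-4 name and ten GO term names.
background = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.90, 0.30, 0.20, 0.60]

# Suppose the assigned GO name scored 0.85 against the GPT-4 name.
pct = similarity_percentile(0.85, background)
print(pct)  # 80.0: eight of the ten background scores are lower
```

Using a strict "lower than" comparison means the assigned name does not count itself; ties would be handled differently under a "lower than or equal" convention, which the caption does not specify.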
Fig. 2. Evaluation of GPT-4 in recovery of GO term names.
a, The raw semantic similarity between the GPT-4 name and the assigned GO name (blue dashed line, x-axis) is converted to the percentage of all names in the GO database with lower similarity to the GPT-4 name (blue dashed line, y-axis). Plot shown is for the GPT-4 name “DNA Repair and Chromosome Segregation”, generated for the GO term “Regulation of Sister Chromatid Cohesion.” b, Cumulative number of GO term names recovered by GPT-4 (y-axis) at a given similarity percentile (x-axis). 0 = least similar, 100 = most similar. Blue curve: semantic similarities between GPT-4 names and assigned GO term names. Grey dashed curve: semantic similarities between GPT-4 names and random GO term names. The red dotted line marks that half of the 1000 sampled GO names are recovered by GPT-4 at a similarity percentile of 98%. c, Hierarchical view of the GO term “Triglyceride Catabolic Process” and its ancestors. Blue box: gene set query, yellow box: gene set of best match GO name (most similar GO name to GPT-4 name), dashed lines with arrows: semantic similarities between names, red text: GPT-4 proposed name. d, Venn diagram showing overlap between the gene set query (blue, LEFT) and gene set of best match GO name (yellow, RIGHT). The false discovery rate (q-value) is obtained from the hypergeometric test between the two gene sets. e, Significance (q-value by hypergeometric test) of overlap between the gene set with the best match GO name and the gene set query (y-axis). Red dashed line: significance cutoff at q = 0.1 (Benjamini-Hochberg correction). Black horizontal line within violin plot shows median q-value (Best match GO: 1.40×10⁻⁸, Random GO: 1.0). **** p = 1.18×10⁻²⁷⁹ by Mann–Whitney U test.
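The overlap statistics in panels d and e can be illustrated with a small standard-library sketch: a one-sided hypergeometric tail probability for the overlap of two gene sets, followed by Benjamini-Hochberg adjustment across several tests. The function names, the choice of gene universe size, and the toy numbers are assumptions; the caption does not state the background population the authors used.

```python
from math import comb


def hypergeom_overlap_pvalue(population, size_a, size_b, overlap):
    """P(overlap >= observed) when size_b genes are drawn from `population`
    genes, of which size_a belong to the other set."""
    max_overlap = min(size_a, size_b)
    total = comb(population, size_b)
    tail = sum(
        comb(size_a, k) * comb(population - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy example: universe of 10 genes, sets of size 4 and 3 sharing 2 genes.
p = hypergeom_overlap_pvalue(population=10, size_a=4, size_b=3, overlap=2)

# Hypothetical p-values from four separate overlap tests.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p, q)
```

A q-value is only meaningful across a family of tests, so in practice the adjustment would be applied to the p-values of all gene set comparisons together rather than to a single overlap.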
Fig. 3. Evaluation of GPT-4 for analysis of gene sets discovered in ‘omics data.
a, Violin plots showing the distribution of semantic similarity scores between GPT-4 names and expert-determined names (LEFT) or between GSEA and expert-determined names (RIGHT) for gene sets discovered in ‘omics studies (points, n = 100). Blue: Gene sets based on expression clusters (MSigDB). Red: Gene sets based on protein interaction clusters (NeST). Horizontal lines denote median ± upper and lower quartiles. b, Distribution of the percentage of genes informed by GPT-4 analysis (LEFT) or GSEA analysis (RIGHT). Plot elements are the same as in panel a. For both panels, p-values were determined by Wilcoxon nonparametric two-sample comparison. * p = 10⁻³; ** p = 10⁻⁷.
