Nat Methods. 2025 Jan;22(1):82-91.
doi: 10.1038/s41592-024-02525-x. Epub 2024 Nov 28.

Evaluation of large language models for discovery of gene set function

Mengzhou Hu et al. Nat Methods. 2025 Jan.

Abstract

Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants.

Conflict of interest statement

Competing interests: T.I. is a cofounder and member of the advisory board and has an equity interest in Data4Cure and Serinus Biosciences. T.I. is a consultant for and has an equity interest in Ideaya Biosciences. The terms of these arrangements have been reviewed and approved by the University of California San Diego in accordance with its conflict-of-interest policies. The other authors declare no competing interests.

Figures

Extended Data Fig. 1 ∣ Schematic of the citation module.
a, GPT-4 is asked to provide gene symbol keywords and functional keywords separately. Multiple gene keywords and functions are combined and used to search PubMed for relevant paper titles and abstracts in the scientific literature. GPT-4 is queried to evaluate each abstract, saving supporting references. b, Prompts used to query the GPT-4 model.
Extended Data Fig. 2 ∣ Distribution of GO term gene sizes.
a, Distribution of term size (number of genes) for terms in the Biological Process branch (GO-BP). Terms with 3-100 genes shown (n = 8,910). b, Distribution of term size for the 1000 GO terms used in Task 1.
Extended Data Fig. 3 ∣ Evaluation of GPT-4 in recovery of GO-CC and GO-MF names.
a, Cumulative number of GO-CC term names recovered by GPT-4 (y-axis) at a given similarity percentile (x-axis). 0 = least similar, 100 = most similar. Blue curve: semantic similarities between GPT-4 names and assigned GO-CC term names. Grey dashed curve: semantic similarities between GPT-4 names and random GO-CC term names. The red dotted line marks that 642 of the 1000 sampled GO-CC names are recovered by GPT-4 at a similarity percentile of 95%. b, As for panel a, but for GO-MF terms rather than GO-CC. The red dotted line marks that 757 of the 1000 sampled GO-MF names are recovered by GPT-4 at a similarity percentile of 95%.
Extended Data Fig. 4 ∣ Supplemental analysis of the confidence score.
a, Distribution of confidence scores (n = 300) assigned by GPT-4 with confidence level threshold set based on the distribution pattern. “High confidence” (red): 0.87–1.00; “Medium confidence” (blue): 0.82–0.86; “Low confidence” (dark orange): 0.01–0.81; “Name not assigned” (gray): 0. b, Scatter plot of naming accuracy versus GPT-4 self-assessed confidence score for real gene sets drawn from GO (points, n = 100). Accuracy is estimated by the semantic similarity between the GPT-4 proposed name and the real GO term name. The best-fit regression line is shown in dark gray. The correlation coefficient (R) is determined by a two-sided Pearson’s correlation with p-value shown.
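The confidence thresholds in this caption can be written as a small lookup. The following is a minimal sketch, not the authors' code: the function name `confidence_bin` is hypothetical, and it assumes scores are reported to two decimal places, so values falling between the stated ranges (for example 0.865) are assigned to the lower bin.

```python
def confidence_bin(score: float) -> str:
    """Map a self-assessed confidence score in [0, 1] to the caption's bins.

    Assumption: scores are given to two decimals, so the gaps between the
    published ranges (e.g. 0.86 to 0.87) do not occur in practice.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence score must lie in [0, 1]")
    if score == 0:
        return "Name not assigned"   # caption: score of exactly 0
    if score >= 0.87:
        return "High confidence"     # caption: 0.87-1.00
    if score >= 0.82:
        return "Medium confidence"   # caption: 0.82-0.86
    return "Low confidence"          # caption: 0.01-0.81


# Illustrative scores only; these are not values from the study.
labels = [confidence_bin(s) for s in (0.0, 0.45, 0.85, 0.92)]
```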
Extended Data Fig. 5 ∣ Distribution of ‘omics gene set sizes.
Distribution shown for all ‘omics gene sets considered in this study (n = 300).
Fig. 1 ∣ Use and evaluation of LLMs for functional analysis of gene sets.
a, The LLM prompt (left boxes) includes system content, detailed chain of thought instructions, and an example gene set query with desired response (full prompt given in Extended Data Table 1). The specific list of genes is inserted into the ‘user input of genes/proteins’ field at the end of the prompt template, resulting in generation of a proposed name, a supporting analysis essay and a confidence score (right flowchart). b, Benchmarking LLM names against names assigned by GO (evaluation task 1). The proposed name from each of five LLMs (left robot icons) is compared with the name assigned by the GO curators (handshake icon). GPT-4 (crowned) was the winning model for this task. c, Exploration of gene sets discovered in omics data (evaluation task 2). The GPT-4 name and analysis are scored for novelty and accuracy (right green check marks). Gene sets derived from three different data types (left database icons).
Fig. 2 ∣ Evaluation of LLMs in recovering GO gene set names.
a, The performance of each LLM (colors) scored by semantic similarity between its proposed name for a gene set and the name assigned by GO curators. Results for 100 GO terms are shown (dots; the horizontal black lines show median semantic similarities). Significant difference in distributions is determined using a two-sided Mann–Whitney U test. b, The percentile calibration of semantic similarity between the GO and GPT-4 names for a gene set, shown for the GO term ‘response to X-ray’ and the corresponding GPT-4 name ‘DNA damage response and repair’. The plot shows the semantic similarity between these two names (vertical dark-green line, 0.54) versus the complete distribution of semantic similarity scores between the GPT-4 name and each name in the GO biological process database (GO-BP, gray). The GPT-4 name score is converted to a percentile, that is, the percentage of all names in GO with lower similarity (here, 99%). The dashed red line denotes the 95th percentile threshold. c, The cumulative number of GO term names recovered by GPT-4 (y axis) at a given similarity percentile (x axis). 0, least similar; 100, most similar. The dark-green curve shows the semantic similarities between GPT-4 names and assigned GO term names. The dashed gray curve shows the semantic similarities between GPT-4 names and random GO term names. The dotted red line marks the number of GO names recovered by GPT-4 at the 95th similarity percentile. d, A pie chart summarizing the results of the GPT-4 name/GO name similarity comparison. e, A hierarchical view of the GO term ‘negative regulation of triglyceride catabolic process’ and its ancestors. Blue box: gene set query; yellow box: gene set of best match GO name (most similar GO name to GPT-4 name); dashed lines with arrows: semantic similarities between names; red text: GPT-4 proposed name.
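The percentile calibration in panel b can be illustrated as follows. This is a hedged sketch rather than the published implementation: the function names are hypothetical, the background scores are made-up toy values, and treating a name as recovered when its percentile is at or above 95 (rather than strictly above) is an assumption.

```python
from typing import Sequence


def similarity_percentile(score: float, background: Sequence[float]) -> float:
    """Percentage of background similarity scores strictly lower than `score`.

    `background` stands for the similarities between the proposed name and
    every name in the reference ontology (GO-BP in the caption).
    """
    if not background:
        raise ValueError("background distribution must not be empty")
    lower = sum(1 for s in background if s < score)
    return 100.0 * lower / len(background)


def is_recovered(score: float, background: Sequence[float],
                 threshold: float = 95.0) -> bool:
    """Assumed decision rule: percentile at or above the 95th-percentile line."""
    return similarity_percentile(score, background) >= threshold


# Toy background of four scores (not data from the study).
toy_background = [0.1, 0.2, 0.3, 0.6]
toy_percentile = similarity_percentile(0.54, toy_background)  # 3 of 4 are lower
```

With a real ontology the background would hold one score per GO term name, so the percentile resolution is far finer than in this four-value toy case.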
Fig. 3 ∣ Evaluation of LLM self-confidence.
a, Investigation of model-assigned confidence scores (chat bubbles) for the ability to distinguish real GO terms from 50/50 mix and random gene sets (light DNA strands from the same GO term, dark DNA strands randomly selected from outside the GO term). b, Bar graphs showing the confidence rating assigned by each model for real, contaminated or random gene sets. Increasing shades of purple indicate low to high score bins. ‘High confidence’ (dark purple): 0.87–1.00; ‘medium confidence’ (medium purple): 0.82–0.86; ‘low confidence’ (light purple): 0.01–0.81; ‘name not assigned’ (gray): 0. For comparison with functional enrichment (rightmost group of bars), ‘high confidence’ for a gene set is defined as BH-adjusted P ≤ 0.05 (dark purple, g:Profiler with Benjamini–Hochberg correction), otherwise ‘name not assigned’ (gray) is used. A significant difference in confidence distributions between real, 50/50 mix and random is determined using a two-sided chi-squared test.
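For the functional enrichment comparison in panel b, the rule "BH-adjusted P ≤ 0.05" can be sketched with the standard Benjamini–Hochberg step-up adjustment. This is a generic textbook version, not g:Profiler's own code; the function names and the example P values are hypothetical.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment_label(adjusted_p: float, alpha: float = 0.05) -> str:
    """Caption's rule: significant terms count as 'high confidence'."""
    return "High confidence" if adjusted_p <= alpha else "Name not assigned"


# Made-up P values for four candidate terms (illustration only).
example_adjusted = bh_adjust([0.01, 0.04, 0.03, 0.5])
example_labels = [enrichment_label(p) for p in example_adjusted]
```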
Fig. 4 ∣ Evaluation of GPT-4 in naming ‘omics gene clusters.
a, The number of omics gene clusters (y axis, log10 scale) named by GPT-4 (dark green) or by GO enrichment analysis using g:Profiler (black; BH-adjusted P ≤ 0.05) versus the gene cluster specificity threshold measured by the Jaccard index (x axis; Methods). The vertical dashed red lines mark the same specificity thresholds shown in Extended Data Table 4. b, The number of cluster genes overlapping the genes associated with g:Profiler enriched GO term (y axis) is plotted against the number of genes in support of the GPT-4 name (x axis). The red points represent GPT-4 names highly similar to a significant g:Profiler name (semantic similarity ≥0.5); otherwise, navy color is used. The dotted black diagonal denotes equal specificity for the GPT-4 and g:Profiler names. c, Alternate names for cluster NeST:2-105 are shown (rows), with yellow boxes indicating which names support each of the cluster genes (columns). The GPT-4 name is shown first in bold (top), while the remaining rows highlight two of the significant g:Profiler results: the GO term with the best P value of enrichment (middle) and the term most conceptually similar to the GPT-4 name (bottom).
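The Jaccard index used as the specificity measure in panel a is the size of the intersection of two sets divided by the size of their union. The sketch below assumes it is computed between a cluster's genes and the genes associated with a candidate name; the exact pairing is defined in the paper's Methods and is not reproduced here. The gene symbols are placeholders.

```python
from typing import Iterable


def jaccard_index(a: Iterable[str], b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B| for two collections of gene symbols."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


# Placeholder symbols, not genes from the study.
cluster_genes = {"GENE1", "GENE2", "GENE3"}
term_genes = {"GENE2", "GENE3", "GENE4"}
specificity = jaccard_index(cluster_genes, term_genes)  # 2 shared / 4 total
```

A broad term that annotates thousands of genes has a large union with any small cluster, so its Jaccard index is low even when it covers the whole cluster.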
Fig. 5 ∣ Representative analysis for protein interaction clusters (NeST:2-105).
Input gene set, 16 genes (top left pink box); GPT-4 generated cluster name (top right green box); GPT-4 confidence score (middle right green box); GPT-4 analysis text (bottom green box). Each generated paragraph is followed by the associated citations found by the citation module (Extended Data Fig. 1 and Methods).
