Nat Hum Behav. 2025 Feb;9(2):305-315. doi: 10.1038/s41562-024-02046-9. Epub 2024 Nov 27.

Large language models surpass human experts in predicting neuroscience results


Xiaoliang Luo et al. Nat Hum Behav. 2025 Feb.

Abstract

Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. Here, to evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs indicated high confidence in their predictions, their responses were more likely to be correct, which presages a future where LLMs assist humans in making discoveries. Our approach is not neuroscience specific and is transferable to other knowledge-intensive endeavours.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Backward-looking and forward-looking evaluations.
a, Backward-looking benchmarks involve recalling factual information. For example, a student retrieves a fact about the Gettysburg Address that they learned during a history class. Existing benchmarks in scientific domains are in essence backward-looking as they emphasize retrieving accepted facts for question answering and reasoning tasks. b, Forward-looking benchmarks involve predicting novel outcomes on the basis of past data. Two forms of uncertainty, aleatoric (due to intrinsic randomness) and epistemic (due to lack of knowledge), may be present. For example, a table tennis fan predicts which player will win the next set on the basis of their knowledge of the players, how they have played so far today and so forth. Inherent random factors, such as a breeze affecting the ball’s flight, will also be present.
Fig. 2
Fig. 2. BrainBench is a forward-looking benchmark for neuroscience.
BrainBench evaluates test-takers' ability to predict neuroscience results. BrainBench’s test cases were sourced from recent Journal of Neuroscience abstracts across five neuroscience domains: behavioural/cognitive, systems/circuits, neurobiology of disease, cellular/molecular and developmental/plasticity/repair. Test-takers chose between the original abstract and one altered to substantially change the result while maintaining coherency. Human experts and LLMs were tasked with selecting the correct (that is, original) version from the two options. Human experts made choices and provided confidence and expertise ratings in an online study. LLMs were scored as choosing the abstract with the lower perplexity (that is, the text passage that was less surprising to the model), and their confidence was proportional to the difference in perplexity between the two options.
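The perplexity-based scoring rule described in this legend can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function and variable names are hypothetical, and the per-token log probabilities stand in for the output of any causal language model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean per-token log-likelihood)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def score_test_case(logprobs_original, logprobs_altered):
    """Choose the abstract the model finds less surprising (lower perplexity).

    Confidence is taken to be proportional to the perplexity difference
    between the two options, per the scoring rule in the caption.
    """
    ppl_orig = perplexity(logprobs_original)
    ppl_alt = perplexity(logprobs_altered)
    choice = "original" if ppl_orig < ppl_alt else "altered"
    confidence = abs(ppl_orig - ppl_alt)
    return choice, confidence

# Toy per-token log-probs: the original abstract is less surprising here.
choice, conf = score_test_case([-1.0, -1.2, -0.8], [-2.5, -3.0, -2.8])
```

Because the choice depends only on which perplexity is lower, the rule needs no threshold tuning; the confidence score is what enables the calibration analysis shown later.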
Fig. 3
Fig. 3. Performance of human experts and LLMs on BrainBench.
a, LLMs outperformed human experts on BrainBench (t(14) = 25.8, P < 0.001, Cohen’s d = 9.27, 95% CI 0.17–0.2; two-sided). Smaller models are on par with larger models. Base versions of models outperformed chat and instruct versions (t(5) = 5.38, P = 0.002, Cohen’s d = 0.77, 95% CI 0.02–0.04; two-sided), which were tuned to be conversational with humans. The error bars represent the standard error of the accuracy. Each model was evaluated on 200 BrainBench test cases. In total, 171 human experts were evaluated on the same test cases over 1,011 trials. b, The distribution of test cases across neuroscience subfields roughly mirrors the distribution of articles in the Journal of Neuroscience, with behavioural/cognitive overrepresented. The average performance of 15 LLMs and human experts is shown. LLMs outperformed human experts in every subfield (see Supplementary Fig. 5 for the full results). c, The participants were predoctoral students (ntrial = 104), doctoral students (ntrial = 300), postdoctoral researchers (ntrial = 255), faculty/academic staff (ntrial = 256), research scientists (ntrial = 72) and others (ntrial = 24). The error bars represent the standard error of the accuracy.
Fig. 4
Fig. 4. Accuracy and confidence are calibrated for human experts and LLMs.
When human experts and LLMs are confident in their BrainBench judgements, they are more likely to be correct. Confidence ratings were sorted and placed in equally sized bins with the mean accuracy for items in that bin plotted. The positive slope of the black regression lines for human experts and all LLMs indicates that confidence is well calibrated (that is, higher confidence corresponds to higher accuracy). Calibration is beneficial for building human–machine ensembles.
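The binning procedure in this legend (sort trials by confidence, split into equal-sized bins, compute mean accuracy per bin) can be sketched in a few lines. The function below is an illustrative reimplementation with hypothetical names, not the authors' analysis code.

```python
def calibration_curve(confidences, correct, n_bins=4):
    """Sort trials by confidence, split into equal-sized bins,
    and return the mean accuracy per bin (low to high confidence)."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    bin_size = len(order) // n_bins
    accuracies = []
    for b in range(n_bins):
        idx = order[b * bin_size:(b + 1) * bin_size]
        accuracies.append(sum(correct[i] for i in idx) / len(idx))
    return accuracies

# Toy data in which higher confidence tracks higher accuracy:
conf = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
hit  = [0,   0,   1,   0,   1,   1,   1,   1]
accs = calibration_curve(conf, hit, n_bins=4)
```

A roughly increasing sequence of per-bin accuracies, as in the regression lines of the figure, is what "well calibrated" means here.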
Fig. 5
Fig. 5. Fine-tuning a pre-trained LLM on neuroscience knowledge.
Mistral-7B-v0.1 was fine-tuned using LoRA on neuroscience articles from 2002 to 2022 (a total of 1.3 billion tokens). a, The fine-tuned model improved by 3% on BrainBench. b, The fine-tuning process substantially shifted the perplexity distribution of correct responses, indicative of the LLM specializing in neuroscience.
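The core idea of LoRA fine-tuning mentioned here, freezing the pre-trained weight and learning only a low-rank additive update, can be illustrated with toy dimensions. This is a conceptual sketch of the technique, not the BrainGPT training setup; the sizes, scaling factor and initialization follow the standard LoRA recipe, not values reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2          # hidden size and LoRA rank (toy values)

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init
alpha = 16                           # LoRA scaling hyperparameter

def lora_forward(x):
    """Forward pass: frozen weight plus scaled low-rank update B @ A."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(1, d))
# With B zero-initialized, LoRA starts as an exact no-op on the base model.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only A and B train (2·d·r = 32 parameters here) while the d² = 64 base weights stay frozen, which is why the method scales to fine-tuning a 7B-parameter model on 1.3 billion tokens of domain text.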
