Nature. 2025 Aug;644(8078):1002-1009.
doi: 10.1038/s41586-025-09215-4. Epub 2025 Jul 2.

A foundation model to predict and capture human cognition


Marcel Binz et al. Nature. 2025 Aug.

Abstract

Establishing a unified theory of cognition has been an important goal in psychology1,2. A first step towards such a theory is to create a computational model that can predict human behaviour in a wide range of settings. Here we introduce Centaur, a computational model that can predict and simulate human behaviour in any experiment expressible in natural language. We derived Centaur by fine-tuning a state-of-the-art language model on a large-scale dataset called Psych-101. Psych-101 has an unprecedented scale, covering trial-by-trial data from more than 60,000 participants performing in excess of 10,000,000 choices in 160 experiments. Centaur not only captures the behaviour of held-out participants better than existing cognitive models, but it also generalizes to previously unseen cover stories, structural task modifications and entirely new domains. Furthermore, the model's internal representations become more aligned with human neural activity after fine-tuning. Taken together, our results demonstrate that it is possible to discover computational models that capture human behaviour across a wide range of domains. We believe that such models provide tremendous potential for guiding the development of cognitive theories, and we present a case study to demonstrate this.


Conflict of interest statement

Competing interests: F.J.T. consults for Immunai, CytoReason, Cellarity, BioTuring and Genbio.AI, and has an ownership interest in Dermagnostix and Cellarity. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of Psych-101 and Centaur.
a, Psych-101 comprises trial-by-trial data from 160 psychological experiments with 60,092 participants making 10,681,650 choices in total and involving 253,597,411 text tokens. It contains domains such as multi-armed bandits, decision-making, memory, supervised learning, Markov decision processes and others (the examples shown have been stylized and abbreviated for readability). b, Centaur is a foundation model of human cognition that is obtained by adding low-rank adapters to a state-of-the-art language model and fine-tuning it on Psych-101.
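Fine-tuning with low-rank adapters (LoRA) leaves the pretrained weights frozen and trains only a small low-rank update to each adapted weight matrix. A minimal numpy sketch of the idea follows; the dimensions, rank, scaling and function name are illustrative choices of ours, not Centaur's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and adapter rank (illustrative values)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # trainable, initialised to zero

def adapted_forward(x, alpha=16.0):
    """Forward pass with a LoRA update: h = x W^T + (alpha / r) x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((3, d))
# With B initialised to zero the adapter is inactive, so the adapted
# model reproduces the frozen base model exactly at the start of training.
assert np.allclose(adapted_forward(x), x @ W.T)
```

During fine-tuning only A and B receive gradients, which is what makes the approach cheap enough to apply to a large language model.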
Fig. 2
Fig. 2. Goodness-of-fit on Psych-101.
a, Difference in log-likelihood of Centaur and Llama relative to a domain-specific cognitive model for each experiment. A value of zero corresponds to the goodness-of-fit of the domain-specific cognitive model and a value above zero indicates improved goodness-of-fit to human responses. Log-likelihoods are averaged over responses (n = 992,867). Error bars correspond to the standard error of the mean. Centaur outperforms both Llama and a collection of domain-specific cognitive models in almost every experiment (one-sided t-tests: t(1,985,732) = −144.22, P ≤ 0.0001; t(1,985,732) = −127.58, P ≤ 0.0001, respectively). We only included experiments for which we have implemented a domain-specific cognitive model in this graphic and merged different studies using the same paradigm. Extended Data Table 1 contains numerical results for all experiments. b, Model simulations on the horizon task. The plot shows the probability densities over reward and an information bonus parameter for both people and simulated runs of Centaur. c, Model simulations on the two-step task. The plot shows the probability densities over reward and a parameter indicating the degree of model-based learning for both people and simulated runs of Centaur. d, Model simulations on a social prediction game. The plot shows the probability densities over accuracies of predicting human strategies and strategies of an artificial agent, with matched statistics, for both people and simulated runs of Centaur.
Fig. 3
Fig. 3. Evaluation in different held-out settings.
a, Negative log-likelihoods averaged over responses (n = 9,702) for the two-step task with a modified cover story. b, Negative log-likelihoods averaged over responses (n = 510,154) for a three-armed bandit experiment. c, Negative log-likelihoods averaged over responses (n = 99,204) for an experiment probing logical reasoning with items based on the Law School Admission Test (LSAT). Centaur outperforms both Llama and domain-specific cognitive models when faced with modified cover stories, problem structures and entirely new domains. N/A, not applicable. Error bars show the s.e.m. The image in a is reproduced from ref. , Springer Nature Limited. The image in c is reproduced from Wikipedia.org.
Fig. 4
Fig. 4. Human alignment.
a, Multidimensional scaling embedding of the ten behavioural metrics in CogBench for different models. b, Pearson correlation coefficients indicating how well human neural activity in the two-step task can be decoded using Centaur’s internal representations extracted from different layers. c, Pearson correlation coefficients indicating how well human neural activity in a sentence-reading task can be decoded using Centaur’s internal representations extracted from different layers. Control refers to a model that used representations extracted from a randomly initialized transformer model with matched architecture.
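The caption does not spell out the decoding model used in panels b and c, so assume a standard linear encoding analysis: regress the neural signal onto a layer's activations and score the Pearson correlation between predicted and held-out activity. A numpy sketch under that assumption, on synthetic data (`ridge_decode` is our name):

```python
import numpy as np

def ridge_decode(X_train, y_train, X_test, y_test, lam=1.0):
    """Closed-form ridge regression from layer activations X to a neural
    signal y; returns the Pearson correlation between predicted and
    observed held-out activity (an alignment score per layer)."""
    d = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d),
                        X_train.T @ y_train)
    return np.corrcoef(X_test @ w, y_test)[0, 1]

# Synthetic example: activity is a noisy linear readout of 10 features.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)
score = ridge_decode(X[:150], y[:150], X[150:], y[150:])
assert score > 0.9  # near-perfect decoding when the signal is linear
```

Repeating this per layer yields the layer-wise correlation profiles shown in the figure; the randomly initialized control model provides the baseline.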
Fig. 5
Fig. 5. Model-guided scientific discovery.
a, We used Psych-101 and Centaur to guide the development of a cognitive model for a multi-attribute decision-making study. Each panel shows the AIC for the set of models considered at the given stage, starting with the models considered in the original study. b, We asked DeepSeek-R1 to generate an explanation for the human responses and formalized the resulting verbal strategy into a formal computational model. c, We refined this model through scientific regret minimization using Centaur as a reference model. Six data points are shown for which Centaur makes accurate predictions but the DeepSeek-R1-discovered model does not. We then used this information to design a domain-specific cognitive model that is as predictive as Centaur but is still interpretable. The bicycle images in a are reproduced from Flaticon.com.
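The model comparison in panel a ranks candidate models by the Akaike information criterion, AIC = 2k − 2 ln L, which trades goodness-of-fit against the number of free parameters. A minimal sketch of the criterion:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better
    trade-off between fit (log-likelihood) and model complexity."""
    return 2 * n_params - 2 * log_likelihood

# With equal fit, the model with fewer free parameters is preferred.
assert aic(-100.0, 3) < aic(-100.0, 5)
```

This is why the refined domain-specific model in c can match Centaur's predictive accuracy while remaining interpretable: it achieves a comparable likelihood with far fewer effective parameters.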
Extended Data Fig. 1
Extended Data Fig. 1. Psych-101.
a, Proportion of domains included in Psych-101. b, Word cloud of experimental paradigms included in Psych-101. c, We performed a data contamination analysis using the LogProber method for every experimental paradigm in Psych-101. LogProber fits a two-parameter exponential model to the cumulative log-likelihood of each sequence being checked for contamination. High acceleration (log B) suggests that a prompt is memorized from the pretraining data. Following the results presented in the original work, we set a threshold for possible contamination to log B ≥ 1. This analysis indicated no evidence of contamination. d, Two-dimensional embedding of the experiments used in this paper. To obtain this embedding, we took the corresponding natural language prompts up to the point of the first human choice, extracted a vector-based representation for them using ModernBERT, and finally projected these representations onto two dimensions using multidimensional scaling. Purple dots correspond to experiments from Psych-101, whereas the colored dots correspond to the indicated evaluation experiment.
Extended Data Fig. 2
Extended Data Fig. 2. Negative log-likelihoods of Centaur and alternative Llama variants on Psych-101.
To rule out the hypothesis that finetuning on any data aligns a model with human behavior, we compared Centaur to various Llama variants finetuned for other purposes (i.e. non-cognitive tasks). Nemotron is finetuned for instruction-following. Hermes is finetuned for various purposes, including agentic capabilities, roleplaying, reasoning, multi-turn conversation, and long context coherence. Reflection is finetuned for reasoning. None of the Llama variants captures human behavior better than the base model, ruling out the hypothesis that finetuning generally leads to models that are better at predicting human behavior. Error bars correspond to the standard error of the mean, taken over responses.
Extended Data Fig. 3
Extended Data Fig. 3. Noise ceiling analysis.
We conducted a noise ceiling analysis to better understand the capabilities of Centaur. It is not straightforward to estimate the noise ceiling for experiments with sequential dependencies, which includes the majority of Psych-101. Hence, we focused on two experiments for which such an analysis is possible: a, the choices13k data set and b, an intertemporal choice experiment. In both cases, we found that Centaur substantially exceeds the estimated noise ceiling. This is possible because Centaur can pick up on context-dependent patterns that are not captured by standard noise ceiling analyses. Therefore, we have performed an additional analysis testing how well Centaur can predict human responses if we prompt it to predict each response independently. We use the suffix “ind.” to indicate this way of prompting the model. Centaur still matches the performance of domain-specific cognitive models when context-independent prompts are used, amounting to roughly half of the estimated noise ceiling.
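The caption does not give the exact estimator, but for context-independent choice problems a common construction scores each response under the empirical choice frequencies of its problem, which is the best any model that sees only the problem identity can do. A sketch under that assumption (toy data; the function name is ours):

```python
import numpy as np

def noise_ceiling_loglik(choices_by_problem):
    """Noise-ceiling average log-likelihood per response: predict each
    problem's responses with that problem's empirical choice frequency."""
    total, n = 0.0, 0
    for choices in choices_by_problem:
        choices = np.asarray(choices, dtype=float)
        p = np.clip(choices.mean(), 1e-6, 1 - 1e-6)  # P(option 1)
        total += np.sum(choices * np.log(p) + (1 - choices) * np.log(1 - p))
        n += choices.size
    return total / n

# Two toy problems, binary choices from four participants each.
ceiling = noise_ceiling_loglik([[1, 1, 1, 0], [0, 0, 1, 0]])
assert ceiling < 0  # log-likelihoods of probabilistic choices are negative
```

As the text notes, a context-sensitive model such as Centaur can exceed a ceiling of this kind, because the ceiling ignores dependencies on surrounding trials.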
Extended Data Fig. 4
Extended Data Fig. 4. Further out-of-distribution evaluations.
Each subplot shows negative log-likelihoods for a different experiment. None of these paradigms were included in Psych-101, hence they provide a stress test for a model’s generalization capabilities. Centaur robustly captured human behavior in all of these settings, while smaller and non-finetuned models did not do so consistently. Error bars correspond to the standard error of the mean, taken over responses. We state one-sided t-tests comparing the negative log-likelihoods of Centaur to those of Llama in brackets. a, Negative log-likelihoods on moral decision-making (t(181388) = −103.54, p ≤ 0.0001). b, Negative log-likelihoods on economic games (t(7798) = −11.69, p ≤ 0.0001). c, Negative log-likelihoods on naturalistic category learning (t(21838) = −14.05, p ≤ 0.0001). d, Negative log-likelihoods on behavioral propensities (t(156230) = −11.06, p ≤ 0.0001). e, Negative log-likelihoods on naturalistic reward learning (t(9838) = −12.63, p ≤ 0.0001). f, Negative log-likelihoods on a deep sequential decision task (t(6092) = −1.06, p = 0.144).
Extended Data Fig. 5
Extended Data Fig. 5. metabench and CogBench results.
a, Results for metabench, a sparse benchmark containing several canonical benchmarks from the machine learning literature. We find that Centaur maintains the level of performance of Llama, indicating that finetuning on human behavior did not lead to deterioration in other tasks (ARC: z = −0.126, p = 0.9, GSM8K: z = −0.529, p = 0.597, HellaSwag: z = 0.0, p = 1.0, MMLU: z = 0.0, p = 1.0, Winogrande: z = −0.556, p = 0.578). Performance on TruthfulQA – which measures how models mimic human falsehoods – even improved significantly with finetuning (z = 2.312, p = 0.021; all z-tests were two-sided). b, Performance-based metrics from CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. We find that – relative to Llama – Centaur’s performance improves in all experiments (Probabilistic reasoning: z = 6.371, p ≤ 0.0001, Horizon task: z = 22.176, p ≤ 0.0001, Restless bandit: z = 7.317, p ≤ 0.0001, Instrumental learning: z = 0.126, p = 0.45, Two-step task: z = 1.458, p = 0.072, Balloon analog risk task: z = 1.496, p = 0.067; all z-tests were one-sided). c, Behavioral metrics from CogBench. We observe that Centaur becomes more similar to human subjects in all ten behavioral metrics (Prior weighting: z = 2.176, p = 0.015, Likelihood weighting: z = 1.131, p = 0.129, Directed exploration: z = 0.525, p = 0.3, Random exploration: z = 2.014, p = 0.022, Meta-cognition: z = 2.206, p = 0.014, Learning rate: z = 0.477, p = 0.317, Optimism bias: z = 0.78, p = 0.218, Model-basedness: z = 9.608, p ≤ 0.0001, Temporal discounting: z = 2.594, p = 0.005, Risk taking: z = 1.612, p = 0.053; all z-tests were one-sided).
Extended Data Fig. 6
Extended Data Fig. 6. Finegrained neural alignment results in the two-step task.
a, Pearson correlation coefficients between the predicted activity from Centaur’s representations and the BOLD data shown on a surface brain (image created with nilearn). Centaur achieves the most accurate predictions in the left motor cortex. As participants performed the task with their right hand in the scanner, this effect may be explained by Centaur’s strong performance in predicting choices. b, Predictive performance of Centaur’s representations against alternatives for ROIs that have been identified as behaviorally relevant in previous work. Cortical scores are averaged over the corresponding bilateral parcels in the Schaefer atlas. The accumbens is defined based on the Harvard-Oxford atlas. Pearson correlation coefficients are shown for layer 20 but exhibit a similar pattern across all layers. Centaur outperformed Llama and the cognitive model in predicting activity in the accumbens, the ROI from the original study that showed a reward prediction error effect. We found a similar pattern in the medial PFC, another region that showed an effect in the original article, as well as in the sensory and motor cortices.
Extended Data Fig. 7
Extended Data Fig. 7. Log-likelihood comparison between Centaur and Minitaur on the analyses from the main text.
a, Negative log-likelihoods relative to the domain-specific cognitive models on held-out participants from Psych-101. Error bars correspond to the standard error of the mean, taken over responses. b, Negative log-likelihoods for the two-step task with a modified cover story. c, Negative log-likelihoods for a three-armed bandit experiment. d, Negative log-likelihoods for an experiment probing logical reasoning with items based on the Law School Admission Test (LSAT).

References

    1. Anderson, J. The Architecture of Cognition (Harvard Univ. Press, 1983).
    2. Newell, A. Unified Theories of Cognition (Harvard Univ. Press, 1990).
    3. Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017).
    4. Lake, B. M., Salakhutdinov, R. & Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science 350, 1332–1338 (2015).
    5. Goddu, M. K. & Gopnik, A. The development of human causal learning and reasoning. Nat. Rev. Psychol. https://doi.org/10.1038/s44159-024-00300-5 (2024).
