Evaluating language models for mathematics through interactions
- PMID: 38830100
- PMCID: PMC11181017
- DOI: 10.1073/pnas.2318124121
Abstract
There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4's mathematical problem-solving through a series of case studies contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.
Keywords: AI; human–computer interaction; language models; theorem proving.
Conflict of interest statement
Competing interests statement: K.M.C. held a part-time Student Researcher position at Google DeepMind during part of this work; however, this work does not represent any of her work at Google nor the views of Google, and was conducted entirely outside of that position. A.Q.J. was similarly a part-time researcher at Mistral AI during part of the work. The study was formulated prior to either author's involvement and is unconnected to their work at these organizations. Y.W. joined xAI after the conception of this work; xAI was not involved.
