Evaluating language models for mathematics through interactions

Katherine M. Collins et al.

Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2318124121. doi: 10.1073/pnas.2318124121. Epub 2024 Jun 3.
Abstract

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations may constitute better assistants. Humans should inspect LLM output carefully, given these models' current shortcomings and potential for surprising fallibility.

Keywords: AI; human–computer interaction; language models; theorem proving.


Conflict of interest statement

Competing interests statement: K.M.C. held a part-time Student Researcher position at Google DeepMind during part of this work; however, this work does not represent any of her work at Google nor the views of Google and was conducted entirely outside of that position. A.Q.J. has similarly been a part-time researcher at Mistral AI during part of this work. The study was formulated prior to either author's involvement and does not connect to their work there. Y.W. joined xAI after the conception of this work; xAI was not involved.

Figures

Fig. 1.
(A) Contrasting typical static evaluation (Top) with interactive evaluation (Bottom), wherein a human iteratively queries a model and rates the quality of responses. (B) Example subset of the chat interface from CheckMate where users interact with an LLM. The participant can type their query (Lower Left), which is compiled in LaTeX (Lower Right). When ready, the participant can press “Interact” to have their query routed to the model, and can continue for multiple interactions. The entire chat history is presented for the user to refer to. When the user is done with their interaction, they can press “Done with interaction” to proceed to rating the interaction. The participant also sees the problem text above the chat window; we include full screenshots in SI Appendix, Figs. S2 and S3.
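To make the interaction protocol in panel B concrete, the following is a minimal sketch of an interactive evaluation loop in the spirit of CheckMate. The callables query_model, get_user_query, and get_rating, and the RatedStep record, are hypothetical placeholders standing in for the chat interface and survey widgets; this is an illustration under assumptions, not the released implementation.

# Minimal sketch of an interactive evaluation loop in the spirit of CheckMate.
# query_model, get_user_query, get_rating, and RatedStep are hypothetical
# placeholders; they do not correspond to the released codebase.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class RatedStep:
    query: str
    response: str
    correctness: int  # participant rating on a 0-6 scale
    helpfulness: int  # participant rating on a 0-6 scale

def run_interactive_evaluation(
    problem: str,
    query_model: Callable[[List[Dict[str, str]]], str],
    get_user_query: Callable[[], Optional[str]],
    get_rating: Callable[[str], int],
) -> List[RatedStep]:
    """One participant, one problem: iterative querying followed by per-step ratings."""
    history: List[Dict[str, str]] = [{"role": "system", "content": f"Problem: {problem}"}]
    steps: List[Tuple[str, str]] = []
    while True:
        query = get_user_query()            # typed in the chat box, rendered as LaTeX
        if query is None:                    # participant clicks "Done with interaction"
            break
        history.append({"role": "user", "content": query})
        response = query_model(history)      # query, with the chat history, routed to the LLM
        history.append({"role": "assistant", "content": response})
        steps.append((query, response))
    # After the interaction, the participant rates each model response.
    return [
        RatedStep(
            query=q,
            response=r,
            correctness=get_rating(f"Rate the mathematical correctness of step {i} (0-6)"),
            helpfulness=get_rating(f"Rate the perceived helpfulness of step {i} (0-6)"),
        )
        for i, (q, r) in enumerate(steps)
    ]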
Fig. 2.
(A) Counts of model preference ratings postinteraction; participants ranked models based on preference as a mathematical assistant (lower rank is better). Ties were allowed and are included: participants were permitted to assign the same rank to multiple models (SI Appendix, Additional Survey Observations). (B) Mathematical correctness and perceived helpfulness scores (all scores are integers in {0, 1, ..., 6}; higher is better) received by each model. Full details about the text associated with the scales of each score are included in SI Appendix, Additional Survey Details. (C) Comparison of participants’ scores for the mathematical correctness and the perceived helpfulness of each model’s generations. Each dot is a score for a single human–model interaction. We add slight jitter for visual ease because points overlap. Interestingly, we observe cases where the perceived helpfulness and correctness of a generation diverge; that is, particular instances can be deemed incorrect yet somewhat helpful, or correct but somewhat unhelpful. (D) The relationship between correctness and helpfulness scores and whether the step is terminal (i.e., the step after which the participant stopped interacting for a particular problem). The size of each bubble indicates the number of interactions with that particular (correctness, helpfulness) score pair. For a fixed score pair, the opacity indicates the stopping ratio, that is, the number of terminal steps divided by the total number of steps.
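As an illustration of the statistic plotted in panel D, here is a small sketch of how the stopping ratio per (correctness, helpfulness) score pair could be computed from a table of per-step ratings. The column names ("correctness", "helpfulness", "is_terminal") are assumptions for this example and are not the actual MathConverse field names.

# Hedged sketch: computing the per-score-pair stopping ratio shown in Fig. 2D.
# Column names are assumed, not the actual MathConverse schema.
import pandas as pd

def stopping_ratio(ratings: pd.DataFrame) -> pd.DataFrame:
    """For each (correctness, helpfulness) pair: number of steps and share that are terminal."""
    grouped = ratings.groupby(["correctness", "helpfulness"])["is_terminal"]
    return grouped.agg(n_steps="size", stopping_ratio="mean").reset_index()

if __name__ == "__main__":
    # Toy data: five rated steps, of which three were the last step for their problem.
    toy = pd.DataFrame({
        "correctness": [6, 6, 6, 2, 2],
        "helpfulness": [6, 6, 5, 1, 1],
        "is_terminal": [True, False, True, False, True],
    })
    print(stopping_ratio(toy))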
Fig. 3.
(A) Query profiles as a function of the interaction step. In the first interaction, users prefer to ask for definitions or general mathematics questions, or to paste in the full problem text; in the second interaction, they more often correct the model’s output, ask why, etc. Interaction step 0 is the initial interaction; step 1 is the query made after receiving the AI’s response to the query made in step 0. (B) Query profiles for the first interaction step (i.e., step 0) as a function of the amount of experience the user had with AI systems prior to participating.

