Evaluating language models for mathematics through interactions

Katherine M. Collins et al.

Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2318124121. doi: 10.1073/pnas.2318124121. Epub 2024 Jun 3.
Abstract

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations may constitute better assistants. Humans should inspect LLM output carefully, given these models' current shortcomings and potential for surprising fallibility.

Keywords: AI; human–computer interaction; language models; theorem proving.


Conflict of interest statement

Competing interests statement: K.M.C. held a part-time Student Researcher position at Google DeepMind during part of this work; however, this work does not represent any of her work at Google nor the views of Google and was conducted entirely outside of that position. A.Q.J. has similarly been a part-time researcher at Mistral AI during part of this work. The study was formulated prior to either author's involvement and does not connect to their work there. Y.W. joined xAI after the conception of this work; xAI was not involved.

Figures

Fig. 1.
(A) Contrasting typical static evaluation (Top) with interactive evaluation (Bottom), wherein a human iteratively queries a model and rates the quality of responses. (B) Example subset of the chat interface from CheckMate where users interact with an LLM. The participant can type their query (Lower Left), which is compiled in LaTeX (Lower Right). When ready, the participant can press “Interact” to have their query routed to the model, and can continue for multiple interactions. The entire chat history is presented for the user to refer to. When the user is done with their interaction, they can press “Done with interaction” to proceed to rating the interaction. The participant also sees the problem text above the chat window; we include full screenshots in SI Appendix, Figs. S2 and S3.
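To make the interaction protocol in panel B concrete, the following is a minimal sketch of an interactive evaluation loop in the spirit of CheckMate. The callables query_model, get_user_query, and get_rating, and the RatedStep record, are hypothetical placeholders standing in for the chat interface and survey widgets; this is an illustration under assumptions, not the released implementation.

# Minimal sketch of an interactive evaluation loop in the spirit of CheckMate.
# query_model, get_user_query, get_rating, and RatedStep are hypothetical
# placeholders; they do not correspond to the released codebase.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class RatedStep:
    query: str
    response: str
    correctness: int  # participant rating on a 0-6 scale
    helpfulness: int  # participant rating on a 0-6 scale

def run_interactive_evaluation(
    problem: str,
    query_model: Callable[[List[Dict[str, str]]], str],
    get_user_query: Callable[[], Optional[str]],
    get_rating: Callable[[str], int],
) -> List[RatedStep]:
    """One participant, one problem: iterative querying followed by per-step ratings."""
    history: List[Dict[str, str]] = [{"role": "system", "content": f"Problem: {problem}"}]
    steps: List[Tuple[str, str]] = []
    while True:
        query = get_user_query()            # typed in the chat box, rendered as LaTeX
        if query is None:                    # participant clicks "Done with interaction"
            break
        history.append({"role": "user", "content": query})
        response = query_model(history)      # query, with the chat history, routed to the LLM
        history.append({"role": "assistant", "content": response})
        steps.append((query, response))
    # After the interaction, the participant rates each model response.
    return [
        RatedStep(
            query=q,
            response=r,
            correctness=get_rating(f"Rate the mathematical correctness of step {i} (0-6)"),
            helpfulness=get_rating(f"Rate the perceived helpfulness of step {i} (0-6)"),
        )
        for i, (q, r) in enumerate(steps)
    ]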
Fig. 2.
(A) Counts of model preference ratings postinteraction; participants ranked models based on preference as a mathematical assistant (lower rank is better). Ties were allowed and are included: participants were permitted to assign the same rank to multiple models (SI Appendix, Additional Survey Observations). (B) Mathematical correctness and perceived helpfulness scores (all scores are integers in {0, 1, ..., 6}; higher is better) received by each model. Full details about the text associated with the scales of each score are included in SI Appendix, Additional Survey Details. (C) Comparison of participants’ scores for the mathematical correctness and the perceived helpfulness of each model’s generations. Each dot is a score for a single human–model interaction. We add slight jitter for visual ease because points overlap. Interestingly, we observe cases where the perceived helpfulness and correctness of a generation diverge; that is, particular instances can be deemed incorrect yet somewhat helpful, or correct but somewhat unhelpful. (D) The relationship between correctness and helpfulness scores and whether the step is terminal (i.e., the step after which the participant stopped interacting for a particular problem). The size of each bubble indicates the number of interactions with that particular (correctness, helpfulness) score pair. For a fixed score pair, the opacity indicates the stopping ratio, that is, the number of terminal steps divided by the total number of steps.
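As an illustration of the statistic plotted in panel D, here is a small sketch of how the stopping ratio per (correctness, helpfulness) score pair could be computed from a table of per-step ratings. The column names ("correctness", "helpfulness", "is_terminal") are assumptions for this example and are not the actual MathConverse field names.

# Hedged sketch: computing the per-score-pair stopping ratio shown in Fig. 2D.
# Column names are assumed, not the actual MathConverse schema.
import pandas as pd

def stopping_ratio(ratings: pd.DataFrame) -> pd.DataFrame:
    """For each (correctness, helpfulness) pair: number of steps and share that are terminal."""
    grouped = ratings.groupby(["correctness", "helpfulness"])["is_terminal"]
    return grouped.agg(n_steps="size", stopping_ratio="mean").reset_index()

if __name__ == "__main__":
    # Toy data: five rated steps, of which three were the last step for their problem.
    toy = pd.DataFrame({
        "correctness": [6, 6, 6, 2, 2],
        "helpfulness": [6, 6, 5, 1, 1],
        "is_terminal": [True, False, True, False, True],
    })
    print(stopping_ratio(toy))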
Fig. 3.
(A) Query profiles as a function of the interaction step. In the first interaction, users prefer to ask for definitions or general mathematics questions, or to paste in the full problem text; in the second interaction, they more often correct the model’s output, ask why, etc. Interaction step 0 is the initial interaction; step 1 is the query made after receiving the AI’s response to the query made in step 0. (B) Query profiles for the first interaction step (i.e., step 0) as a function of the amount of experience the user had with AI systems prior to participating.

