Premise-Augmented Reasoning Chains Improve Error Identification in Math Reasoning with LLMs

Authors: Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani Tur

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through experiments with a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably. The accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations."
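The 90% premise-identification recall quoted above can be made concrete with a small sketch. This is not the paper's evaluation code; the function name and the dictionary encoding of premise links are assumptions for illustration.

```python
def premise_recall(gold: dict, pred: dict) -> float:
    """Fraction of gold premise links (step -> earlier steps it depends on)
    that the model recovered, pooled over all steps in a chain."""
    hits = total = 0
    for step_id, gold_premises in gold.items():
        predicted = set(pred.get(step_id, []))
        hits += len(set(gold_premises) & predicted)
        total += len(gold_premises)
    return hits / total if total else 1.0

# Hypothetical chain: step 3 depends on steps 1 and 2, step 4 on step 3.
gold = {3: [1, 2], 4: [3]}
pred = {3: [1, 2], 4: []}  # the model missed the premise of step 4
score = premise_recall(gold, pred)  # 2 of 3 gold links recovered
```

A chain-level score like this would then be averaged over the dataset to produce the aggregate recall the abstract reports.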
Researcher Affiliation | Academia | "University of Illinois at Urbana-Champaign. Correspondence to: Sagnik Mukherjee <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Constructing and Evaluating PARC"
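The premise-augmentation-then-verification idea behind Algorithm 1 can be sketched as follows. The class and function names, and the two callables standing in for the paper's LLM calls (`identify_premises` for premise linking, `check_step` for step verification), are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    sid: int   # position in the chain; 0 is the question statement
    text: str
    premises: list = field(default_factory=list)  # sids this step depends on


def build_parc(steps, identify_premises):
    """Augment each step with links to the earlier steps it depends on.
    `identify_premises` is a placeholder for the LLM call in the paper."""
    for i, step in enumerate(steps):
        step.premises = identify_premises(step, steps[:i])
    return steps


def verify(steps, check_step):
    """Judge each step against only its premises, not the whole prefix.
    `check_step` is a placeholder for the LLM verifier; returns faulty sids."""
    by_id = {s.sid: s for s in steps}
    return [s.sid for s in steps
            if not check_step(s, [by_id[p] for p in s.premises])]


# Toy chain with an arithmetic slip in the final step.
chain = [
    Step(0, "Q: Ann has 2 apples and buys 3 more. How many now?"),
    Step(1, "2 + 3 = 5"),
    Step(2, "So Ann has 6 apples."),
]
parc = build_parc(chain, lambda step, earlier: [e.sid for e in earlier])
faulty = verify(parc, lambda step, prems: "6" not in step.text)  # toy checker
```

The design point the paper argues for is the second function: conditioning the verifier on a step's premises rather than on the entire preceding chain is what yields the reported accuracy gains.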
Open Source Code | Yes | "Our code and data are available at https://github.com/SagnikMukherjee/PARC"
Open Datasets | Yes | "We introduce and will release PERL, a dataset of reasoning chains annotated with premises and error types, to facilitate broader research on premise-centered reasoning verification. ... We used existing datasets of math word problems to create PERL, our testbed for step-level premise and error annotations. Generating Reasoning Chains: In order to generate the reasoning chains, we used two popular benchmarks: (i) GSM8K (Cobbe et al., 2021), a collection of 8,500 grade-school math word problems, and (ii) MATH (Hendrycks et al., 2021b), a dataset of 12,500 challenging competition-level math word problems. In addition, we use (iii) the Orca-Math dataset by Mitra et al. (2024), a synthetic dataset of 200K math problems with solutions written by GPT-4-Turbo, and (iv) the MetaMathQA dataset (Yu et al., 2023)."
Dataset Splits | Yes | "We first randomly sampled 1000 examples from the GSM8K and MATH test splits and the Orca-Math and MetaMathQA training splits (since these are training datasets). ... Next, we randomly sampled 50 positive (correct) and 50 negative (incorrect) reasoning chains. To expand our dataset, we employed GPT-4o to systematically introduce mathematical or logical errors into the correct reasoning chains, creating additional synthetic negative examples. ... This results in a total of 607 reasoning chains with 203 positives, 214 negatives, and 190 synthetic negatives."
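The sampling recipe in that row can be sketched per source dataset as below. The function name, the chain encoding, and the `perturb` callable (standing in for the GPT-4o error-injection step) are assumptions for illustration; the paper's final totals (203/214/190) differ slightly from a clean 50-per-dataset count, presumably due to filtering.

```python
import random


def build_perl_split(chains, n_pos=50, n_neg=50, perturb=None):
    """Sketch of the PERL sampling recipe for one source dataset:
    sample correct and incorrect chains, then turn the sampled correct
    chains into synthetic negatives via `perturb` (a stand-in for the
    GPT-4o error-injection prompt)."""
    pos = random.sample([c for c in chains if c["correct"]], n_pos)
    neg = random.sample([c for c in chains if not c["correct"]], n_neg)
    synth = [perturb(c) for c in pos] if perturb else []
    return pos, neg, synth
```

Running this once per source dataset (GSM8K, MATH, Orca-Math, MetaMathQA) and pooling the outputs reproduces the three-way positive / negative / synthetic-negative structure of PERL.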
Hardware Specification | No | "For the Llama model (Grattafiori et al., 2024), we used vLLM (Kwon et al., 2023) for model serving and Azure OpenAI for the GPT-4o and o1 (OpenAI, 2024) models. To ensure reproducibility, all generations were performed with temperature=0. For all models, we used their instruct variant."
Software Dependencies | No | "For the Llama model (Grattafiori et al., 2024), we used vLLM (Kwon et al., 2023) for model serving and Azure OpenAI for the GPT-4o and o1 (OpenAI, 2024) models. To ensure reproducibility, all generations were performed with temperature=0. For all models, we used their instruct variant."
Experiment Setup | Yes | "To ensure reproducibility, all generations were performed with temperature=0. For all models, we used their instruct variant. ... The prompts we used for our experiments are shared in Appendix A.5. ... Appendix A.5, Prompts for Error Identification: The prompts used for the baseline approach are shared in Tables 14 and 15. The evaluation for our error identification with premises is done with the prompts outlined in Tables 16 and 17."