Premise-Augmented Reasoning Chains Improve Error Identification in Math Reasoning with LLMs
Authors: Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani Tur
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments with a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably. The accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations. |
| Researcher Affiliation | Academia | University of Illinois at Urbana-Champaign. Correspondence to: Sagnik Mukherjee <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Constructing and Evaluating PARC |
| Open Source Code | Yes | Our code and data are available at https://github.com/SagnikMukherjee/PARC |
| Open Datasets | Yes | We introduce and will release PERL, a dataset of reasoning chains annotated with premises and error types, to facilitate broader research on premise-centered reasoning verification. ... We used existing datasets of math word problems to create PERL, our testbed for step-level premise and error annotations. Generating Reasoning Chains: In order to generate the reasoning chains, we used two popular benchmarks (i) GSM8K (Cobbe et al., 2021), a collection of 8,500 grade school math word problems, and (ii) MATH (Hendrycks et al., 2021b), a dataset of 12,500 challenging competition-level math word problems. In addition, we use the (iii) Orca-Math dataset by Mitra et al. (2024), a synthetic dataset of 200K math problems alongside solutions written by GPT-4-Turbo, and the (iv) MetaMathQA dataset by Yu et al. (2023). |
| Dataset Splits | Yes | We first randomly sampled 1000 examples from the GSM8K and MATH test splits and the Orca-Math and MetaMathQA training splits (since these are training datasets). ... Next, we randomly sampled 50 positive (correct) and 50 negative (incorrect) reasoning chains. To expand our dataset, we employed GPT-4o to systematically introduce mathematical or logical errors into the correct reasoning chains, creating additional synthetic negative examples... This results in a total of 607 reasoning chains with 203 positives, 214 negatives, and 190 synthetic negatives. |
| Hardware Specification | No | Model: For the Llama model (Grattafiori et al., 2024), we used vLLM (Kwon et al., 2023) for model serving and Azure OpenAI for the GPT-4o and o1 (OpenAI, 2024) models. To ensure reproducibility, all generations were performed with a temperature=0. For all models, we used their instruct variant. |
| Software Dependencies | No | Model: For the Llama model (Grattafiori et al., 2024), we used vLLM (Kwon et al., 2023) for model serving and Azure OpenAI for the GPT-4o and o1 (OpenAI, 2024) models. To ensure reproducibility, all generations were performed with a temperature=0. For all models, we used their instruct variant. |
| Experiment Setup | Yes | To ensure reproducibility, all generations were performed with a temperature=0. For all models, we used their instruct variant. ... The prompts we used for our experiments are shared in Appendix A.5. ... Appendix A.5. Prompt for Error Identification: The prompts used for the baseline approach are shared in Tables 14 and 15. The evaluation for our error identification with premises is done with the prompts outlined in Tables 16 and 17. |
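
The 50-positive/50-negative sampling step quoted in the Dataset Splits row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the fixed seed, and the toy stand-in data are all assumptions (the paper does not state a seed), and real chains would come from graded GSM8K / MATH / Orca-Math / MetaMathQA generations.

```python
import random

def build_perl_subset(chains, n_pos=50, n_neg=50, seed=0):
    """Sample n_pos correct and n_neg incorrect reasoning chains,
    mirroring the 50/50 sampling step quoted above."""
    rng = random.Random(seed)  # fixed seed for repeatability (assumption)
    positives = [c for c in chains if c["correct"]]
    negatives = [c for c in chains if not c["correct"]]
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)

# Toy stand-in for graded reasoning chains.
chains = [{"id": i, "correct": i % 2 == 0} for i in range(200)]
pos, neg = build_perl_subset(chains)
print(len(pos), len(neg))  # 50 50
```

The synthetic negatives described in the same row (GPT-4o injecting errors into correct chains) would then be derived from the `pos` subset, yielding the 203/214/190 totals reported above.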
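
The decoding setup quoted in the rows above (vLLM serving, temperature=0, instruct variants) could be configured roughly as follows. This is a configuration sketch under stated assumptions: the model identifier and prompt text are placeholders chosen for illustration, not taken from the paper, and running it requires a GPU with the model weights available.

```python
# Sketch of deterministic decoding with vLLM, matching the quoted temperature=0 setting.
from vllm import LLM, SamplingParams

sampling = SamplingParams(temperature=0.0, max_tokens=1024)  # greedy decoding
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")          # instruct variant (placeholder)
outputs = llm.generate(["<verification prompt for one reasoning step>"], sampling)
print(outputs[0].outputs[0].text)
```

The GPT-4o and o1 runs would use the same `temperature=0` setting through the Azure OpenAI API rather than vLLM.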