Premise-Augmented Reasoning Chains Improve Error Identification in Math Reasoning with LLMs

Authors: Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani Tur

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through experiments with a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably. The accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations."
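The 90% premise-identification recall quoted above can be made concrete with a small sketch. This is not the paper's evaluation code; the function name and the dictionary encoding of premise links are assumptions for illustration.

```python
def premise_recall(gold: dict, pred: dict) -> float:
    """Fraction of gold premise links (step -> earlier steps it depends on)
    that the model recovered, pooled over all steps in a chain."""
    hits = total = 0
    for step_id, gold_premises in gold.items():
        predicted = set(pred.get(step_id, []))
        hits += len(set(gold_premises) & predicted)
        total += len(gold_premises)
    return hits / total if total else 1.0

# Hypothetical chain: step 3 depends on steps 1 and 2, step 4 on step 3.
gold = {3: [1, 2], 4: [3]}
pred = {3: [1, 2], 4: []}  # the model missed the premise of step 4
score = premise_recall(gold, pred)  # 2 of 3 gold links recovered
```

A chain-level score like this would then be averaged over the dataset to produce the aggregate recall the abstract reports.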
Researcher Affiliation | Academia | "University of Illinois at Urbana-Champaign. Correspondence to: Sagnik Mukherjee <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Constructing and Evaluating PARC"
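The premise-augmentation-then-verification idea behind Algorithm 1 can be sketched as follows. The class and function names, and the two callables standing in for the paper's LLM calls (`identify_premises` for premise linking, `check_step` for step verification), are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    sid: int   # position in the chain; 0 is the question statement
    text: str
    premises: list = field(default_factory=list)  # sids this step depends on


def build_parc(steps, identify_premises):
    """Augment each step with links to the earlier steps it depends on.
    `identify_premises` is a placeholder for the LLM call in the paper."""
    for i, step in enumerate(steps):
        step.premises = identify_premises(step, steps[:i])
    return steps


def verify(steps, check_step):
    """Judge each step against only its premises, not the whole prefix.
    `check_step` is a placeholder for the LLM verifier; returns faulty sids."""
    by_id = {s.sid: s for s in steps}
    return [s.sid for s in steps
            if not check_step(s, [by_id[p] for p in s.premises])]


# Toy chain with an arithmetic slip in the final step.
chain = [
    Step(0, "Q: Ann has 2 apples and buys 3 more. How many now?"),
    Step(1, "2 + 3 = 5"),
    Step(2, "So Ann has 6 apples."),
]
parc = build_parc(chain, lambda step, earlier: [e.sid for e in earlier])
faulty = verify(parc, lambda step, prems: "6" not in step.text)  # toy checker
```

The design point the paper argues for is the second function: conditioning the verifier on a step's premises rather than on the entire preceding chain is what yields the reported accuracy gains.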
Open Source Code | Yes | "Our code and data are available at https://github.com/SagnikMukherjee/PARC"
Open Datasets | Yes | "We introduce and will release PERL, a dataset of reasoning chains annotated with premises and error types, to facilitate broader research on premise-centered reasoning verification. ... We used existing datasets of math word problems to create PERL, our testbed for step-level premise and error annotations. Generating Reasoning Chains: In order to generate the reasoning chains, we used two popular benchmarks: (i) GSM8K (Cobbe et al., 2021), a collection of 8,500 grade-school math word problems, and (ii) MATH (Hendrycks et al., 2021b), a dataset of 12,500 challenging competition-level math word problems. In addition, we use (iii) the Orca-Math dataset by Mitra et al. (2024), a synthetic dataset of 200K math problems with solutions written by GPT-4-Turbo, and (iv) the MetaMathQA dataset (Yu et al., 2023)."
Dataset Splits | Yes | "We first randomly sampled 1000 examples from the GSM8K and MATH test splits and the Orca-Math and MetaMathQA training splits (since these are training datasets). ... Next, we randomly sampled 50 positive (correct) and 50 negative (incorrect) reasoning chains. To expand our dataset, we employed GPT-4o to systematically introduce mathematical or logical errors into the correct reasoning chains, creating additional synthetic negative examples. ... This results in a total of 607 reasoning chains with 203 positives, 214 negatives, and 190 synthetic negatives."
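The sampling recipe in that row can be sketched per source dataset as below. The function name, the chain encoding, and the `perturb` callable (standing in for the GPT-4o error-injection step) are assumptions for illustration; the paper's final totals (203/214/190) differ slightly from a clean 50-per-dataset count, presumably due to filtering.

```python
import random


def build_perl_split(chains, n_pos=50, n_neg=50, perturb=None):
    """Sketch of the PERL sampling recipe for one source dataset:
    sample correct and incorrect chains, then turn the sampled correct
    chains into synthetic negatives via `perturb` (a stand-in for the
    GPT-4o error-injection prompt)."""
    pos = random.sample([c for c in chains if c["correct"]], n_pos)
    neg = random.sample([c for c in chains if not c["correct"]], n_neg)
    synth = [perturb(c) for c in pos] if perturb else []
    return pos, neg, synth
```

Running this once per source dataset (GSM8K, MATH, Orca-Math, MetaMathQA) and pooling the outputs reproduces the three-way positive / negative / synthetic-negative structure of PERL.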
Hardware Specification | No | "For the Llama model (Grattafiori et al., 2024), we used vLLM (Kwon et al., 2023) for model serving and Azure OpenAI for the GPT-4o and o1 (OpenAI, 2024) models. To ensure reproducibility, all generations were performed with temperature=0. For all models, we used their instruct variant."
Software Dependencies | No | "For the Llama model (Grattafiori et al., 2024), we used vLLM (Kwon et al., 2023) for model serving and Azure OpenAI for the GPT-4o and o1 (OpenAI, 2024) models. To ensure reproducibility, all generations were performed with temperature=0. For all models, we used their instruct variant."
Experiment Setup | Yes | "To ensure reproducibility, all generations were performed with temperature=0. For all models, we used their instruct variant. ... The prompts we used for our experiments are shared in Appendix A.5. ... Appendix A.5, Prompts for Error Identification: The prompts used for the baseline approach are shared in Tables 14 and 15. The evaluation for our error identification with premises is done with the prompts outlined in Tables 16 and 17."