Cumulative Reasoning with Large Language Models
Authors: Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew C. Yao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate CR's advantage through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over previous methods. In solving MATH problems, CR achieves a 4.2% increase from previous methods and a 43% relative improvement in the most challenging Level 5 problems. When incorporating a code environment with CR, we further harness LLMs' reasoning capabilities and outperform the Program of Thought (PoT) method by 38.8%. |
| Researcher Affiliation | Academia | Yifan Zhang (EMAIL), IIIS, Tsinghua University; Jingqin Yang (EMAIL), IIIS, Tsinghua University; Yang Yuan (EMAIL), IIIS, Tsinghua University and Shanghai Qi Zhi Institute; Andrew C. Yao (EMAIL), IIIS, Tsinghua University and Shanghai Qi Zhi Institute |
| Pseudocode | Yes | Figure 9: Prompt template for CR Proposer on logical inference tasks. Figure 10: Prompt template for CR Verifier on logical inference tasks. Figure 11: Prompt template for CR Reporter on logical inference tasks. Figure 12: Prompt template for CR Proposer on Game of 24. Figure 13: Prompt template for CR Verifier (a) on Game of 24. Figure 14: Prompt template for CR Verifier (b) on Game of 24. Figure 15: Prompt template for CR Reporter on Game of 24. Figure 16: Meta Prompt for CR with code environment on solving MATH problems. |
| Open Source Code | Yes | The code is available at https://github.com/iiis-ai/cumulative-reasoning. |
| Open Datasets | Yes | The FOLIO dataset (Han et al., 2022) is a collection of first-order logical inference problems expressed in natural language. The AutoTNLI dataset (Kumar et al., 2022) extends the INFOTABS dataset (Gupta et al., 2020) to construct a challenging Tabular Natural Language Inference task. The MATH dataset (Hendrycks et al., 2021) provides a comprehensive benchmark for mathematical reasoning across diverse subdomains such as Algebra and Geometry. |
| Dataset Splits | No | The paper mentions: 'a testable collection of 534 examples' for FOLIO-wiki, 'the curated set comprises 460 examples' for FOLIO-wiki-curated, 'limit our evaluation to the first 1,000 table-hypothesis pairs' for AutoTNLI, 'a set of 100 puzzles curated by ToT (Yao et al., 2023)' for Game of 24, and 'using an 8-shot prompting strategy on a 500-example subset' for the MATH dataset. While these specify the size of subsets used for evaluation or prompting, they do not provide explicit training/validation/test dataset splits (e.g., percentages, exact counts for each split, or detailed splitting methodology with random seeds) for the main experimental process. |
| Hardware Specification | No | The paper mentions using specific LLMs (GPT-3.5-turbo, GPT-4, LLaMA-13B, and LLaMA-65B) and accessing OpenAI's chat-format APIs. However, it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory, or cloud instance specifications) on which these LLMs were run or the authors' experiments were conducted. |
| Software Dependencies | No | The paper mentions the 'Microsoft Guidance library (Lundberg et al., 2023)' and a 'Python code environment'. However, it does not specify version numbers for the Guidance library or the Python environment itself, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | We evaluate our method using the following LLMs: GPT-3.5-turbo, GPT-4, LLaMA-13B, and LLaMA-65B. In our implementation of Cumulative Reasoning (CR), the roles of Proposer, Verifier(s), and Reporter are instantiated using the same underlying LLM but distinguished by role-specific few-shot prompts. Throughout the experiments, we denote by n the number of intermediate propositions generated and by k the number of majority voting iterations. For decoding, we set the temperature to t = 0.1 by default and t = 0.7 for majority voting. GPT-3.5-turbo and GPT-4 are accessed via OpenAI's chat-format APIs. The process terminates either when a solution is reported or when the iteration count exceeds a predefined limit (L = 50). We run multiple parallel branches (with breadth b ranging from 1 to 5) to account for variability in the search. Our reproduction follows the evaluation protocol of Lightman et al. (2023), using an 8-shot prompting strategy on a 500-example subset that spans all difficulty levels (Levels 1–5). CR outperforms Complex CoT by 5.4% in overall accuracy when using a 4-shot strategy. We adopted a default temperature setting of t = 0.0, consistent with prior research settings (greedy decoding). |
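The control flow quoted in the Experiment Setup row (Proposer suggests a proposition, Verifier accepts or rejects it, Reporter answers once enough verified propositions accumulate, with an iteration cap of L = 50) can be sketched as follows. This is a minimal illustration, not the authors' code: the `propose`, `verify`, and `report` callables are hypothetical stand-ins for the role-specific few-shot LLM prompts described in the paper.

```python
def cumulative_reasoning(premises, propose, verify, report, max_iters=50):
    """Sketch of the CR loop: accumulate verified intermediate propositions
    until the Reporter produces an answer or the iteration limit is hit.

    propose(context) -> a candidate proposition (Proposer role)
    verify(context, proposition) -> bool (Verifier role)
    report(context) -> final answer, or None if not yet solvable (Reporter role)
    """
    context = list(premises)  # premises plus verified propositions so far
    for _ in range(max_iters):
        proposition = propose(context)
        if verify(context, proposition):
            context.append(proposition)  # only verified steps accumulate
        answer = report(context)
        if answer is not None:
            return answer
    return None  # iteration limit (L = 50 in the paper) reached
```

In the paper's setup, the three roles share one underlying LLM and differ only in their prompts; here they are passed as separate callables purely to make the loop structure explicit.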