Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment

Authors: Yuze Zhao, Tianyun Ji, Wenjun Feng, Zhenya Huang, Qi Liu, Zhiding Liu, Yixiao Ma, Kai Zhang, Enhong Chen

ICLR 2025

Reproducibility Variables
Research Type: Experimental
LLM Response: Our testing on these benchmarks reveals that LLMs continue to struggle with identifying satisfactory reasoning pathways. Additionally, we present a new pathway exploration pipeline inspired by human intricate problem-solving methods. This Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline consists of the following iterative steps: (1) Proposing potential hypotheses based on observations and decomposing them; (2) Utilizing tools to validate hypotheses and reflection outcomes; (3) Revising hypotheses in light of observations. Our approach effectively mitigates logical chain collapses arising from forgetting or hallucination issues in multi-step reasoning, resulting in performance gains of up to 3×. Finally, we expand this pipeline by applying it to simulate complex household tasks in real-world scenarios, specifically in VirtualHome, enhancing the handling of failure cases. Our experimental results indicate that RHDA methods effectively mitigate reasoning failures caused by data sparsity. With the same or even lower overhead, this method achieved performance improvements of up to three times compared to baseline methods.
Researcher Affiliation: Academia
LLM Response: Yuze Zhao1, Tianyun Ji1, Wenjun Feng1, Zhenya Huang1,2, Qi Liu1,2, Zhiding Liu1, Yixiao Ma1, Kai Zhang1, Enhong Chen1. 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Pseudocode: Yes
LLM Response: This Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline consists of the following iterative steps: (1) Proposing potential hypotheses based on observations and decomposing them; (2) Utilizing tools to validate hypotheses and reflection outcomes; (3) Revising hypotheses in light of observations. Our approach effectively mitigates logical chain collapses arising from forgetting or hallucination issues in multi-step reasoning, resulting in performance gains of up to 3×. Finally, we expand this pipeline by applying it to simulate complex household tasks in real-world scenarios, specifically in VirtualHome, enhancing the handling of failure cases. We release our code and all of our results at https://github.com/TnTWoW/code_reasoning. [...] Figure 2: An overview of the pipeline for solving code reasoning tasks. We decompose the hypothesis and generate executable functions step by step. After comparing the results with the seen observations and receiving feedback, we propose amendments, reflect on potential errors at each step, and generate revised hypotheses. This process is repeated until a valid problem-solving pathway is discovered. For concise expression, we show partial code snippets.
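The three iterative RHDA steps quoted above can be sketched as a minimal control loop. This is a toy illustration under stated assumptions, not the authors' implementation: `propose_hypothesis` and `amend_hypothesis` are hypothetical stand-ins for the LLM calls, and the validating "tool" is plain Python execution on the seen observations.

```python
def propose_hypothesis():
    # Step (1): initial hypothesis for the input/output rule.
    # Toy stand-in for an LLM proposal; this first guess is wrong.
    return lambda xs: [x * 2 for x in xs]

def amend_hypothesis(failures):
    # Step (3): revise the hypothesis in light of the feedback.
    # Toy stand-in for an LLM amendment; this revision is correct.
    return lambda xs: [x + 1 for x in xs]

def validate(hypothesis, observations):
    # Step (2): offload execution to the tool (here, plain Python)
    # and collect every seen observation the hypothesis fails on.
    return [(x, expected, hypothesis(x))
            for x, expected in observations
            if hypothesis(x) != expected]

def rhda(observations, max_iterations=3):
    hypothesis = propose_hypothesis()
    for _ in range(max_iterations):
        failures = validate(hypothesis, observations)
        if not failures:
            return hypothesis  # all seen observations satisfied
        hypothesis = amend_hypothesis(failures)
    return None  # no valid pathway found within the iteration budget

seen = [([1, 2], [2, 3]), ([5], [6])]
rule = rhda(seen)
print(rule([10, 20]))  # the induced rule applied to an unseen input -> [11, 21]
```

The loop structure mirrors Figure 2: propose, execute against seen observations, amend, and repeat until a hypothesis satisfies all of them or the iteration budget (T in Table 1) is exhausted.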
Open Source Code: Yes
LLM Response: We release our code and all of our results at https://github.com/TnTWoW/code_reasoning. [...] Our code, datasets and experimental results are available at https://github.com/TnTWoW/code_reasoning. Additionally, Appendix H contains details about the pipeline and the prompts used in the method.
Open Datasets: Yes
LLM Response: Inductive code reasoning encompasses four challenging PBE tasks, two of which are GPL tasks: List Function (Rule, 2020) and MiniARC (Kim et al., 2022), while the other two are DSL tasks: RobustFill (Devlin et al., 2017) and DeepCoder (Balog et al., 2016). [...] Deductive and abductive code reasoning can be regarded as opposite processes; therefore, we selected two identical and representative datasets, CRUXEval (Gu et al., 2024) and LiveCodeBench (Jain et al., 2024), as benchmarks to validate these two capabilities. [...] Our code, datasets and experimental results are available at https://github.com/TnTWoW/code_reasoning.
Dataset Splits: Yes
LLM Response: For inductive code reasoning, we establish four baseline methods. The Input-Output (IO) prompting requires the LLM to predict outputs based on all seen observations and an unseen input. ... A successful prediction for a single instance does not generate a hypothesis that satisfies all observations, resulting in a high prediction accuracy but a relatively low task accuracy. ... Due to the limited context lengths of LLMs, we only use the first 16 examples from BIG-Bench (bench authors, 2023): 8 as seen examples and 8 as unseen examples. We manually examined the exemplars and found that 8 examples are generally sufficient to describe the pattern. ... Table 7: The number of tasks per dataset, the number of seen examples per task, and the number of unseen examples per task.
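The 8-seen/8-unseen split described above amounts to a simple slice over the first 16 exemplars of each task. The exemplar list below is a toy placeholder; in the real setup the items are (input, output) pairs loaded from BIG-Bench.

```python
# Toy stand-ins for the first 16 BIG-Bench exemplars of one task;
# the real items are (input, output) pairs from the benchmark.
exemplars = [f"example_{i}" for i in range(16)]

# First 8 are shown to the model ("seen"), last 8 are held out ("unseen").
seen, unseen = exemplars[:8], exemplars[8:]
print(len(seen), len(unseen))  # -> 8 8
```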
Hardware Specification: No
LLM Response: We utilize the latest and most advanced model, gpt-4o-2024-08-06, as the backbone LLM for all our experiments. We report the results using Llama-3.1-70B-Instruct, Qwen-max (qwen-max-2024-09-19) (Bai et al., 2023), and Claude 3.5 (claude-3-5-sonnet-20240620) in Appendix B.
Software Dependencies: Yes
LLM Response: We utilize the latest and most advanced model, gpt-4o-2024-08-06, as the backbone LLM for all our experiments. We report the results using Llama-3.1-70B-Instruct, Qwen-max (qwen-max-2024-09-19) (Bai et al., 2023), and Claude 3.5 (claude-3-5-sonnet-20240620) in Appendix B. Following the methodology of Qiu et al. (2024), we set the temperature to 0.7. ... After offloading the execution to the tool (e.g., a Python executor) and receiving feedback, amendments are proposed to modify the initial hypothesis.
Experiment Setup: Yes
LLM Response: Experimental Setup. We utilize the latest and most advanced model, gpt-4o-2024-08-06, as the backbone LLM for all our experiments. We report the results using Llama-3.1-70B-Instruct, Qwen-max (qwen-max-2024-09-19) (Bai et al., 2023), and Claude 3.5 (claude-3-5-sonnet-20240620) in Appendix B. Following the methodology of Qiu et al. (2024), we set the temperature to 0.7. We report results using several methods: input-output (IO) prompting, standard prompting, Chain of Thought (CoT) (Wei et al., 2023), Program of Thought (PoT) (Chen et al., 2023), Chain of Code (CoC) (Li et al., 2024), Self-Consistency (SC) (Wang et al., 2023c) and Self-Refine (SR) (Madaan et al., 2024), all implemented with 2-shot learning. For our proposed process, we employ 0-shot prompts, allowing the LLM to explore problem-solving pathways in a more flexible manner. We provide detailed prompt templates in Appendix H. ... Table 1: RHDA method on the inductive code reasoning task. T refers to the maximum number of iterations; N refers to the number of candidates.
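Among the baselines listed above, Self-Consistency (SC) reduces to sampling several candidate answers and keeping the majority vote. The sketch below is a hedged illustration, not the authors' code: `sample_answer` is a hypothetical stub returning canned strings, whereas in the paper's setup the samples would come from the backbone LLM at temperature 0.7.

```python
from collections import Counter

def sample_answer(i):
    # Hypothetical stub standing in for one temperature-0.7 LLM sample;
    # here 3 of 5 canned candidates agree on the same answer.
    canned = ["[2, 3]", "[2, 3]", "[2, 4]", "[2, 3]", "[3, 3]"]
    return canned[i]

def self_consistency(n_samples=5):
    # Tally the sampled answers and return the most frequent one.
    votes = Counter(sample_answer(i) for i in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

print(self_consistency())  # -> [2, 3]
```

In contrast to this single-pass vote over N candidates, RHDA spends its budget on up to T sequential amend-and-revalidate iterations of one hypothesis.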