Planning in Natural Language Improves LLM Search for Code Generation
Authors: Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, William Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). Applying PLANSEARCH on top of Claude 3.5 Sonnet achieves a pass@200 of 77.0% on LiveCodeBench, outperforming both the best pass-rate achieved without any search (pass@1 = 41.4%) and using standard repeated sampling on top of existing non-search models (pass@200 = 60.6%). |
| Researcher Affiliation | Collaboration | Evan Wang (2), Federico Cassano (3, 4), Catherine Wu (5), Yunfeng Bai (1), William Song (1), Vaskar Nath (1), Ziwen Han (1), Sean Hendryx (1), Summer Yue (1), Hugh Zhang (1); (1) Scale AI, (2) California Institute of Technology, (3) Anysphere, (4) Northeastern University, (5) Anthropic, work done while at Scale AI |
| Pseudocode | Yes | Pseudocode Strategy Description...These natural language solutions are then translated into pseudocode, which is subsequently translated into actual Python code. We take a more granular approach to reduce the translation error (which may cause the model to revert to its original mode, disregarding the reasoned-through observations). We provide all prompts for all sections in Appendix M.4. |
| Open Source Code | No | No explicit statement or link to the authors' own source code repository for the described methodology is provided in the paper. |
| Open Datasets | Yes | We evaluate our search methods on three benchmarks: MBPP+, HumanEval+ (Liu et al., 2023), and LiveCodeBench (Jain et al., 2024). MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021) are some of the most widely used code benchmarks in the field. However, since both benchmarks provide only a few test cases, Liu et al. (2023) updates both benchmarks with additional test cases that increase the benchmarks' robustness to reward hacking. LiveCodeBench is a benchmark for coding that consists of competitive programming problems which typically require advanced reasoning capabilities. |
| Dataset Splits | No | The paper mentions using a subset of problems from LiveCodeBench based on date: "For this paper, we use only the subset of problems between May 2024 and September 2024 to avoid possibilities of contamination." However, it does not specify explicit training, validation, or test splits for these or other datasets in terms of percentages, counts, or splitting methodology. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions "Python programmer" and models like "Claude 3.5 Sonnet" but does not specify version numbers for Python or any other key software libraries, frameworks, or solvers used in their implementation. |
| Experiment Setup | Yes | All models are run with temperature 0.9 and top-p of 0.95. (o1-mini was run with temperature 1.0 and top-p of 1.0 because of API constraints.) Temperature was determined through a coarse hyperparameter sweep on REPEATED SAMPLING and IDEASEARCH from T ∈ {0.0, 0.1, 0.2, …, 1.2}, which we describe in Appendix F. |
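The pass@1 and pass@200 figures quoted in the Research Type row are pass@k metrics. The paper excerpt does not state which estimator the authors used; a common choice is the unbiased estimator introduced with HumanEval (Chen et al., 2021), sketched below for reference (the function name and signature here are illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Given n total samples for a problem, of which c pass the tests,
    returns the probability that at least one of k randomly drawn
    samples (without replacement) passes.
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # samples must include at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 samples, 2 correct, drawing 1 -> probability 0.5
print(pass_at_k(4, 2, 1))
```

A benchmark-level pass@k score is then the mean of this quantity over all problems. Note that reproducing pass@200 requires at least 200 samples per problem, which constrains the compute budget of any replication attempt.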