Planning in Natural Language Improves LLM Search for Code Generation
Authors: Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, William Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). Applying PLANSEARCH on top of Claude 3.5 Sonnet achieves a pass@200 of 77.0% on LiveCodeBench, outperforming both the best pass-rate achieved without any search (pass@1 = 41.4%) and using standard repeated sampling on top of existing non-search models (pass@200 = 60.6%). |
| Researcher Affiliation | Collaboration | Evan Wang (2), Federico Cassano (3, 4), Catherine Wu (5), Yunfeng Bai (1), William Song (1), Vaskar Nath (1), Ziwen Han (1), Sean Hendryx (1), Summer Yue (1), Hugh Zhang (1); (1) Scale AI, (2) California Institute of Technology, (3) Anysphere, (4) Northeastern University, (5) Anthropic, work done while at Scale AI |
| Pseudocode | Yes | Pseudocode Strategy Description...These natural language solutions are then translated into pseudocode, which is subsequently translated into actual Python code. We take a more granular approach to reduce the translation error (which may cause the model to revert to its original mode, disregarding the reasoned-through observations). We provide all prompts for all sections in Appendix M.4. |
| Open Source Code | No | No explicit statement or link to the authors' own source code repository for the described methodology is provided in the paper. |
| Open Datasets | Yes | We evaluate our search methods on three benchmarks: MBPP+, HumanEval+ (Liu et al., 2023), and LiveCodeBench (Jain et al., 2024). MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021) are some of the most widely used code benchmarks in the field. However, since both benchmarks provide only a few test cases, Liu et al. (2023) updates both benchmarks with additional test cases that increase the benchmarks' robustness to reward hacking. LiveCodeBench is a benchmark for coding that consists of competitive programming problems which typically require advanced reasoning capabilities. |
| Dataset Splits | No | The paper mentions using a subset of problems from LiveCodeBench based on date: "For this paper, we use only the subset of problems between May 2024 and September 2024 to avoid possibilities of contamination." However, it does not specify explicit training, validation, or test splits for these or other datasets in terms of percentages, counts, or splitting methodology. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions "Python programmer" and models like "Claude 3.5 Sonnet" but does not specify version numbers for Python or any other key software libraries, frameworks, or solvers used in their implementation. |
| Experiment Setup | Yes | All models are run with temperature 0.9 and top-p of 0.95. (o1-mini was run with temperature 1.0 and top-p of 1.0 because of API constraints.) Temperature was determined through a coarse hyperparameter sweep on REPEATED SAMPLING and IDEASEARCH from T ∈ {0.0, 0.1, 0.2, …, 1.2}, which we describe in Appendix F. |
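The pass@1 and pass@200 figures quoted in the Research Type row are pass@k metrics. The paper excerpt does not state which estimator the authors used; a common choice is the unbiased estimator introduced with HumanEval (Chen et al., 2021), sketched below for reference (the function name and signature here are illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Given n total samples for a problem, of which c pass the tests,
    returns the probability that at least one of k randomly drawn
    samples (without replacement) passes.
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # samples must include at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 samples, 2 correct, drawing 1 -> probability 0.5
print(pass_at_k(4, 2, 1))
```

A benchmark-level pass@k score is then the mean of this quantity over all problems. Note that reproducing pass@200 requires at least 200 samples per problem, which constrains the compute budget of any replication attempt.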