SFS: Smarter Code Space Search improves LLM Inference Scaling
Authors: Jonathan Light, Yue Wu, Yiyou Sun, Wenchao Yu, Yanchi Liu, Xujiang Zhao, Ziniu Hu, Haifeng Chen, Wei Cheng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Human Eval, MBPP, APPS, Code Contests, and Leetcode reveal significant performance gains. For instance, our method achieves a pass@1 rate of 67.1% on Human Eval+ and 87.2% on Human Eval with GPT-3.5, marking improvements of 8.6% and 4.3% over the state-of-the-art, while also halving the iterations needed to find the correct solution. Furthermore, our approach scales more efficiently than existing search techniques, including tree search, line search, and repeated sampling. We empirically validate our approach across multiple code generation benchmarks, including Human Eval, MBPP, Leetcode, APPS, and Code Contests, demonstrating significant improvements in accuracy, scalability, and solution diversity over existing search methods. |
| Researcher Affiliation | Collaboration | Jonathan Light*1, Yue Wu2, Yiyou Sun3, Wenchao Yu3, Yanchi Liu3, Xujiang Zhao3, Ziniu Hu4, Haifeng Chen3, Wei Cheng✉3 — 1Rensselaer Polytechnic Institute, 2Princeton University, 3NEC Laboratories America, 4xAI. lij54rpi.edu |
| Pseudocode | Yes | More details in Appendix C with pseudocode. ... Algorithm 1: Scattered Forest Search (SFS) with SCOUTING ... Python Pseudocode for SFS with SCOUTING |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the SFS methodology described. |
| Open Datasets | Yes | We evaluate our method on several popular code generation benchmarks. Human Eval consists of 164 human-generated Python questions (Chen et al., 2021), and MBPP includes 399 problems (Austin et al., 2021). ... Both APPS (Hendrycks et al., 2021) and Code Contests (Li et al., 2022) feature challenging code competition problems. ... Leetcode (Guo et al., 2024) includes recent problems scraped from the website |
| Dataset Splits | No | From the 10,000 problems in APPS, we randomly sample 200 for evaluation due to budget constraints. We adapt the competition format of both datasets to resemble the Human Eval format for Python evaluation. The paper describes selecting problems for evaluation and mentions using sampled subsets, but does not specify explicit train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using specific LLM models like 'gpt-3.5-turbo', 'gpt-4o-mini', 'gpt-4o', and 'llama-3.1-8b', and tools like 'Code BERT', but does not list specific software libraries or programming language versions (e.g., Python 3.x, PyTorch 1.x) with version numbers necessary for reproducing the experimental environment. |
| Experiment Setup | Yes | We run each method for 10 iterations total using gpt-3.5-turbo on APPS and report the proportion of problems where the correct solution has been discovered at each iteration. ... We run search methods for 10 iters. each using gpt-3.5-turbo on Human Eval. ... SFS integrates three key components: scattering, foresting, and scouting, with tree search performed using Monte Carlo Tree Search (MCTS). ... The algorithm begins by initializing global insights I, a forest of seed solutions S0, and validation tests V. It then performs the following steps over N iterations. ... where c is a hyperparameter controlling the exploration-exploitation trade-off. ... In the LATS experimental setup, in each iteration after a solution is generated, the algorithm is allowed to check if the new solution is correct on the ground truth tests before proceeding (Zhou et al., 2024). We show the performance of our method in the same setup here, with the same solution budget. |
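The experiment-setup row above says SFS runs tree search with MCTS over a forest of seed solutions, selecting where to refine next via a UCT-style rule with an exploration hyperparameter c. The sketch below illustrates that selection-and-backpropagation step only. The dict node representation, `select_child`, `sfs_iteration`, and the `evaluate` stub are illustrative assumptions for exposition; they are not the authors' released implementation (the paper reports no open-source code).

```python
import math

def uct_score(child_value, child_visits, parent_visits, c=1.4):
    """UCT score: average reward (exploitation) plus an exploration
    bonus weighted by the hyperparameter c from the paper's setup."""
    if child_visits == 0:
        return float("inf")  # always try unvisited seeds first
    return child_value / child_visits + c * math.sqrt(
        math.log(parent_visits) / child_visits
    )

def select_child(forest, parent_visits, c=1.4):
    """Pick the seed-solution node maximizing the UCT score."""
    return max(
        forest,
        key=lambda node: uct_score(node["value"], node["visits"],
                                   parent_visits, c),
    )

def sfs_iteration(forest, parent_visits, evaluate, c=1.4):
    """One hypothetical SFS-style step: select a promising seed via UCT,
    score the refinement (here a stand-in `evaluate`, e.g. the fraction
    of validation tests passed), and backpropagate the reward."""
    node = select_child(forest, parent_visits, c)
    reward = evaluate(node)
    node["visits"] += 1
    node["value"] += reward
    return parent_visits + 1
```

Raising c shifts the search toward less-visited seeds (exploration); lowering it concentrates refinement on the best-scoring seed (exploitation), which is the trade-off the paper's setup attributes to this hyperparameter.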