SFS: Smarter Code Space Search improves LLM Inference Scaling
Authors: Jonathan Light, Yue Wu, Yiyou Sun, Wenchao Yu, Yanchi Liu, Xujiang Zhao, Ziniu Hu, Haifeng Chen, Wei Cheng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Human Eval, MBPP, APPS, Code Contests, and Leetcode reveal significant performance gains. For instance, our method achieves a pass@1 rate of 67.1% on Human Eval+ and 87.2% on Human Eval with GPT-3.5, marking improvements of 8.6% and 4.3% over the state-of-the-art, while also halving the iterations needed to find the correct solution. Furthermore, our approach scales more efficiently than existing search techniques, including tree search, line search, and repeated sampling. We empirically validate our approach across multiple code generation benchmarks, including Human Eval, MBPP, Leetcode, APPS, and Code Contests, demonstrating significant improvements in accuracy, scalability, and solution diversity over existing search methods. |
| Researcher Affiliation | Collaboration | Jonathan Light*1, Yue Wu2, Yiyou Sun3, Wenchao Yu3, Yanchi Liu3, Xujiang Zhao3, Ziniu Hu4, Haifeng Chen3, Wei Cheng✉3 — 1Rensselaer Polytechnic Institute, 2Princeton University, 3NEC Laboratories America, 4xAI. lij54rpi.edu |
| Pseudocode | Yes | More details in Appendix C with pseudocode. ... Algorithm 1: Scattered Forest Search (SFS) with SCOUTING ... Python Pseudocode for SFS with SCOUTING |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the SFS methodology described. |
| Open Datasets | Yes | We evaluate our method on several popular code generation benchmarks. Human Eval consists of 164 human-generated Python questions (Chen et al., 2021), and MBPP includes 399 problems (Austin et al., 2021). ... Both APPS (Hendrycks et al., 2021) and Code Contests (Li et al., 2022) feature challenging code competition problems. ... Leetcode (Guo et al., 2024) includes recent problems scraped from the website |
| Dataset Splits | No | From the 10,000 problems in APPS, we randomly sample 200 for evaluation due to budget constraints. We adapt the competition format of both datasets to resemble the Human Eval format for Python evaluation. The paper describes selecting problems for evaluation and mentions using sampled subsets, but does not specify explicit train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using specific LLM models like 'gpt-3.5-turbo', 'gpt-4o-mini', 'gpt-4o', and 'llama-3.1-8b', and tools like 'Code BERT', but does not list specific software libraries or programming language versions (e.g., Python 3.x, PyTorch 1.x) with version numbers necessary for reproducing the experimental environment. |
| Experiment Setup | Yes | We run each method for 10 iterations total using gpt-3.5-turbo on APPS and report the proportion of problems where the correct solution has been discovered at each iteration. ... We run search methods for 10 iters. each using gpt-3.5-turbo on Human Eval. ... SFS integrates three key components: scattering, foresting, and scouting, with tree search performed using Monte Carlo Tree Search (MCTS). ... The algorithm begins by initializing global insights I, a forest of seed solutions S0, and validation tests V. It then performs the following steps over N iterations. ... where c is a hyperparameter controlling the exploration-exploitation trade-off. ... In the LATS experimental setup, in each iteration after a solution is generated, the algorithm is allowed to check if the new solution is correct on the ground truth tests before proceeding (Zhou et al., 2024). We show the performance of our method in the same setup here, with the same solution budget. |
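The experiment-setup row above says SFS runs tree search with MCTS over a forest of seed solutions, selecting where to refine next via a UCT-style rule with an exploration hyperparameter c. The sketch below illustrates that selection-and-backpropagation step only. The dict node representation, `select_child`, `sfs_iteration`, and the `evaluate` stub are illustrative assumptions for exposition; they are not the authors' released implementation (the paper reports no open-source code).

```python
import math

def uct_score(child_value, child_visits, parent_visits, c=1.4):
    """UCT score: average reward (exploitation) plus an exploration
    bonus weighted by the hyperparameter c from the paper's setup."""
    if child_visits == 0:
        return float("inf")  # always try unvisited seeds first
    return child_value / child_visits + c * math.sqrt(
        math.log(parent_visits) / child_visits
    )

def select_child(forest, parent_visits, c=1.4):
    """Pick the seed-solution node maximizing the UCT score."""
    return max(
        forest,
        key=lambda node: uct_score(node["value"], node["visits"],
                                   parent_visits, c),
    )

def sfs_iteration(forest, parent_visits, evaluate, c=1.4):
    """One hypothetical SFS-style step: select a promising seed via UCT,
    score the refinement (here a stand-in `evaluate`, e.g. the fraction
    of validation tests passed), and backpropagate the reward."""
    node = select_child(forest, parent_visits, c)
    reward = evaluate(node)
    node["visits"] += 1
    node["value"] += reward
    return parent_visits + 1
```

Raising c shifts the search toward less-visited seeds (exploration); lowering it concentrates refinement on the best-scoring seed (exploitation), which is the trade-off the paper's setup attributes to this hyperparameter.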