Selective Prompt Anchoring for Code Generation

Authors: Yuan Tian, Tianyi Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA methods in all settings.
Researcher Affiliation Academia 1Department of Computer Science, Purdue University, West Lafayette, IN, USA. Correspondence to: Yuan Tian <EMAIL>, Tianyi Zhang <EMAIL>.
Pseudocode No The paper describes the workflow of Selective Prompt Anchoring (SPA) using text and a diagram (Figure 1), and derives mathematical equations for augmented logits, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code Yes Our code is available at https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.
Open Datasets Yes HumanEval (Chen et al., 2021) includes 164 Python tasks designed by OpenAI and has now become a widely used benchmark for code generation. MBPP (Austin et al., 2021) is another popular benchmark that includes 974 crowd-sourced Python tasks. HumanEval+ and MBPP+ (Liu et al., 2023) improve the original HumanEval and MBPP benchmarks with additional test cases to cover corner cases (Liu et al., 2024b). HumanEval-X (Hendrycks et al., 2021a) extends the HumanEval benchmark to support more programming languages, such as Python, Java, JavaScript, C++, and Go. BigCodeBench (Zhuo et al., 2024) is a more challenging benchmark for code generation that evaluates models' abilities to follow complex instructions and use tools, including 1,140 real-world Python tasks across 139 libraries. LiveCodeBench (Jain et al., 2024) is a contamination-free benchmark sourced from competitive programming platforms. ... TruthfulQA (Lin et al., 2022), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021b), and BoolQ (Clark et al., 2019).
Dataset Splits Yes For each model and dataset, we use grid search to tune the anchoring strength ω on 1/5 of the tasks, and evaluate Pass@1 of SPA on the remaining 4/5 of the tasks. This process is repeated across all five folds, with final performance metrics averaged across folds. ... We divide the Human Eval dataset into three equal-sized subsets (Short, Medium, and Long) based on the 33rd and 66th percentiles of prompt lengths.
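The 5-fold tuning protocol quoted above can be sketched as follows. This is a minimal sketch: the helper name and the use of contiguous folds are assumptions, since the paper only states that each fold tunes ω on 1/5 of the tasks and evaluates Pass@1 on the remaining 4/5.

```python
def five_fold_splits(task_ids):
    """Yield (tune, evaluate) task lists: each fold tunes the anchoring
    strength omega on 1/5 of the tasks and evaluates on the rest."""
    n = len(task_ids)
    fold_size = n // 5
    for k in range(5):
        tune = task_ids[k * fold_size:(k + 1) * fold_size]
        evaluate = [t for t in task_ids if t not in tune]
        yield tune, evaluate

# Example: 10 dummy task ids -> five folds of 2 tuning / 8 evaluation tasks,
# with per-fold metrics averaged afterward as the paper describes.
folds = list(five_fold_splits(list(range(10))))
```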
Hardware Specification Yes All experiments were conducted on a 64-bit Ubuntu 22.04 LTS system, equipped with an AMD EPYC 7313 CPU, eight NVIDIA A5500 GPUs, and 512GB of memory.
Software Dependencies No All six models in our paper are built upon the Hugging Face Transformers library, which offers APIs to directly access and edit token embeddings and logits. In particular, the SPA generator inherits from the Hugging Face Transformers generation API.
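The hook point such an integration would use can be illustrated without the library itself. Hugging Face's `LogitsProcessor` subclasses implement `__call__(input_ids, scores)` and are passed to `model.generate()`; the standalone mock below mirrors that interface shape only (the class and its rescaling behavior are hypothetical, not the paper's actual code).

```python
class ScaleProcessor:
    """Hypothetical processor mirroring the transformers LogitsProcessor
    interface: takes the token ids generated so far plus the next-token
    scores, and returns edited scores."""

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, input_ids, scores):
        # Edit the next-token scores; SPA would combine logits from two
        # forward passes here instead of simply rescaling.
        return [s * self.scale for s in scores]

proc = ScaleProcessor(scale=2.0)
out = proc([0, 1, 2], [0.5, -1.0, 3.0])
```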
Experiment Setup Yes We set the temperature to 0 and the beam size to 1 for greedy decoding in all experiments, except for the one described in Appendix F. The anchoring strength ω serves as a hyperparameter in SPA. For each model and dataset, we use grid search to tune the anchoring strength ω on 1/5 of the tasks... A value of ω = 1.25 consistently yields robust improvements across all settings.
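The paper's augmented-logits equations are not quoted in this report. One plausible reading of an anchoring strength ω > 1, sketched below, is a classifier-free-guidance-style extrapolation between logits computed with the full prompt and logits computed with the anchored text masked; the exact combination form is an assumption here, not the paper's stated equation.

```python
import numpy as np

def augmented_logits(logits_full, logits_masked, omega=1.25):
    """Combine next-token logits from the full prompt with logits from a
    copy whose anchored text is masked. With omega > 1 this extrapolates
    toward the anchored tokens (assumed CFG-style form)."""
    return omega * logits_full + (1.0 - omega) * logits_masked

full = np.array([2.0, 1.0, 0.0])     # logits given the full prompt
masked = np.array([1.0, 1.0, 1.0])   # logits with anchored text masked
combined = augmented_logits(full, masked)  # uses the paper's omega = 1.25
```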