CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
Authors: Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4. |
| Researcher Affiliation | Collaboration | 1Massachusetts Institute of Technology, Boston, MA, USA 2Harvard University, Boston, MA, USA 3University of Illinois Urbana-Champaign, Urbana, IL, USA 4MIT-IBM Watson AI Lab, Boston, MA, USA. Correspondence to: Yongchao Chen <EMAIL>, Chuchu Fan <EMAIL>. |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps for a method or procedure formatted as such. Figure 10 in Appendix F presents actual Python code for the Symbolic Checker, which is not pseudocode. |
| Open Source Code | Yes | Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98. |
| Open Datasets | Yes | Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98. |
| Dataset Splits | Yes | We randomly select 28 of the 37 SymBench tasks, using a distinct set of samples without overlap with the test samples. This setup allows us to evaluate CodeSteer on 28 seen tasks (with different test samples) and on the remaining 9 unseen tasks. We synthesize 12k multi-turn guidance/generation trajectories for SFT and 5.5k guidance comparison pairs for DPO. The specific data number for each task is in Appendix Sec. G. Experimental settings: We use GPT-4o as the Task LLM to test 28 seen and 9 unseen tasks, each with 100 samples of varying complexity. |
| Hardware Specification | Yes | Both processes are fine-tuned with full parameters on 4×H100 GPUs for 4-10 epochs. In most cases, we perform the inference of CodeSteerLLM using a single H100 80GB GPU. However, to analyze the impact of hardware configurations on CodeSteer runtime, as shown in Fig. 5, we also conduct inference using four H100 GPUs for comparison. |
| Software Dependencies | No | The paper mentions fine-tuning the Llama-3.1-8B model, but it does not provide specific version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA) used in the experiments. |
| Experiment Setup | Yes | The model is trained for 10 epochs in the SFT stage and 6 epochs in the DPO stage. The learning rate is set to 1×10⁻⁵ for SFT and 5×10⁻⁶ for DPO. We use a batch size of 4 for training. In DPO, the loss function follows the standard sigmoid loss (Rafailov et al., 2024), with the hyperparameter β set to 0.1. |
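The experiment setup quotes the standard sigmoid DPO loss with β = 0.1. As a reference for what that objective computes, below is a minimal sketch of the per-pair loss from Rafailov et al. (2024); the function name and scalar (non-batched) interface are illustrative assumptions, not code from the CodeSteer repository.

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard sigmoid DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. beta=0.1 matches the
    paper's reported hyperparameter.
    """
    # Policy-to-reference log-ratios for the preferred and dispreferred responses.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Loss is -log(sigmoid(margin)); softplus(-margin) is the numerically
    # stable form of the same quantity.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

For example, if the policy assigns a higher log-ratio to the chosen guidance than to the rejected one, the margin is positive and the loss falls below log 2; a policy identical to the reference yields a margin of zero and a loss of exactly log 2.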