CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
Authors: Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4. |
| Researcher Affiliation | Collaboration | 1Massachusetts Institute of Technology, Boston, MA, USA 2Harvard University, Boston, MA, USA 3University of Illinois Urbana-Champaign, Urbana, IL, USA 4MIT-IBM Watson AI Lab, Boston, MA, USA. Correspondence to: Yongchao Chen <EMAIL>, Chuchu Fan <EMAIL>. |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps for a method or procedure formatted as such. Figure 10 in Appendix F presents actual Python code for the Symbolic Checker, which is not pseudocode. |
| Open Source Code | Yes | Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98. |
| Open Datasets | Yes | Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98. |
| Dataset Splits | Yes | We randomly select 28 of the 37 SymBench tasks, using a distinct set of samples without overlap with the test samples. This setup allows us to evaluate CodeSteer on 28 seen tasks (with different test samples) and on the remaining 9 unseen tasks. We synthesize 12k multi-turn guidance/generation trajectories for SFT and 5.5k guidance comparison pairs for DPO. The specific data number for each task is in Appendix Sec. G. Experimental settings: We use GPT-4o as the Task LLM to test 28 seen and 9 unseen tasks, each with 100 samples of varying complexity. |
| Hardware Specification | Yes | Both processes are fine-tuned with full parameters on 4×H100 GPUs for 4-10 epochs. In most cases, we perform the inference of CodeSteerLLM using a single H100 80GB GPU. However, to analyze the impact of hardware configurations on CodeSteer runtime, as shown in Fig. 5, we also conduct inference using four H100 GPUs for comparison. |
| Software Dependencies | No | The paper mentions fine-tuning the Llama-3.1-8B model, but it does not provide specific version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA) used in the experiments. |
| Experiment Setup | Yes | The model is trained for 10 epochs in the SFT stage and 6 epochs in the DPO stage. The learning rate is set to 1×10⁻⁵ for SFT and 5×10⁻⁶ for DPO. We use a batch size of 4 for training. In DPO, the loss function follows the standard sigmoid loss (Rafailov et al., 2024), with the hyperparameter β set to 0.1. |
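The experiment setup quotes the standard sigmoid DPO loss with β = 0.1. As a reference for what that objective computes, below is a minimal sketch of the per-pair loss from Rafailov et al. (2024); the function name and scalar (non-batched) interface are illustrative assumptions, not code from the CodeSteer repository.

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard sigmoid DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. beta=0.1 matches the
    paper's reported hyperparameter.
    """
    # Policy-to-reference log-ratios for the preferred and dispreferred responses.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Loss is -log(sigmoid(margin)); softplus(-margin) is the numerically
    # stable form of the same quantity.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

For example, if the policy assigns a higher log-ratio to the chosen guidance than to the rejected one, the margin is positive and the loss falls below log 2; a policy identical to the reference yields a margin of zero and a loss of exactly log 2.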