SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. We conduct experiments on several LLMs, the LLaMA3 series (Meta AI, 2024), GLM-4-9B (GLM et al., 2024), and Mistral-7B-Instruct (Jiang et al., 2023a), over multiple iterations. Through extensive experiments, we demonstrate significant improvements in the models' instruction-following capability, outperforming other self-improvement methods such as self-rewarding (Yuan et al., 2024) and meta-rewarding (Wu et al., 2024). |
| Researcher Affiliation | Collaboration | Jiale Cheng1,2, Xiao Liu2,3, Cunxiang Wang2,3, Xiaotao Gu2, Yida Lu1,2, Dan Zhang3, Yuxiao Dong3, Jie Tang3, Hongning Wang1, Minlie Huang1 ... 1The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University 2Zhipu AI 3The Knowledge Engineering Group (KEG), Tsinghua University |
| Pseudocode | Yes | We show the detailed process of BFS and DFS refinement in Algorithm 1 and Algorithm 2. |
| Open Source Code | Yes | Our code and data are publicly available at https://github.com/thu-coai/SPaR. |
| Open Datasets | Yes | Our code and data are publicly available at https://github.com/thu-coai/SPaR. ... We construct a high-quality dataset with 43K complex instruction-following prompts and an SFT dataset that can improve the instruction-following capabilities of LLMs. ... To assess the actor’s ability to follow instructions, we rely on two widely-used benchmarks: IFEval (Zhou et al., 2023) and FollowBench (Jiang et al., 2023b). ... For assessing the refiner’s judgment capability, we turn to LLMBar (Zeng et al., 2023), a dataset designed to measure the assessment ability of LLMs in the context of instruction-following tasks. |
| Dataset Splits | Yes | The taxonomy-based prompt construction results in about 43k prompts. We utilize 8k prompts for actor initialization, another 5k for the refiner, and save 30k for further self-play training. ... To evaluate the refiner’s capability in refinement, we split 200 samples from the DRSFT to create a test set... |
| Hardware Specification | Yes | All experiments are performed on an 8×80GB NVIDIA A100 setup. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer' and 'DPO'/'RFT' as methods but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python version) used to implement the experiments. |
| Experiment Setup | Yes | Both the actor and refiner are trained with a learning rate of 2e-6 and a warmup ratio of 0.1, using the AdamW optimizer with β1 = 0.9 and β2 = 0.999. The actor is trained over 5 epochs with a batch size of 64, and the refiner is trained for 3 epochs with the same batch size. In the data construction process, we set a tree search budget of 15 to strike a balance between performance and efficiency. ... For the actor iterative training, each iteration uses around 5k examples for DPO. To enhance training stability as suggested by (Hou et al., 2024), an additional SFT loss is added to the chosen response with a weight of 0.1. Here, the learning rate is set to 2e-7, β to 0.1, with a warmup ratio of 0.1, and training is conducted for 1 epoch with a batch size of 32. For the refiner, each iteration utilizes about 10k examples, including 4k refinement samples. We ensure the judgment training dataset maintains a balance of positive and negative samples. The training configuration remains the same as for SFT, except the learning rate is set to 1e-6. All experiments are performed on an 8×80GB NVIDIA A100 setup. |
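The paper's Algorithms 1 and 2 (BFS and DFS refinement) are not reproduced in this report. As a rough illustration of the BFS variant, the sketch below shows how a judge-and-refine breadth-first search bounded by the stated node budget of 15 might be structured. The `judge` and `refine` callables and their signatures are assumptions for illustration, not the paper's actual interfaces.

```python
from collections import deque

def bfs_refine(instruction, response, judge, refine, branching=3, budget=15):
    """Sketch of BFS tree-search refinement (interfaces are hypothetical).

    judge(instruction, response) -> (passed: bool, critique: str)
    refine(instruction, response, critique) -> list[str] of candidate refinements
    """
    passed, critique = judge(instruction, response)
    if passed:
        return response  # original response already follows the instruction

    queue = deque([(response, critique)])  # frontier of failed responses
    expanded = 0
    while queue and expanded < budget:
        resp, crit = queue.popleft()
        for cand in refine(instruction, resp, crit)[:branching]:
            expanded += 1
            ok, new_crit = judge(instruction, cand)
            if ok:
                return cand  # first passing refinement found
            queue.append((cand, new_crit))
            if expanded >= budget:
                break
    return None  # budget exhausted without a passing refinement
```

The breadth-first order means shallow (cheap) refinements are judged before deeper rewrite chains, which matches the paper's motivation of balancing performance against search cost.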
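The actor's iterative training objective quoted above combines a DPO loss (β = 0.1) with an auxiliary SFT loss on the chosen response weighted by 0.1. A minimal per-pair sketch of that combined objective, written in plain Python over sequence log-probabilities (the function name and scalar formulation are illustrative, not the paper's implementation):

```python
import math

def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_with_sft_loss(policy_chosen_lp, policy_rejected_lp,
                      ref_chosen_lp, ref_rejected_lp,
                      beta=0.1, sft_weight=0.1):
    """DPO loss plus an SFT (NLL) term on the chosen response.

    Arguments are total log-probabilities of the chosen/rejected responses
    under the policy and the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    dpo = -logsigmoid(chosen_reward - rejected_reward)
    sft = -policy_chosen_lp  # negative log-likelihood of the chosen response
    return dpo + sft_weight * sft
```

The SFT term anchors the policy to the chosen responses, the stabilization effect the paper attributes to Hou et al. (2024).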