Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Authors: Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments, we demonstrate that AUTOIF significantly improves performance across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to leading open-source LLMs, Qwen2-72B and LLaMA3-70B, in both self-alignment and strong-to-weak distillation settings. We conduct a comprehensive evaluation on five general instruction-following datasets, verifying AUTOIF's strong general instruction alignment capabilities. Notably, we first achieve Loose Instruction accuracy rates of 88.0% with Qwen2-72B and 90.4% with LLaMA3-70B on IFEval, the most widely used instruction-following benchmark, while significantly preserving the LLM's coding, mathematical, and general interaction capabilities.
Researcher Affiliation Industry Qwen Team, Alibaba Inc.
Pseudocode No The paper includes figures illustrating the method overview (Fig. 2) and training strategies (Fig. 3), and provides examples of verification functions in Python code (Table 6, Table 7), but it does not present structured pseudocode or algorithm blocks for the main AUTOIF algorithm.
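The verification functions shown in the paper's Table 6 and Table 7 are short executable checkers that decide whether a response satisfies an instruction's constraints. As a minimal sketch in that spirit (the instruction wording, function name, and constraints below are hypothetical, not taken from the paper):

```python
def verify_lowercase_word_limit(response: str, max_words: int = 50) -> bool:
    """Hypothetical AUTOIF-style verification function for an instruction
    such as: 'Answer entirely in lowercase, using at most 50 words.'
    Returns True only if every constraint holds."""
    words = response.split()
    return response == response.lower() and len(words) <= max_words
```

In the AUTOIF pipeline, a candidate response is kept only when functions like this return True, turning instruction compliance into an automatically checkable signal.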
Open Source Code Yes Our code is available at https://github.com/QwenLM/AutoIF
Open Datasets Yes We evaluate our methods using two widely-used instruction-following benchmarks: IFEval (Zhou et al., 2023) and FollowBench (Jiang et al., 2024b) as main results. IFEval comprises 25 types of verifiable instructions across about 500 prompts. To explore AUTOIF in more natural instruction-following scenarios, we further introduce the complex instruction-following dataset InfoBench (Qin et al., 2024b), the general natural instruction evaluation set MT-Bench (Zheng et al., 2023), and the real-world chatbot evaluation set Arena-Hard (Li et al., 2024c) as cross-domain validation. At the same time, we also evaluated our models on C-Eval (Huang et al., 2023), MMLU (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and HumanEval (Chen et al., 2021a) to obtain a complete capability evaluation. ShareGPT refers to the multi-turn chatting histories used by Vicuna (Chiang et al., 2023). GSM8K (Cobbe et al., 2021) is a mathematical dataset... HumanEval (Chen et al., 2021b) includes 164 unique programming challenges... MMLU (Hendrycks et al., 2021) is a benchmark... C-Eval (Huang et al., 2023) consists of multiple-choice questions... MT-Bench (Zheng et al., 2023) is a comprehensive benchmark... Arena-Hard (Li et al., 2024c) is a significant dataset... InfoBench (Qin et al., 2024b) is a benchmark...
Dataset Splits Yes GSM8K (Cobbe et al., 2021) is a mathematical dataset designed to evaluate the mathematical problem-solving abilities of language models. It consists of roughly 8,000 diverse, high-quality grade school-level math word problems, which require understanding and manipulating mathematical concepts to arrive at a correct solution, split into 7,473 training samples and 1,319 testing samples.
Hardware Specification Yes We run all our experiments on NVIDIA A100 and H800 GPUs. Specifically, we train Qwen2-7B and LLaMA3-8B on 8 A100 GPUs, while Qwen2-72B-Instruct and LLaMA3-70B-Instruct are trained on 64 H800 GPUs.
Software Dependencies No In the SFT phase, we perform full fine-tuning on Qwen2-7B and LLaMA3-8B with a learning rate of 7e-6, using a linear scheduler with 20 warm-up steps. All models are trained with DeepSpeed ZeRO Stage 3 (Rasley et al., 2020) and Flash-Attention 2 (Dao, 2023). ... For NLI filtering, we use mDeBERTa as our filtering model, and filter out only samples predicted as "Contradiction" (approximately 15%). The NLI model is available at https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7. The paper mentions software tools such as DeepSpeed ZeRO Stage 3, Flash-Attention 2, and a specific mDeBERTa model, but does not provide version numbers for these dependencies or a programming-language version for the overall experimental setup.
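The NLI filtering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `nli_label` is a hypothetical stand-in for the mDeBERTa-v3 XNLI classifier and is assumed to return one of "entailment", "neutral", or "contradiction" for an (instruction, response) pair.

```python
def nli_filter(samples, nli_label):
    """Keep (instruction, response) pairs unless the NLI model predicts
    'Contradiction' between them; the paper reports this removes
    approximately 15% of samples."""
    return [
        (instruction, response)
        for instruction, response in samples
        if nli_label(instruction, response) != "Contradiction"
    ]
```

Filtering only on the "Contradiction" label is a conservative choice: "neutral" pairs survive, so the filter removes clearly inconsistent responses without demanding strict entailment.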
Experiment Setup Yes In the SFT phase, we perform full fine-tuning on Qwen2-7B and LLaMA3-8B with a learning rate of 7e-6, using a linear scheduler with 20 warm-up steps. All models are trained with DeepSpeed ZeRO Stage 3 (Rasley et al., 2020) and Flash-Attention 2 (Dao, 2023). We use a global batch size of 128, a weight decay of 0.1, and train for 3 epochs. Mixed precision training with bf16 is used, and the maximum context length is set to 8192 tokens. For Qwen2-72B and LLaMA3-70B, the global batch size is 512. In the DPO phase, the learning rate is set to 5e-7 with a cosine scheduler and a 0.1 warm-up ratio. We use DeepSpeed ZeRO Stage 3 and Flash-Attention 2 for efficiency, with a global batch size of 64. Training utilizes a sigmoid loss function with a beta value of 0.3 and spans 2 epochs, with checkpoints every 200 steps. Mixed precision training with bf16 is employed, and the maximum context length is 4096 tokens.
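The "sigmoid loss function with a beta value of 0.3" in the DPO phase refers to the standard DPO objective (Rafailov et al., 2023). A minimal per-example sketch, assuming summed token log-probabilities under the policy and the frozen reference model (the function name and argument names are illustrative):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.3):
    """Per-example DPO sigmoid loss with the beta value reported in the
    paper's DPO phase. Each argument is a summed log-probability of the
    chosen or rejected response under the policy or reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy ranks the chosen
    # response above the rejected one relative to the reference model.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A lower beta softens the implicit reward margin, so beta = 0.3 penalizes preference violations less sharply than the common default of 0.5 or 1.0.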