Improving Model Alignment Through Collective Intelligence of Open-Source Models
Authors: Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results show that our approach can improve win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval 2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models finetuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released. |
| Researcher Affiliation | Collaboration | 1Duke University 2Together AI 3University of Chicago 4Stanford University. Correspondence to: Junlin Wang <EMAIL>. |
| Pseudocode | No | The paper describes methods using equations and prompt templates (Appendix K), but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Data and code will be released. |
| Open Datasets | Yes | Our evaluation primarily focuses on two benchmarks for assessing LLM alignment with human preferences: Alpaca Eval 2 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024). [...] To comprehensively assess multi-turn capabilities and performance across diverse domains, we additionally employ MT-Bench (Zheng et al., 2023). [...] In terms of the instruction set, we mainly utilize Ultrafeedback (Cui et al., 2023) for both models. We also add a 5,000 subset of Ultrachat-200k (Ding et al., 2023) to improve multi-turn capability. |
| Dataset Splits | Yes | During SFT in the first stage, we use a learning rate of 8.0e-6 and batch size of 128 for both llama and gemma models. For LLaMA-3.1-8B-Instruct, we train for 6 epochs, and for Gemma-2-9B-it we train for 5 epochs. Packing is used as we found that it offers better improvement. In terms of the instruction set, we mainly utilize Ultrafeedback (Cui et al., 2023) for both models. We also add a 5,000 subset of Ultrachat-200k (Ding et al., 2023) to improve multi-turn capability. [...] We subsampled 6,000 instructions from Ultrafeedback as the preference optimization set for DPO. [...] The Ultrafeedback dataset comprises roughly 61,000 training instructions, while from the larger Ultrachat dataset of 200,000 instructions, we subsampled 60,000 to maintain scale parity with Ultrafeedback. The combined set, UF + UC, integrates all Ultrafeedback instructions with an additional 5,000 from Ultrachat. |
| Hardware Specification | Yes | All experiments are done on one node of 8x A100. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies or their version numbers, such as PyTorch, TensorFlow, or CUDA versions. |
| Experiment Setup | Yes | During SFT in the first stage, we use a learning rate of 8.0e-6 and batch size of 128 for both llama and gemma models. For LLaMA-3.1-8B-Instruct, we train for 6 epochs, and for Gemma-2-9B-it we train for 5 epochs. Packing is used as we found that it offers better improvement. [...] For DPO in the second stage, we use a learning rate of 8.0e-7 for the llama model and a learning rate of 3.0e-7 for the gemma model. We use a β value of 0.01 for both models. |
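The data mixtures quoted in the Dataset Splits row can be summarized as a short sketch. This is an illustrative reconstruction, not the authors' released code: the dataset sizes come from the quoted text, while the function name `build_mixtures` and the use of integer placeholders for instructions are assumptions.

```python
import random

# Sizes taken from the quoted evidence; instructions are stand-in integers.
UF_SIZE = 61_000   # Ultrafeedback training instructions (approx.)
UC_SIZE = 200_000  # Ultrachat-200k instructions

def build_mixtures(seed: int = 0):
    rng = random.Random(seed)
    uf = list(range(UF_SIZE))
    uc = list(range(UC_SIZE))

    # SFT mixture "UF + UC": all of Ultrafeedback plus a 5,000-instruction
    # subset of Ultrachat to improve multi-turn capability.
    uf_plus_uc = uf + rng.sample(uc, 5_000)

    # Scale-parity ablation: 60,000 Ultrachat instructions, matching the
    # rough size of Ultrafeedback.
    uc_60k = rng.sample(uc, 60_000)

    # Second-stage preference set: 6,000 instructions subsampled from
    # Ultrafeedback for DPO.
    dpo_set = rng.sample(uf, 6_000)
    return uf_plus_uc, uc_60k, dpo_set
```

The paper does not state how the subsets were sampled, so uniform random sampling with a fixed seed is assumed here for reproducibility.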
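The hyperparameters reported in the Experiment Setup row can be collected into a single table-like structure. The values (learning rates, epochs, batch size, DPO β, packing) are exactly those quoted above; the `StageConfig` dataclass and its field names are purely illustrative, since the paper gives no configuration code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    model: str
    stage: str                       # "sft" or "dpo"
    learning_rate: float
    epochs: Optional[int] = None     # SFT only
    batch_size: Optional[int] = None # SFT only
    beta: Optional[float] = None     # DPO only
    packing: bool = False            # SFT only

# Hyperparameters as reported in the paper's Experiment Setup.
CONFIGS = [
    StageConfig("LLaMA-3.1-8B-Instruct", "sft", 8.0e-6,
                epochs=6, batch_size=128, packing=True),
    StageConfig("Gemma-2-9B-it", "sft", 8.0e-6,
                epochs=5, batch_size=128, packing=True),
    StageConfig("LLaMA-3.1-8B-Instruct", "dpo", 8.0e-7, beta=0.01),
    StageConfig("Gemma-2-9B-it", "dpo", 3.0e-7, beta=0.01),
]
```

Laying the settings out this way makes the asymmetry visible at a glance: the two models share SFT settings apart from epoch count, but use different DPO learning rates while sharing β = 0.01.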