Improving Model Alignment Through Collective Intelligence of Open-Source Models

Authors: Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou

ICML 2025

Reproducibility Variable: Result. LLM Response
Research Type: Experimental. Evaluation results show that our approach can improve the win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval 2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models finetuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.
Researcher Affiliation: Collaboration. 1 Duke University, 2 Together AI, 3 University of Chicago, 4 Stanford University. Correspondence to: Junlin Wang <EMAIL>.
Pseudocode: No. The paper describes methods using equations and prompt templates (Appendix K), but does not present any formal pseudocode or algorithm blocks.
Open Source Code: No. Data and code will be released.
Open Datasets: Yes. Our evaluation primarily focuses on two benchmarks for assessing LLM alignment with human preferences: AlpacaEval 2 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024). [...] To comprehensively assess multi-turn capabilities and performance across diverse domains, we additionally employ MT-Bench (Zheng et al., 2023). [...] In terms of the instruction set, we mainly utilize Ultrafeedback (Cui et al., 2023) for both models. We also add a 5,000-instruction subset of Ultrachat-200k (Ding et al., 2023) to improve multi-turn capability.
Dataset Splits: Yes. During SFT in the first stage, we use a learning rate of 8.0e-6 and a batch size of 128 for both the LLaMA and Gemma models. For LLaMA-3.1-8B-Instruct, we train for 6 epochs, and for Gemma-2-9B-it we train for 5 epochs. Packing is used as we found that it offers better improvement. In terms of the instruction set, we mainly utilize Ultrafeedback (Cui et al., 2023) for both models. We also add a 5,000-instruction subset of Ultrachat-200k (Ding et al., 2023) to improve multi-turn capability. [...] We subsampled 6,000 instructions from Ultrafeedback as the preference optimization set for DPO. [...] The Ultrafeedback dataset comprises roughly 61,000 training instructions, while from the larger Ultrachat dataset of 200,000 instructions, we subsampled 60,000 to maintain scale parity with Ultrafeedback. The combined set, UF + UC, integrates all Ultrafeedback instructions with an additional 5,000 from Ultrachat.
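The split construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the dataset contents are synthetic stand-ins, and only the reported sizes (~61k Ultrafeedback instructions, 200k Ultrachat instructions, a 5,000-instruction Ultrachat add-on for SFT, and a 6,000-instruction DPO subsample) come from the paper.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins sized to match the paper's description:
# ~61,000 Ultrafeedback instructions and 200,000 Ultrachat instructions.
ultrafeedback = [f"uf_{i}" for i in range(61_000)]
ultrachat = [f"uc_{i}" for i in range(200_000)]

# SFT instruction set: all of Ultrafeedback plus a 5,000-instruction
# Ultrachat subset added for multi-turn coverage.
sft_set = ultrafeedback + random.sample(ultrachat, 5_000)

# DPO preference-optimization set: 6,000 instructions subsampled
# from Ultrafeedback.
dpo_set = random.sample(ultrafeedback, 6_000)

print(len(sft_set), len(dpo_set))  # 66000 6000
```

The paper does not state how the subsamples were drawn, so uniform random sampling here is an assumption.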
Hardware Specification: Yes. All experiments are done on one node of 8x A100.
Software Dependencies: No. The paper does not explicitly state specific software dependencies or their version numbers (e.g., PyTorch, TensorFlow, or CUDA versions).
Experiment Setup: Yes. During SFT in the first stage, we use a learning rate of 8.0e-6 and a batch size of 128 for both the LLaMA and Gemma models. For LLaMA-3.1-8B-Instruct, we train for 6 epochs, and for Gemma-2-9B-it we train for 5 epochs. Packing is used as we found that it offers better improvement. [...] For DPO in the second stage, we use a learning rate of 8.0e-7 for the LLaMA model and a learning rate of 3.0e-7 for the Gemma model. We use a β value of 0.01 for both models.
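To make the role of the reported β = 0.01 concrete, here is a minimal sketch of the standard DPO loss for a single preference pair (Rafailov et al.'s formulation). This is an illustration of the published objective, not the paper's training code; the function name and the example log-probabilities are invented for demonstration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.01):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the trained policy (pi_*) and the frozen reference model (ref_*).
    beta=0.01 matches the value the paper reports for both models.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Loss is -log(sigmoid(beta * margin)).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no margin the loss equals log(2); a positive margin pushes it lower.
print(dpo_loss(-10.0, -20.0, -12.0, -15.0))
```

The small β of 0.01 makes the loss only weakly sensitive to the margin, i.e., it tolerates large deviations from the reference model before the gradient saturates.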