Improving Model Alignment Through Collective Intelligence of Open-Source Models

Authors: Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou

ICML 2025

Reproducibility Variable: Result. LLM Response
Research Type: Experimental. Evaluation results show that our approach can improve the win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval 2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models finetuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.
Researcher Affiliation: Collaboration. 1 Duke University, 2 Together AI, 3 University of Chicago, 4 Stanford University. Correspondence to: Junlin Wang <EMAIL>.
Pseudocode: No. The paper describes methods using equations and prompt templates (Appendix K), but does not present any formal pseudocode or algorithm blocks.
Open Source Code: No. Data and code will be released.
Open Datasets: Yes. Our evaluation primarily focuses on two benchmarks for assessing LLM alignment with human preferences: AlpacaEval 2 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024). [...] To comprehensively assess multi-turn capabilities and performance across diverse domains, we additionally employ MT-Bench (Zheng et al., 2023). [...] In terms of the instruction set, we mainly utilize Ultrafeedback (Cui et al., 2023) for both models. We also add a 5,000-instruction subset of Ultrachat-200k (Ding et al., 2023) to improve multi-turn capability.
Dataset Splits: Yes. During SFT in the first stage, we use a learning rate of 8.0e-6 and a batch size of 128 for both the LLaMA and Gemma models. For LLaMA-3.1-8B-Instruct, we train for 6 epochs, and for Gemma-2-9B-it we train for 5 epochs. Packing is used as we found that it offers better improvement. In terms of the instruction set, we mainly utilize Ultrafeedback (Cui et al., 2023) for both models. We also add a 5,000-instruction subset of Ultrachat-200k (Ding et al., 2023) to improve multi-turn capability. [...] We subsampled 6,000 instructions from Ultrafeedback as the preference optimization set for DPO. [...] The Ultrafeedback dataset comprises roughly 61,000 training instructions, while from the larger Ultrachat dataset of 200,000 instructions, we subsampled 60,000 to maintain scale parity with Ultrafeedback. The combined set, UF + UC, integrates all Ultrafeedback instructions with an additional 5,000 from Ultrachat.
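The split construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the dataset contents are synthetic stand-ins, and only the reported sizes (~61k Ultrafeedback instructions, 200k Ultrachat instructions, a 5,000-instruction Ultrachat add-on for SFT, and a 6,000-instruction DPO subsample) come from the paper.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins sized to match the paper's description:
# ~61,000 Ultrafeedback instructions and 200,000 Ultrachat instructions.
ultrafeedback = [f"uf_{i}" for i in range(61_000)]
ultrachat = [f"uc_{i}" for i in range(200_000)]

# SFT instruction set: all of Ultrafeedback plus a 5,000-instruction
# Ultrachat subset added for multi-turn coverage.
sft_set = ultrafeedback + random.sample(ultrachat, 5_000)

# DPO preference-optimization set: 6,000 instructions subsampled
# from Ultrafeedback.
dpo_set = random.sample(ultrafeedback, 6_000)

print(len(sft_set), len(dpo_set))  # 66000 6000
```

The paper does not state how the subsamples were drawn, so uniform random sampling here is an assumption.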
Hardware Specification: Yes. All experiments are done on one node of 8x A100.
Software Dependencies: No. The paper does not explicitly state specific software dependencies or their version numbers (e.g., PyTorch, TensorFlow, or CUDA versions).
Experiment Setup: Yes. During SFT in the first stage, we use a learning rate of 8.0e-6 and a batch size of 128 for both the LLaMA and Gemma models. For LLaMA-3.1-8B-Instruct, we train for 6 epochs, and for Gemma-2-9B-it we train for 5 epochs. Packing is used as we found that it offers better improvement. [...] For DPO in the second stage, we use a learning rate of 8.0e-7 for the LLaMA model and a learning rate of 3.0e-7 for the Gemma model. We use a β value of 0.01 for both models.
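To make the role of the reported β = 0.01 concrete, here is a minimal sketch of the standard DPO loss for a single preference pair (Rafailov et al.'s formulation). This is an illustration of the published objective, not the paper's training code; the function name and the example log-probabilities are invented for demonstration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.01):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the trained policy (pi_*) and the frozen reference model (ref_*).
    beta=0.01 matches the value the paper reports for both models.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Loss is -log(sigmoid(beta * margin)).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no margin the loss equals log(2); a positive margin pushes it lower.
print(dpo_loss(-10.0, -20.0, -12.0, -15.0))
```

The small β of 0.01 makes the loss only weakly sensitive to the margin, i.e., it tolerates large deviations from the reference model before the gradient saturates.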