Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform a comprehensive analysis of the MAGPIE-generated data. To compare MAGPIE-generated data with other public instruction datasets (e.g., ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, Tulu-V2-Mix, GenQA), we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that using MAGPIE data for supervised fine-tuning (SFT) alone can surpass the performance of previous public datasets utilized for both SFT and preference optimization...
Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI
Pseudocode | No | The paper describes the MAGPIE pipeline in prose in Section 2.1 and illustrates it with a diagram in Figure 1, but does not present structured pseudocode or algorithm blocks.
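Since the paper describes the pipeline only in prose, the first-turn generation step can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the idea is that an aligned chat model given only its pre-query template will autocomplete a user instruction, which is then wrapped in the full template to elicit the paired response. The Llama-3 template strings below are assumed from the standard chat format.

```python
# Sketch of MAGPIE-style self-synthesis (illustrative, not the authors' code).
# Template strings assume the standard Llama-3 chat format.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def instruction_prompt() -> str:
    """Step 1 prompt: nothing but the pre-query template, so the model's
    completion is itself a user instruction."""
    return PRE_QUERY

def extract_instruction(completion: str, stop_token: str = "<|eot_id|>") -> str:
    """Keep the completion up to the end-of-turn token: that span is the
    synthesized instruction."""
    return completion.split(stop_token, 1)[0].strip()

def response_prompt(instruction: str) -> str:
    """Step 2 prompt: wrap the synthesized instruction in the full template
    to elicit the paired assistant response."""
    return PRE_QUERY + instruction + POST_QUERY
```

In practice both prompts would be sent to an aligned model (e.g., via an inference engine such as vLLM); the sketch above only shows the template manipulation that makes "prompting with nothing" produce instruction-response pairs.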
Open Source Code | Yes | https://magpie-align.github.io/ and https://hf.co/magpie-align
Open Datasets | Yes | Links to the MAGPIE datasets are provided in the text (Table 4 caption, Appendix A) and at https://hf.co/magpie-align (header). The paper also extensively uses and compares against public datasets such as ShareGPT (Chiang et al., 2023), WildChat (Zhao et al., 2024), Evol-Instruct (Xu et al., 2023a), UltraChat (Ding et al., 2023), OpenHermes (Teknium, 2023a;b), GenQA (Chen et al., 2024), and Tulu-V2-Mix (Ivison et al., 2023).
Dataset Splits | No | The paper states the total number of conversations used for supervised fine-tuning (e.g., 300K for MAGPIE-Air and MAGPIE-Pro) and preference optimization (100K), but does not explicitly describe how these datasets were split into training, validation, and test sets. Evaluation relies on external benchmarks rather than internal splits of the paper's own data.
Hardware Specification | Yes | We perform experiments on a server with four NVIDIA A100-SXM4-80GB GPUs, an AMD EPYC 7763 64-core processor, and 512 GB of RAM, using the vLLM inference framework (Kwon et al., 2023).
Software Dependencies | No | The paper mentions several software components, including the vLLM inference framework, the tiktoken library, Axolotl, and the Alignment Handbook, but does not provide version numbers for them, which are required for reproducible software dependencies.
Experiment Setup | Yes | Table 9 and Table 10 detail hyperparameters for supervised fine-tuning and preference tuning, respectively. For instance, supervised fine-tuning uses a learning rate of 2 × 10^-5, 2 epochs, a per-device batch size of 1, and an AdamW optimizer with βs = (0.9, 0.999) and ε = 10^-8, along with a cosine learning-rate scheduler and 100 warmup steps.
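The reported schedule (peak learning rate 2 × 10^-5, cosine scheduler, 100 warmup steps) can be made concrete with a minimal sketch. This is a hedged re-implementation, not the authors' training code: the linear warmup shape, the decay floor of 0, and the total step count are illustrative assumptions, since only the warmup length and peak rate are stated in the paper.

```python
import math

# Illustrative warmup + cosine-decay schedule matching the reported SFT
# hyperparameters: peak LR 2e-5, 100 warmup steps. The linear warmup shape
# and the decay-to-zero floor are assumptions, not stated in the paper.
PEAK_LR = 2e-5
WARMUP_STEPS = 100

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from ~0 up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from the peak down toward 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

A usage check: at step 99 the rate reaches the peak (end of warmup), and it then decays monotonically, e.g. `lr_at(500, 1000) < lr_at(100, 1000)`.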