Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a comprehensive analysis of the MAGPIE-generated data. To compare MAGPIE-generated data with other public instruction datasets (e.g., ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, Tulu-V2-Mix, GenQA), we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that using MAGPIE solely for supervised fine-tuning (SFT) can surpass the performance of previous public datasets utilized for both SFT and preference optimization... |
| Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI |
| Pseudocode | No | The paper describes the MAGPIE pipeline in text within Section 2.1 and illustrates it with a diagram in Figure 1, but does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://magpie-align.github.io/ https://hf.co/magpie-align |
| Open Datasets | Yes | Links to the MAGPIE datasets are provided in the text (Table 4 caption, Appendix A), and at https://hf.co/magpie-align (header). The paper also extensively uses and compares against known public datasets such as ShareGPT (Chiang et al., 2023), WildChat (Zhao et al., 2024), Evol-Instruct (Xu et al., 2023a), UltraChat (Ding et al., 2023), OpenHermes (Teknium, 2023a;b), GenQA (Chen et al., 2024), and Tulu-V2-Mix (Ivison et al., 2023). |
| Dataset Splits | No | The paper states the total number of conversations used for supervised fine-tuning (e.g., 300K for MAGPIE-Air and MAGPIE-Pro) and preference optimization (100K), but does not explicitly describe how these datasets were split into training, validation, and test sets. Evaluation relies on external benchmarks rather than held-out splits of its own data. |
| Hardware Specification | Yes | We perform experiments on a server with four NVIDIA A100-SXM4-80GB GPUs, an AMD EPYC 7763 64-Core Processor, and 512 GB of RAM, using the vLLM inference framework (Kwon et al., 2023). |
| Software Dependencies | No | The paper mentions several software components, including the vLLM inference framework, the tiktoken library, Axolotl, and the Alignment Handbook, but does not provide specific version numbers for them, which would be needed for fully reproducible software dependencies. |
| Experiment Setup | Yes | Table 9 and Table 10 detail specific hyperparameters for supervised fine-tuning and preference tuning, respectively. For instance, supervised fine-tuning uses a Learning Rate of 2 x 10^-5, 2 Epochs, a Per-device Batch Size of 1, and an AdamW optimizer with βs = (0.9, 0.999) and ϵ = 10^-8, along with a cosine learning rate scheduler and 100 Warmup Steps. |
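Although the paper presents no pseudocode (see the Pseudocode row above), the self-synthesis trick it describes in Section 2.1 is simple to sketch: feed an aligned chat model only the portion of its chat template that precedes a user turn, and let the model "autocomplete" a plausible user instruction. The sketch below builds the Llama-3-Instruct pre-query prefix; the commented vLLM call is illustrative, not the paper's exact code.

```python
from typing import Optional

def pre_query_template(system_prompt: Optional[str] = None) -> str:
    """Build a Llama-3-Instruct prompt prefix that ends exactly where a
    user message would begin, so open-ended sampling from this prefix
    continues as a synthetic user query (the Magpie-style trick)."""
    parts = ["<|begin_of_text|>"]
    if system_prompt is not None:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>"
        )
    # Open the user turn but leave its content empty for the model to fill.
    parts.append("<|start_header_id|>user<|end_header_id|>\n\n")
    return "".join(parts)

if __name__ == "__main__":
    prompt = pre_query_template()
    print(repr(prompt))
    # Actually sampling instructions requires a GPU and model weights, e.g.:
    # from vllm import LLM, SamplingParams
    # llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    # outs = llm.generate([prompt] * 8, SamplingParams(temperature=1.0, max_tokens=256))
```

In the full pipeline the paper describes, the sampled instruction is then appended to the template and the model is queried again to produce the corresponding response.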
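The Table 9 schedule (peak LR 2 x 10^-5, 100 warmup steps, cosine scheduler) can be reproduced with the common linear-warmup-then-cosine-decay rule; the paper does not give the exact scheduler code, so this sketch assumes the behavior of the standard Hugging Face cosine-with-warmup schedule.

```python
import math

# Assumed constants from Table 9 of the paper.
PEAK_LR = 2e-5
WARMUP_STEPS = 100

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step: linear warmup to PEAK_LR
    over WARMUP_STEPS, then cosine decay to zero by total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

if __name__ == "__main__":
    total = 1000  # hypothetical total optimizer steps, for illustration only
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at(s, total))
```

The total step count here is hypothetical; in practice it follows from dataset size, batch size, and the 2 training epochs reported in Table 9.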