Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform a comprehensive analysis of the MAGPIE-generated data. To compare MAGPIE-generated data with other public instruction datasets (e.g., ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, Tulu-V2-Mix, GenQA), we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that using MAGPIE data for supervised fine-tuning (SFT) alone can surpass the performance of previous public datasets utilized for both SFT and preference optimization...
Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI
Pseudocode | No | The paper describes the MAGPIE pipeline in prose in Section 2.1 and illustrates it with a diagram in Figure 1, but does not present structured pseudocode or algorithm blocks.
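Since the paper describes the pipeline only in prose, the first-turn generation step can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the idea is that an aligned chat model given only its pre-query template will autocomplete a user instruction, which is then wrapped in the full template to elicit the paired response. The Llama-3 template strings below are assumed from the standard chat format.

```python
# Sketch of MAGPIE-style self-synthesis (illustrative, not the authors' code).
# Template strings assume the standard Llama-3 chat format.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def instruction_prompt() -> str:
    """Step 1 prompt: nothing but the pre-query template, so the model's
    completion is itself a user instruction."""
    return PRE_QUERY

def extract_instruction(completion: str, stop_token: str = "<|eot_id|>") -> str:
    """Keep the completion up to the end-of-turn token: that span is the
    synthesized instruction."""
    return completion.split(stop_token, 1)[0].strip()

def response_prompt(instruction: str) -> str:
    """Step 2 prompt: wrap the synthesized instruction in the full template
    to elicit the paired assistant response."""
    return PRE_QUERY + instruction + POST_QUERY
```

In practice both prompts would be sent to an aligned model (e.g., via an inference engine such as vLLM); the sketch above only shows the template manipulation that makes "prompting with nothing" produce instruction-response pairs.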
Open Source Code | Yes | https://magpie-align.github.io/ and https://hf.co/magpie-align
Open Datasets | Yes | Links to the MAGPIE datasets are provided in the text (Table 4 caption, Appendix A) and at https://hf.co/magpie-align (header). The paper also extensively uses and compares against public datasets such as ShareGPT (Chiang et al., 2023), WildChat (Zhao et al., 2024), Evol-Instruct (Xu et al., 2023a), UltraChat (Ding et al., 2023), OpenHermes (Teknium, 2023a;b), GenQA (Chen et al., 2024), and Tulu-V2-Mix (Ivison et al., 2023).
Dataset Splits | No | The paper states the total number of conversations used for supervised fine-tuning (e.g., 300K for MAGPIE-Air and MAGPIE-Pro) and preference optimization (100K), but does not explicitly describe how these datasets were split into training, validation, and test sets. Evaluation relies on external benchmarks rather than internal splits of the paper's own data.
Hardware Specification | Yes | We perform experiments on a server with four NVIDIA A100-SXM4-80GB GPUs, an AMD EPYC 7763 64-core processor, and 512 GB of RAM, using the vLLM inference framework (Kwon et al., 2023).
Software Dependencies | No | The paper mentions several software components, including the vLLM inference framework, the tiktoken library, Axolotl, and the Alignment Handbook, but does not provide version numbers for them, which are required for reproducible software dependencies.
Experiment Setup | Yes | Table 9 and Table 10 detail hyperparameters for supervised fine-tuning and preference tuning, respectively. For instance, supervised fine-tuning uses a learning rate of 2 × 10^-5, 2 epochs, a per-device batch size of 1, and an AdamW optimizer with βs = (0.9, 0.999) and ε = 10^-8, along with a cosine learning-rate scheduler and 100 warmup steps.
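The reported schedule (peak learning rate 2 × 10^-5, cosine scheduler, 100 warmup steps) can be made concrete with a minimal sketch. This is a hedged re-implementation, not the authors' training code: the linear warmup shape, the decay floor of 0, and the total step count are illustrative assumptions, since only the warmup length and peak rate are stated in the paper.

```python
import math

# Illustrative warmup + cosine-decay schedule matching the reported SFT
# hyperparameters: peak LR 2e-5, 100 warmup steps. The linear warmup shape
# and the decay-to-zero floor are assumptions, not stated in the paper.
PEAK_LR = 2e-5
WARMUP_STEPS = 100

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from ~0 up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from the peak down toward 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

A usage check: at step 99 the rate reaches the peak (end of warmup), and it then decays monotonically, e.g. `lr_at(500, 1000) < lr_at(100, 1000)`.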