Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

Authors: Xiaochuan Li, Zichun Yu, Chenyan Xiong

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments with Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and MT-Bench demonstrate that Montessori-Instruct significantly outperforms standard synthesis methods by 18.35% and 46.24% relatively. Our method also beats data synthesized by a stronger teacher model, GPT-4o. Further analysis confirms the benefits of the teacher's learning to generate more influential training data in the student's improved learning, the advantages of local data influence in accurately measuring student preferences, and the robustness of Montessori-Instruct across different student models.
Researcher Affiliation Academia Xiaochuan Li, Zichun Yu, Chenyan Xiong. School of Software, Tsinghua University; Language Technologies Institute, Carnegie Mellon University.
Pseudocode No The paper describes methods and processes in narrative text and refers to equations and figures for illustration, but it does not contain explicit pseudocode blocks or algorithms.
Open Source Code Yes Our code and data are open-sourced at https://github.com/cxcscmu/Montessori-Instruct.
Open Datasets Yes We merge the text in instruction and input fields of the Alpaca GPT-4 dataset (Taori et al., 2023), consisting of 52K entries, to create our seed pool. We use Alpaca Eval 2.0 (Dubois et al., 2024) as the in-domain evaluation to assess the model's instruction-following ability. Additionally, we evaluate the model's generalization performance across six out-of-domain tasks, including MT-Bench (Zheng et al., 2024), ARC-Challenge (25-shot) (Clark et al., 2018), GSM8K (8-shot) (Cobbe et al., 2021), HellaSwag (8-shot) (Zellers et al., 2019), GPQA (0-shot) (Rein et al., 2023), and MMLU (0-shot) (Hendrycks et al., 2020).
Dataset Splits No The paper states that 10K instruction-response pairs are synthesized to train the student models, and that Alpaca Eval 2.0 is used as 'in-domain evaluation' and MT-Bench as 'out-of-domain' evaluation. However, it does not explicitly specify how the 10K synthetic training data itself is split (e.g., into training/validation/test subsets), nor does it provide details on the splits for the evaluation datasets beyond referencing them as standard benchmarks.
Hardware Specification Yes In our experiments, we utilize 8 H100 GPUs to accelerate this process.
Software Dependencies No We use the Hugging Face TRL codebase (von Werra et al., 2020) to perform both full-parameter finetuning and direct preference optimization. For the 8B model, we employ the Hugging Face Accelerate codebase (Gugger et al., 2022) to facilitate FSDP training (Zhao et al., 2023). While these frameworks are mentioned, specific version numbers (e.g., TRL 0.7.1, Accelerate 0.20.0) are not provided in the text.
Experiment Setup Yes We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a WSD scheduler (Hu et al., 2024a). For SFT, the 8B model utilizes a maximum learning rate of 5e-6, while the 1B model uses 1e-5. The WSD scheduler is configured with a warmup ratio of 0.1, a stable ratio of 0.5, and a decay ratio of 0.4, with the learning rate decaying to one-thousandth of the maximum. Training runs for 1 epoch with a batch size of 32 and dropout of 0. For DPO, we use a learning rate of 1e-6, set β to 0.1, and use a batch size of 2, while other parameters remain the same as in SFT. All the parameters introduced in this section are summarized in Table 4.
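The reported WSD (warmup-stable-decay) schedule can be sketched as a standalone function. This is a minimal illustration assuming linear warmup and linear decay, which the paper does not specify; the function name `wsd_lr` and the interpolation shapes are assumptions, while the ratios (0.1 / 0.5 / 0.4), the maximum learning rates, and the floor of one-thousandth of the maximum come from the quoted setup.

```python
def wsd_lr(step, total_steps, max_lr,
           warmup_ratio=0.1, stable_ratio=0.5, min_lr_factor=1e-3):
    """Warmup-Stable-Decay learning-rate schedule (illustrative sketch).

    Ramps linearly up to max_lr over the warmup phase, holds it constant
    through the stable phase, then decays linearly to max_lr * min_lr_factor
    (one-thousandth of the maximum, per the paper's description).
    """
    warmup_steps = int(total_steps * warmup_ratio)
    stable_steps = int(total_steps * stable_ratio)
    decay_steps = total_steps - warmup_steps - stable_steps

    if step < warmup_steps:
        # linear warmup: step 0 already takes a small nonzero LR
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # stable plateau at the maximum learning rate
        return max_lr
    # linear decay from max_lr down to max_lr * min_lr_factor
    progress = (step - warmup_steps - stable_steps) / max(decay_steps - 1, 1)
    return max_lr - (max_lr - max_lr * min_lr_factor) * progress
```

With `total_steps=100` and `max_lr=5e-6` (the 8B SFT setting), the schedule warms up over the first 10 steps, stays at 5e-6 through step 59, and reaches 5e-9 at the final step. In practice this would be wired into a trainer via something like PyTorch's `LambdaLR`.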