Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Authors: Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions from mathematical reasoning, coding, academic exams, logical reasoning to general instruction following without task-specific training data.
Researcher Affiliation | Collaboration | Microsoft; Singapore University of Technology and Design; Peking University; Tsinghua University; The Chinese University of Hong Kong, Shenzhen; The Chinese University of Hong Kong
Pseudocode | Yes | As shown in Algorithm 1, we first build a taxonomy of human knowledge and capabilities using frontier LLMs (i.e., GPT-4) and human verification.
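The pipeline that Algorithm 1 describes (taxonomy of fields → subjects → syllabi → instructions) can be sketched roughly as below. This is an illustrative assumption, not the paper's actual implementation: `query_llm`, the prompt strings, and the canned responses are all stand-ins for real GPT-4 calls and prompt engineering.

```python
# Hypothetical sketch of the GLAN generation pipeline (Algorithm 1).
# query_llm stands in for a frontier LLM (e.g., GPT-4); here it is
# stubbed with canned responses, since the real prompts and response
# parsing are not given in this report.

def query_llm(prompt: str) -> list[str]:
    """Stub: a real implementation would call an LLM API and parse its output."""
    canned = {
        "taxonomy": ["Natural Sciences", "Humanities"],
        "subjects": ["Mathematics", "Physics"],
        "syllabus": ["Unit 1: Limits", "Unit 2: Derivatives"],
        "questions": ["Compute the derivative of x^2."],
    }
    for key, responses in canned.items():
        if key in prompt:
            return responses
    return []

def generate_instructions() -> list[str]:
    instructions = []
    # 1. Build a taxonomy of human knowledge and capabilities.
    for field in query_llm("List top-level fields (taxonomy)."):
        # 2. Decompose each field into subjects.
        for subject in query_llm(f"List subjects within {field} (subjects)."):
            # 3. Draft a syllabus of units / key concepts per subject.
            for unit in query_llm(f"Design a syllabus for {subject} (syllabus)."):
                # 4. Sample homework questions from each syllabus unit.
                instructions.extend(
                    query_llm(f"Write homework questions for {unit} (questions).")
                )
    return instructions

# With the canned stub: 2 fields x 2 subjects x 2 units x 1 question = 8.
print(len(generate_instructions()))  # → 8
```

The hierarchical decomposition is what lets the method cover many disciplines without task-specific seed data; only the leaf-level step emits actual instruction-response pairs.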
Open Source Code | No | The paper neither states that source code for the described methodology will be released nor links to a code repository. The provided URL "https://aka.ms/GeneralAI" is a project page, not a code repository.
Open Datasets | Yes | Mathematical Reasoning: Mathematics is a common subject in many different disciplines. Hence, it is necessary to test the math reasoning ability of GLAN. We choose the two popular benchmarks for evaluation (i.e., GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b)).
Dataset Splits | Yes | GSM8K (Cobbe et al., 2021) is a high-quality math problem dataset that measures basic multi-step mathematical reasoning ability. It contains around 7K problems for training and 1K problems for testing. MATH (Hendrycks et al., 2021b) is a challenging math dataset that contains mathematics competition-level problems from AMC, AIME, etc. The 7.5K training and 5K test problems cover seven math subjects...
Hardware Specification | Yes | The training requires approximately 8 days using 32 A100 GPUs.
Software Dependencies | No | The paper mentions using GPT-4 and GPT-3.5 for data generation and Mistral 7B as the base model, but does not specify version numbers for key software libraries or frameworks (e.g., Python, PyTorch, CUDA) required to replicate the experiments.
Experiment Setup | Yes | We train our model for 3 epochs with a learning rate of 3e-6. The batch size is set to approximately 512 instruction-response pairs, and we employ a dynamic batch size to keep the total number of tokens per batch constant. We use a cosine learning rate schedule with a linear warm-up of 1000 steps, and the final learning rate decays to 0.
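The reported schedule (linear warm-up over 1000 steps to a peak of 3e-6, then cosine decay to 0) can be sketched as a small function. Note that `total_steps` here is an assumed value for illustration; the paper's actual step count depends on dataset size and the dynamic batching.

```python
import math

def learning_rate(step: int, peak_lr: float = 3e-6,
                  warmup_steps: int = 1000, total_steps: int = 10_000) -> float:
    """Linear warm-up to peak_lr, then cosine decay to 0.

    A minimal sketch of the schedule described in the paper;
    total_steps = 10_000 is an assumption, not a reported value.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(learning_rate(500))     # mid warm-up: half of peak (1.5e-6)
print(learning_rate(1000))    # end of warm-up: peak_lr (3e-6)
print(learning_rate(10_000))  # end of training: decayed to 0
```

In practice this is what optimizers expose as a warm-up + cosine scheduler (e.g., via a per-step LR lambda); the sketch just makes the reported hyperparameters concrete.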