OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Authors: Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that... Finetuning the Llama-3.1-8B-Base using OpenMathInstruct-2 outperforms Llama-3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% → 67.8%). |
| Researcher Affiliation | Industry | NVIDIA |
| Pseudocode | No | The paper describes methodologies in prose and provides prompt templates in the appendix, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | to accelerate the open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.1 ... Code is available at https://github.com/NVIDIA/NeMo-Skills |
| Open Datasets | Yes | we create the OpenMathInstruct-2 dataset which consists of 14M question-solution pairs... Data and models are available at https://huggingface.co/collections/nvidia/openmath-2-66fb142317d86400783d2c7b. The paper also uses the public MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021) datasets. |
| Dataset Splits | Yes | For these ablation experiments, we use the 1K validation split created from MATH (Hendrycks et al., 2021) training set by Toshniwal et al. (2024). The remaining 6.5K MATH training set problems are used to create the SFT dataset. ... In our setup, we use the test sets of four evaluation benchmarks, namely GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), AMC 2023 (AMC 2023, 2023), and AIME 2024 (AIME 2024, 2024). For the 8B model, we train the model on 1M, 2M, and 5M fair downsampled versions of OpenMathInstruct-2 to understand the impact of the data scaling. |
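The "fair downsampled" subsets quoted above are not fully specified in the excerpt; one plausible reading is that each question keeps a proportional share of its solutions, so no single question dominates the smaller subsets. A minimal sketch under that assumption (the function name `fair_downsample` and the per-question rounding rule are hypothetical, not from the paper):

```python
import random
from collections import defaultdict

def fair_downsample(pairs, target_size, seed=0):
    """Downsample (question, solution) pairs so each question keeps an
    approximately proportional share of its solutions.

    Hypothetical reading of the paper's 'fair downsampling'; the exact
    procedure is not given in the excerpt.
    """
    rng = random.Random(seed)
    by_question = defaultdict(list)
    for question, solution in pairs:
        by_question[question].append(solution)

    keep_frac = target_size / len(pairs)
    sampled = []
    for question, solutions in by_question.items():
        # Keep at least one solution per question, proportional otherwise.
        k = max(1, round(len(solutions) * keep_frac))
        for solution in rng.sample(solutions, min(k, len(solutions))):
            sampled.append((question, solution))
    return sampled
```

For example, halving a dataset with two questions of 10 solutions each keeps about 5 solutions per question, rather than risking one question losing all of its solutions.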
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run its experiments, only mentioning the LLM models used. |
| Software Dependencies | No | The paper mentions the AdamW optimizer, but does not provide specific version numbers for key software components or libraries (e.g., Python, PyTorch, CUDA) required for replication. |
| Experiment Setup | Yes | For SFT, the model is trained for 4 epochs, with a batch size of 256, using the AdamW optimizer (Loshchilov and Hutter, 2019) with a constant learning rate of 5e-6 and a weight decay of 1e-2. ... All the models are trained with a batch size of 512, using the AdamW optimizer (Loshchilov and Hutter, 2019) with a constant learning rate of 2e-5 and a weight decay of 1e-2. ... The models are trained for 2 epochs, and we save 6 equally spaced checkpoints during the training runs, which are averaged to create the final model. |
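The final-model recipe quoted above averages 6 equally spaced checkpoints. The core operation is an elementwise mean over matching parameters, which can be sketched as follows (using plain Python lists in place of framework tensors; real training code would average `torch.Tensor` state dicts):

```python
def average_checkpoints(state_dicts):
    """Elementwise average of parameters across checkpoints.

    Each state dict maps a parameter name to a list of floats; the paper's
    recipe averages 6 equally spaced checkpoints this way to build the
    final model (a sketch, not the authors' actual implementation).
    """
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        length = len(state_dicts[0][name])
        averaged[name] = [
            sum(sd[name][i] for sd in state_dicts) / n
            for i in range(length)
        ]
    return averaged
```

With a real framework the same idea applies per tensor; averaging checkpoints from a single run is a cheap way to smooth out noise from the final optimization steps.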