OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Authors: Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that... Finetuning the Llama-3.1-8B-Base using OpenMathInstruct-2 outperforms Llama-3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% → 67.8%). |
| Researcher Affiliation | Industry | NVIDIA |
| Pseudocode | No | The paper describes methodologies in prose and provides prompt templates in the appendix, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | to accelerate the open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.1 ... Code is available at https://github.com/NVIDIA/NeMo-Skills |
| Open Datasets | Yes | we create the OpenMathInstruct-2 dataset which consists of 14M question-solution pairs... Data and models are available at https://huggingface.co/collections/nvidia/openmath-2-66fb142317d86400783d2c7b. The paper also uses the public MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021) datasets. |
| Dataset Splits | Yes | For these ablation experiments, we use the 1K validation split created from MATH (Hendrycks et al., 2021) training set by Toshniwal et al. (2024). The remaining 6.5K MATH training set problems are used to create the SFT dataset. ... In our setup, we use the test sets of four evaluation benchmarks, namely GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), AMC 2023 (AMC 2023, 2023), and AIME 2024 (AIME 2024, 2024). For the 8B model, we train the model on 1M, 2M, and 5M fair downsampled versions of OpenMathInstruct-2 to understand the impact of the data scaling. |
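The "fair downsampled" subsets quoted above are not fully specified in the excerpt; one plausible reading is that each question keeps a proportional share of its solutions, so no single question dominates the smaller subsets. A minimal sketch under that assumption (the function name `fair_downsample` and the per-question rounding rule are hypothetical, not from the paper):

```python
import random
from collections import defaultdict

def fair_downsample(pairs, target_size, seed=0):
    """Downsample (question, solution) pairs so each question keeps an
    approximately proportional share of its solutions.

    Hypothetical reading of the paper's 'fair downsampling'; the exact
    procedure is not given in the excerpt.
    """
    rng = random.Random(seed)
    by_question = defaultdict(list)
    for question, solution in pairs:
        by_question[question].append(solution)

    keep_frac = target_size / len(pairs)
    sampled = []
    for question, solutions in by_question.items():
        # Keep at least one solution per question, proportional otherwise.
        k = max(1, round(len(solutions) * keep_frac))
        for solution in rng.sample(solutions, min(k, len(solutions))):
            sampled.append((question, solution))
    return sampled
```

For example, halving a dataset with two questions of 10 solutions each keeps about 5 solutions per question, rather than risking one question losing all of its solutions.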
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run its experiments, only mentioning the LLM models used. |
| Software Dependencies | No | The paper mentions the AdamW optimizer, but does not provide specific version numbers for key software components or libraries (e.g., Python, PyTorch, CUDA) required for replication. |
| Experiment Setup | Yes | For SFT, the model is trained for 4 epochs, with a batch size of 256, using the AdamW optimizer (Loshchilov and Hutter, 2019) with a constant learning rate of 5e-6 and a weight decay of 1e-2. ... All the models are trained with a batch size of 512, using the AdamW optimizer (Loshchilov and Hutter, 2019) with a constant learning rate of 2e-5 and a weight decay of 1e-2. ... The models are trained for 2 epochs, and we save 6 equally spaced checkpoints during the training runs, which are averaged to create the final model. |
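The final-model recipe quoted above averages 6 equally spaced checkpoints. The core operation is an elementwise mean over matching parameters, which can be sketched as follows (using plain Python lists in place of framework tensors; real training code would average `torch.Tensor` state dicts):

```python
def average_checkpoints(state_dicts):
    """Elementwise average of parameters across checkpoints.

    Each state dict maps a parameter name to a list of floats; the paper's
    recipe averages 6 equally spaced checkpoints this way to build the
    final model (a sketch, not the authors' actual implementation).
    """
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        length = len(state_dicts[0][name])
        averaged[name] = [
            sum(sd[name][i] for sd in state_dicts) / n
            for i in range(length)
        ]
    return averaged
```

With a real framework the same idea applies per tensor; averaging checkpoints from a single run is a cheap way to smooth out noise from the final optimization steps.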