MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We validate the effectiveness of MathCode-Pile on four popular base models: Llama-3-8B, DeepSeekMath-7B, Mistral-7B, and CodeLlama-7B, significantly improving their performance on five representative mathematical benchmarks. We evaluate the MathCoder2 models on five representative datasets: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), SAT-Math (Azerbayev et al., 2024), OCW (Lewkowycz et al., 2022), and MMLU-Math (Hendrycks et al., 2021a). Finally, we conduct ablation studies to analyze the impact of each component of the dataset.
Researcher Affiliation Academia Zimu Lu1, Aojun Zhou1, Houxing Ren1, Ke Wang1, Weikang Shi1, Junting Pan1,2, Mingjie Zhan1, Hongsheng Li1,2 1Multimedia Laboratory (MMLab), The Chinese University of Hong Kong 2CPII under InnoHK
Pseudocode No The paper describes a data processing pipeline with textual steps and a diagram (Figure 1), and provides prompts used for model instruction. However, it does not include a clearly labeled pseudocode block or algorithm section.
Open Source Code Yes All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline.
Open Datasets Yes We start with the OpenWebMath (Paster et al., 2023) dataset, which contains mathematical web pages sourced from Common Crawl. We collect synthetic data from various open-source repositories on Hugging Face, including datasets like Education-College-Students, Maths-College, and synthetic math books from Matrix (Zhang et al., 2024). We collect code from Python and Jupyter files within the StarCoderData dataset (Li et al., 2023). We evaluate the MathCoder2 models on five representative datasets: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), SAT-Math (Azerbayev et al., 2024), OCW (Lewkowycz et al., 2022), and MMLU-Math (Hendrycks et al., 2021a).
Dataset Splits No The paper describes using existing benchmarks with specific shot settings (e.g., "4-shot prompt") and a default zero-shot setting for evaluation, but it does not specify explicit training/validation/test splits for its own MathCode-Pile dataset or for the fine-tuning datasets NuminaMath-CoT and NuminaMath-TIR.
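The shot settings mentioned above refer to how many worked examples are prepended to each evaluation query (zero-shot means none). A minimal sketch of this prompt construction, using hypothetical exemplars and formatting rather than the paper's actual prompts:

```python
# Minimal sketch of few-shot prompt construction for benchmark evaluation.
# The exemplar problems and the "Question:/Answer:" template are hypothetical,
# not the paper's actual prompts.

def build_prompt(exemplars, question, n_shot=4):
    """Prepend up to n_shot worked examples before the target question."""
    parts = []
    for q, a in exemplars[:n_shot]:
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

exemplars = [
    ("What is 2 + 3?", "2 + 3 = 5. The answer is 5."),
    ("What is 7 - 4?", "7 - 4 = 3. The answer is 3."),
]

# Zero-shot: only the target question; few-shot: exemplars first.
zero_shot = build_prompt(exemplars, "What is 6 * 7?", n_shot=0)
few_shot = build_prompt(exemplars, "What is 6 * 7?", n_shot=2)
```

A 4-shot GSM8K prompt is built the same way, just with four exemplar solutions drawn from the benchmark's training split.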
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for training or experimentation.
Software Dependencies No The paper mentions using "a fastText classifier" and an "open-source library for training" (referring to fasttext.cc) but does not specify version numbers for fastText or any other key software dependencies used in the methodology.
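The fastText classifier referenced here is used to filter math-related web pages from the raw corpus. The snippet below is a pure-Python stand-in that only illustrates the filtering step; the paper's actual pipeline trains a supervised fastText model (library version unspecified), and the marker set and threshold here are hypothetical:

```python
# Stand-in for a learned math-content filter such as the paper's fastText
# classifier. This hypothetical keyword-score heuristic only illustrates
# the role of the filter in the pipeline; it is not the paper's method.

MATH_MARKERS = {"theorem", "lemma", "equation", "integral", "proof",
                "matrix", "polynomial", "derivative"}

def math_score(text: str) -> float:
    """Fraction of whitespace-separated tokens that look mathematical."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,;:()") in MATH_MARKERS)
    return hits / len(tokens)

def keep_page(text: str, threshold: float = 0.05) -> bool:
    """Keep a page if enough of its tokens are math markers."""
    return math_score(text) >= threshold
```

A real fastText filter would instead be trained on labeled positive (math) and negative (general web) pages and applied with the model's predict call, which is why the library version matters for reproducibility.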
Experiment Setup Yes MathCoder2-Llama-3-8B is trained for 3 epochs with a global batch size of 4 million tokens and an 8192-token context length. MathCoder2-DeepSeekMath, MathCoder2-Mistral, and MathCoder2-CodeLlama are each trained for 3 epochs with a global batch size of 4 million tokens and a 4096-token context length. We train each corpus for 3 epochs with a global batch size of 2 million tokens and a 4096-token context length, since we observe that the model's performance usually saturates around 3 epochs.
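The stated hyperparameters imply concrete step counts, which is useful when checking a reproduction. A short sanity-check calculation, assuming "4 million" means exactly 4,000,000 tokens and using a hypothetical corpus size (the paper's exact token count is not quoted here):

```python
# Sanity-check arithmetic for the stated Llama-3-8B training configuration:
# global batch of 4M tokens, 8192-token context, 3 epochs.
# Assumption: "4 million tokens" is taken as exactly 4_000_000.
GLOBAL_BATCH_TOKENS = 4_000_000
CONTEXT_LEN = 8192
EPOCHS = 3

# Full-length sequences packed into one optimizer step (rounded down).
seqs_per_step = GLOBAL_BATCH_TOKENS // CONTEXT_LEN

# Hypothetical corpus size, used only to illustrate the step-count math.
corpus_tokens = 19_200_000_000
steps_per_epoch = corpus_tokens // GLOBAL_BATCH_TOKENS
total_steps = steps_per_epoch * EPOCHS
```

The same arithmetic with a 4096-token context (the DeepSeekMath, Mistral, and CodeLlama settings) doubles the sequences per step while leaving the token-based step count unchanged, which is one reason token-denominated batch sizes make configurations comparable across context lengths.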