MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

Authors: Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with different conversational settings reveal that incorporating knowledge gaps between dialog participants is essential for generating high-quality math data. We further identify an effective way to format and integrate synthetic and raw data during pretraining to maximize the gain in mathematical reasoning, emphasizing the need to restructure raw data rather than use it as-is. Compared to pretraining just on raw data, a model pretrained on MIND-OWM shows a significant boost in mathematical reasoning (GSM8K: +13.42%, MATH: +2.30%), including superior performance on specialized knowledge (MMLU: +4.55%, MMLU-STEM: +4.28%) and general-purpose reasoning tasks (GENERAL REASONING: +2.51%).
Researcher Affiliation | Collaboration | Syeda Nahida Akter2, Shrimai Prabhumoye1,3, John Kamalu1, Sanjeev Satheesh1, Eric Nyberg2, Mostofa Patwary1, Mohammad Shoeybi1, Bryan Catanzaro1; NVIDIA1, Carnegie Mellon University2, Boston University3; EMAIL, EMAIL
Pseudocode | No | The paper describes the MIND methodology using natural language and mathematical equations (e.g., "s_{i,j} = M(p_i || r_j)" and a block diagram in Figure 2), but it does not include a distinct, structured pseudocode block or algorithm section.
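Since the paper states the generation step only as the equation s_{i,j} = M(p_i || r_j), a minimal sketch of what that step could look like is given below. All names here (`StubLM`, `generate_dialogue`) are hypothetical illustrations, not the authors' implementation; `M` is the generator LLM, `p_i` a conversational prompt, and `r_j` a raw OpenWebMath document.

```python
class StubLM:
    """Hypothetical stand-in for the generator LLM M (illustration only)."""
    def generate(self, text, temperature=1.0, top_p=0.9, max_tokens=4096):
        # A real model would sample a full multi-turn dialogue here.
        return "Teacher: Let's work through this.\nStudent: Okay!\n[seed: " + text[:40] + "...]"

def generate_dialogue(model, prompt, raw_text):
    """Sketch of s_{i,j} = M(p_i || r_j): concatenate the conversational
    prompt p_i with the raw document r_j and sample a synthetic dialogue."""
    return model.generate(prompt + "\n\n" + raw_text)

dialogue = generate_dialogue(
    StubLM(),
    "Rewrite the following text as a teacher-student conversation:",
    "If x + 2 = 5, then x = 3.",
)
```

In the paper's setup the concatenated input-output budget is capped at 4096 tokens and sampling uses temperature 1.0 with top_p 0.9, which is mirrored in the stub's default arguments.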
Open Source Code | No | The paper refers to a tool used in their experiments: "We use the TensorRT-LLM toolkit to deploy large scale generation1. 1https://github.com/NVIDIA/TensorRT-LLM". This is a third-party tool, not the implementation code for the MIND methodology itself. There is no explicit statement or link indicating that the authors' own code for MIND is open-source or available.
Open Datasets | Yes | Specifically, using MIND, we generate synthetic conversations based on OpenWebMath (OWM), resulting in a new math corpus, MIND-OWM. ... To evaluate the effectiveness of S in pretraining, we conduct continuous pretraining on a base LLM, C, to minimize the computational costs associated with full pretraining. ... We choose OpenWebMath (Paster et al., 2023) as our seed corpus, R, which contains 14.7B tokens of high-quality mathematical web text. ... We consider a new seed corpus, MathPile (Wang et al., 2023), that consists of 9.3B tokens extracted from high-quality data sources such as arXiv papers, textbooks, Stack Exchange, Wikipedia, ProofWiki, and Common Crawl pages.
Dataset Splits | No | The paper describes a data *blend* for continuous pretraining: "blend D consists of a 2:1 ratio of OpenWebMath (33B tokens), either raw (R) or synthetic (S), and 13 snapshots of Common Crawl (17B tokens) (Rpt)". It also mentions evaluating on benchmarks in "0-shot" and "few-shot" manners. However, it does not provide explicit training/validation/test splits of its main pretraining data (OWM) for its own experimental evaluation.
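The stated 2:1 blend ratio can be checked against the reported token counts; the quick arithmetic below (variable names are illustrative, not from the paper) shows 33B:17B is approximately, though not exactly, 2:1, for a 50B-token blend in total.

```python
# Token counts reported in the paper's blend description.
owm_tokens = 33e9  # OpenWebMath portion, either raw (R) or synthetic (S)
cc_tokens = 17e9   # 13 Common Crawl snapshots (Rpt)

ratio = owm_tokens / cc_tokens       # ~1.94, i.e. roughly the stated 2:1
total = owm_tokens + cc_tokens       # 50B-token continuous-pretraining blend
```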
Hardware Specification | Yes | To prepare a base model, we pretrain a 7B LLM on our pretraining data blend till 700B tokens using 512 H100 80GB SXM5 GPUs.
Software Dependencies | No | The paper mentions several software components like "TensorRT-LLM toolkit", "NVIDIA's Megatron-LM repository", "AdamW optimizer", "Tiktoken tokenizer", "SwiGLU", "grouped query attention (GQA)", and "LM Eval Harness". However, it does not provide specific version numbers for any of these components, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | To generate conversation, we consider zero-shot prompting M, where we only pass a basic prompt (Appendix A.1) and the raw text. We sample conversations with temperature=1.0 and top_p=0.9, where the total number of input-output tokens is limited to 4096. ... During training, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95 and weight decay of 0.1. ... We set the maximum value of the learning rate to 3e-4, the minimum to 3e-6, and use a batch size of 6M tokens with a 4096 context length.
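The excerpt gives the maximum (3e-4) and minimum (3e-6) learning rates but not the schedule shape, so the sketch below assumes a standard cosine decay between those two endpoints; the warmup parameter and function name are likewise assumptions for illustration, not details from the paper.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-6, warmup=0):
    """Hypothetical schedule sketch: cosine decay from the reported
    lr_max=3e-4 down to lr_min=3e-6. The cosine shape and warmup
    are assumptions; the excerpt states only the two endpoints."""
    if step < warmup:
        # Linear warmup from 0 to lr_max (assumed, not stated).
        return lr_max * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

With a 6M-token batch, training till 700B tokens corresponds to roughly 700e9 / 6e6 ≈ 117K optimizer steps, which would set `total_steps` here.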