MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

Authors: Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with different conversational settings reveal that incorporating knowledge gaps between dialog participants is essential for generating high-quality math data. We further identify an effective way to format and integrate synthetic and raw data during pretraining to maximize the gain in mathematical reasoning, emphasizing the need to restructure raw data rather than use it as-is. Compared to pretraining just on raw data, a model pretrained on MIND-OWM shows a significant boost in mathematical reasoning (GSM8K: +13.42%, MATH: +2.30%), including superior performance on specialized knowledge (MMLU: +4.55%, MMLU-STEM: +4.28%) and general-purpose reasoning tasks (GENERAL REASONING: +2.51%).
Researcher Affiliation | Collaboration | Syeda Nahida Akter2, Shrimai Prabhumoye1,3, John Kamalu1, Sanjeev Satheesh1, Eric Nyberg2, Mostofa Patwary1, Mohammad Shoeybi1, Bryan Catanzaro1; NVIDIA1, Carnegie Mellon University2, Boston University3; EMAIL, EMAIL
Pseudocode | No | The paper describes the MIND methodology using natural language and mathematical equations (e.g., "s_{i,j} = M(p_i || r_j)" and a block diagram in Figure 2), but it does not include a distinct, structured pseudocode block or algorithm section.
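Since the paper states the generation step only as the equation s_{i,j} = M(p_i || r_j), a minimal sketch of what that step could look like is given below. All names here (`StubLM`, `generate_dialogue`) are hypothetical illustrations, not the authors' implementation; `M` is the generator LLM, `p_i` a conversational prompt, and `r_j` a raw OpenWebMath document.

```python
class StubLM:
    """Hypothetical stand-in for the generator LLM M (illustration only)."""
    def generate(self, text, temperature=1.0, top_p=0.9, max_tokens=4096):
        # A real model would sample a full multi-turn dialogue here.
        return "Teacher: Let's work through this.\nStudent: Okay!\n[seed: " + text[:40] + "...]"

def generate_dialogue(model, prompt, raw_text):
    """Sketch of s_{i,j} = M(p_i || r_j): concatenate the conversational
    prompt p_i with the raw document r_j and sample a synthetic dialogue."""
    return model.generate(prompt + "\n\n" + raw_text)

dialogue = generate_dialogue(
    StubLM(),
    "Rewrite the following text as a teacher-student conversation:",
    "If x + 2 = 5, then x = 3.",
)
```

In the paper's setup the concatenated input-output budget is capped at 4096 tokens and sampling uses temperature 1.0 with top_p 0.9, which is mirrored in the stub's default arguments.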
Open Source Code | No | The paper refers to a tool used in their experiments: "We use the TensorRT-LLM toolkit to deploy large scale generation1. 1https://github.com/NVIDIA/TensorRT-LLM". This is a third-party tool, not the implementation code for the MIND methodology itself. There is no explicit statement or link indicating that the authors' own code for MIND is open-source or available.
Open Datasets | Yes | Specifically, using MIND, we generate synthetic conversations based on OpenWebMath (OWM), resulting in a new math corpus, MIND-OWM. ... To evaluate the effectiveness of S in pretraining, we conduct continuous pretraining on a base LLM, C, to minimize the computational costs associated with full pretraining. ... We choose OpenWebMath (Paster et al., 2023) as our seed corpus, R, which contains 14.7B tokens of high-quality mathematical web text. ... We consider a new seed corpus, MathPile (Wang et al., 2023), that consists of 9.3B tokens extracted from high-quality data sources such as arXiv papers, textbooks, Stack Exchange, Wikipedia, ProofWiki, and Common Crawl pages.
Dataset Splits | No | The paper describes a data *blend* for continuous pretraining: "blend D consists of a 2:1 ratio of OpenWebMath (33B tokens), either raw (R) or synthetic (S), and 13 snapshots of Common Crawl (17B tokens) (Rpt)". It also mentions evaluating on benchmarks in "0-shot" and "few-shot" manners. However, it does not provide explicit training/validation/test splits of its main pretraining data (OWM) for its own experimental evaluation.
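The stated 2:1 blend ratio can be checked against the reported token counts; the quick arithmetic below (variable names are illustrative, not from the paper) shows 33B:17B is approximately, though not exactly, 2:1, for a 50B-token blend in total.

```python
# Token counts reported in the paper's blend description.
owm_tokens = 33e9  # OpenWebMath portion, either raw (R) or synthetic (S)
cc_tokens = 17e9   # 13 Common Crawl snapshots (Rpt)

ratio = owm_tokens / cc_tokens       # ~1.94, i.e. roughly the stated 2:1
total = owm_tokens + cc_tokens       # 50B-token continuous-pretraining blend
```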
Hardware Specification | Yes | To prepare a base model, we pretrain a 7B LLM on our pretraining data blend till 700B tokens using 512 H100 80GB SXM5 GPUs.
Software Dependencies | No | The paper mentions several software components like "TensorRT-LLM toolkit", "NVIDIA's Megatron-LM repository", "AdamW optimizer", "Tiktoken tokenizer", "SwiGLU", "grouped query attention (GQA)", and "LM Eval Harness". However, it does not provide specific version numbers for any of these components, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | To generate conversation, we consider zero-shot prompting M, where we only pass a basic prompt (Appendix A.1) and the raw text. We sample conversations with temperature=1.0 and top_p=0.9, where the total number of input-output tokens is limited to 4096. ... During training, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95 and weight decay of 0.1. ... We set the maximum value of the learning rate to 3e-4, the minimum to 3e-6, and use a batch size of 6M tokens with a 4096 context length.
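The excerpt gives the maximum (3e-4) and minimum (3e-6) learning rates but not the schedule shape, so the sketch below assumes a standard cosine decay between those two endpoints; the warmup parameter and function name are likewise assumptions for illustration, not details from the paper.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-6, warmup=0):
    """Hypothetical schedule sketch: cosine decay from the reported
    lr_max=3e-4 down to lr_min=3e-6. The cosine shape and warmup
    are assumptions; the excerpt states only the two endpoints."""
    if step < warmup:
        # Linear warmup from 0 to lr_max (assumed, not stated).
        return lr_max * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

With a 6M-token batch, training till 700B tokens corresponds to roughly 700e9 / 6e6 ≈ 117K optimizer steps, which would set `total_steps` here.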