Augmenting Math Word Problems via Iterative Question Composing

Authors: Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew C Yao

AAAI 2025

Reproducibility assessment. Each row lists a variable, its result, and the supporting LLM response.
Research Type: Experimental. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version of GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that this improvement generalizes to unseen data. The ablation study on MMIQC reveals that a large part of the improvement can be attributed to the novel augmentation method, Iterative Question Composing (IQC), which iteratively composes new questions from seed problems using an LLM and applies rejection sampling through another LLM.
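The IQC loop described above can be sketched generically as follows. Here `compose_fn` and `verify_fn` are hypothetical placeholders standing in for the composing LLM and the rejection-sampling LLM; they are not the authors' actual prompts or models.

```python
def iterative_question_composing(seed_problems, compose_fn, verify_fn, n_iters=3):
    """Sketch of IQC: in each iteration, compose a new question from each
    question of the previous round and keep only candidates that pass
    rejection sampling; all accepted questions accumulate into the pool."""
    pool = list(seed_problems)
    current = list(seed_problems)
    for _ in range(n_iters):
        next_round = []
        for problem in current:
            candidate = compose_fn(problem)        # call to the composing LLM
            if candidate is not None and verify_fn(candidate):  # rejection sampling
                next_round.append(candidate)
        pool.extend(next_round)
        current = next_round  # next iteration composes from the new questions
    return pool
```

In this sketch the pool grows by one generation per iteration; the actual paper's Algorithm 1 should be consulted for the exact composition and filtering criteria.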
Researcher Affiliation: Academia. ¹Institute for Interdisciplinary Information Sciences, Tsinghua University; ²Shanghai Qi Zhi Institute. EMAIL, EMAIL
Pseudocode: Yes. Algorithm 1: Iterative Question Composing
Open Source Code: Yes. https://github.com/iiis-ai/IterativeQuestionComposing
Open Datasets: Yes. https://huggingface.co/datasets/Vivacem/MMIQC
Dataset Splits: Yes. For a fair comparison, the fine-tuned models are first evaluated on MATH (Hendrycks et al. 2021a), a competition-level math word problem benchmark with 5,000 test problems, in a zero-shot setting. The seed dataset is constructed from the samples in the MATH training set that do not contain Asymptote language in their question statements.
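The seed-dataset filter could be implemented roughly as below. This is a minimal sketch: it assumes MATH-style examples stored as dicts with a `problem` field and assumes Asymptote figures are marked by `[asy]` tags, which is the convention in the MATH dataset; the exact filtering rule used by the authors may differ.

```python
def build_seed_dataset(math_train):
    """Keep only MATH training problems whose statements contain no Asymptote code.

    math_train: iterable of dicts, each with a "problem" key (illustrative schema).
    """
    return [ex for ex in math_train if "[asy]" not in ex["problem"]]
```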
Hardware Specification: Yes. Employing DeepSpeed ZeRO Stage-3 (Rajbhandari et al. 2020), 7B models are fine-tuned on one node of 8x A800 GPUs with micro batch size 8 and gradient accumulation 4; 34B models on 2 nodes with micro batch size 4 and gradient accumulation 4; and 70B models on 4 nodes with micro batch size 4 and gradient accumulation 2, maintaining an effective batch size of 256.
Software Dependencies: No. The paper mentions the Hugging Face transformers library (Wolf et al. 2019) and SymPy (Meurer et al. 2017) but does not specify the version numbers used in the experiments; it only cites the original papers.
Experiment Setup: Yes. All models are fine-tuned on MMIQC for 1 epoch using a linear learning-rate schedule with a 3% warm-up ratio. The maximum learning rate is determined by a simple hyperparameter-selection experiment (Table 2) and set to 1e-5. Training uses the BFloat16 numerical format. Hardware, micro batch sizes, and gradient accumulation settings are as quoted under Hardware Specification, maintaining an effective batch size of 256.
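The three quoted configurations all multiply out to the stated effective batch size of 256 (number of GPUs x micro batch size x gradient accumulation steps), assuming 8 GPUs per node as stated for the 7B setup:

```python
def effective_batch_size(num_gpus, micro_batch, grad_accum):
    """Effective global batch size for data-parallel training."""
    return num_gpus * micro_batch * grad_accum

configs = {
    "7B (1 node, 8 GPUs)":   effective_batch_size(8, 8, 4),
    "34B (2 nodes, 16 GPUs)": effective_batch_size(16, 4, 4),
    "70B (4 nodes, 32 GPUs)": effective_batch_size(32, 4, 2),
}
# Each entry evaluates to 256, consistent with the paper's stated setup.
```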