Augmenting Math Word Problems via Iterative Question Composing

Authors: Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew C Yao

AAAI 2025

Reproducibility assessment. Each row lists a variable, its result, and the supporting LLM response.
Research Type: Experimental. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version of GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that this improvement generalizes to unseen data. The ablation study on MMIQC reveals that a large part of the improvement can be attributed to the novel augmentation method, Iterative Question Composing (IQC), which iteratively composes new questions from seed problems using an LLM and applies rejection sampling through another LLM.
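The IQC loop described above can be sketched generically as follows. Here `compose_fn` and `verify_fn` are hypothetical placeholders standing in for the composing LLM and the rejection-sampling LLM; they are not the authors' actual prompts or models.

```python
def iterative_question_composing(seed_problems, compose_fn, verify_fn, n_iters=3):
    """Sketch of IQC: in each iteration, compose a new question from each
    question of the previous round and keep only candidates that pass
    rejection sampling; all accepted questions accumulate into the pool."""
    pool = list(seed_problems)
    current = list(seed_problems)
    for _ in range(n_iters):
        next_round = []
        for problem in current:
            candidate = compose_fn(problem)        # call to the composing LLM
            if candidate is not None and verify_fn(candidate):  # rejection sampling
                next_round.append(candidate)
        pool.extend(next_round)
        current = next_round  # next iteration composes from the new questions
    return pool
```

In this sketch the pool grows by one generation per iteration; the actual paper's Algorithm 1 should be consulted for the exact composition and filtering criteria.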
Researcher Affiliation: Academia. ¹Institute for Interdisciplinary Information Sciences, Tsinghua University; ²Shanghai Qi Zhi Institute. EMAIL, EMAIL
Pseudocode: Yes. Algorithm 1: Iterative Question Composing
Open Source Code: Yes. https://github.com/iiis-ai/IterativeQuestionComposing
Open Datasets: Yes. https://huggingface.co/datasets/Vivacem/MMIQC
Dataset Splits: Yes. For a fair comparison, the fine-tuned models are first evaluated on MATH (Hendrycks et al. 2021a), a competition-level math word problem benchmark with 5,000 test problems, in a zero-shot setting. The seed dataset is constructed from the samples in the MATH training set that do not contain Asymptote language in their question statements.
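The seed-dataset filter could be implemented roughly as below. This is a minimal sketch: it assumes MATH-style examples stored as dicts with a `problem` field and assumes Asymptote figures are marked by `[asy]` tags, which is the convention in the MATH dataset; the exact filtering rule used by the authors may differ.

```python
def build_seed_dataset(math_train):
    """Keep only MATH training problems whose statements contain no Asymptote code.

    math_train: iterable of dicts, each with a "problem" key (illustrative schema).
    """
    return [ex for ex in math_train if "[asy]" not in ex["problem"]]
```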
Hardware Specification: Yes. Employing DeepSpeed ZeRO Stage-3 (Rajbhandari et al. 2020), 7B models are fine-tuned on one node of 8x A800 GPUs with micro batch size 8 and gradient accumulation 4; 34B models on 2 nodes with micro batch size 4 and gradient accumulation 4; and 70B models on 4 nodes with micro batch size 4 and gradient accumulation 2, maintaining an effective batch size of 256.
Software Dependencies: No. The paper mentions the Hugging Face transformers library (Wolf et al. 2019) and SymPy (Meurer et al. 2017) but does not specify the version numbers used in the experiments; it only cites the original papers.
Experiment Setup: Yes. All models are fine-tuned on MMIQC for 1 epoch using a linear learning-rate schedule with a 3% warm-up ratio. The maximum learning rate is determined by a simple hyperparameter-selection experiment (Table 2) and set to 1e-5. Training uses the BFloat16 numerical format. Hardware, micro batch sizes, and gradient accumulation settings are as quoted under Hardware Specification, maintaining an effective batch size of 256.
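The three quoted configurations all multiply out to the stated effective batch size of 256 (number of GPUs x micro batch size x gradient accumulation steps), assuming 8 GPUs per node as stated for the 7B setup:

```python
def effective_batch_size(num_gpus, micro_batch, grad_accum):
    """Effective global batch size for data-parallel training."""
    return num_gpus * micro_batch * grad_accum

configs = {
    "7B (1 node, 8 GPUs)":   effective_batch_size(8, 8, 4),
    "34B (2 nodes, 16 GPUs)": effective_batch_size(16, 4, 4),
    "70B (4 nodes, 32 GPUs)": effective_batch_size(32, 4, 2),
}
# Each entry evaluates to 256, consistent with the paper's stated setup.
```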