Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation
Authors: Xiaoqiang Kang, Zimu Wang, Xiaobo Jin, Wei Wang, Kaizhu Huang, Qiufeng Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through the proposed framework, we construct a high-quality dataset TabMWP-TELL by adhering to the question types in the TabMWP dataset, and we conduct extensive experiments on a variety of LLMs to demonstrate the effectiveness of TabMWP-TELL in improving TMWP-solving performance. |
| Researcher Affiliation | Academia | 1School of Advanced Technology, Xi'an Jiaotong-Liverpool University 2University of Liverpool 3Duke Kunshan University EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and processes through figures (Figure 2, Figure 3, Figure 4) and textual descriptions, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/Jason8Kang/TELL |
| Open Datasets | Yes | We conduct evaluations on TabMWP (Lu et al. 2023b), a recent large-scale dataset containing 38,431 grade-level MWPs with tabular context, whose statistics are presented in Table 1. |
| Dataset Splits | Yes | We conduct evaluations on TabMWP (Lu et al. 2023b), a recent large-scale dataset containing 38,431 grade-level MWPs with tabular context, whose statistics are presented in Table 1. Table 1: Statistics of the TabMWP dataset. #Question: Train 23,059, Valid 7,686, Test 7,686, Total 38,431. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA GeForce RTX 3090 graphics cards. |
| Software Dependencies | No | The paper mentions using XTuner for QLoRA and specific LLMs (Yi, Mistral, Qwen2, Llama 3), but it does not provide version numbers for these or for other key software such as Python or PyTorch, which would be necessary for reproducibility. |
| Experiment Setup | Yes | During the fine-tuning process, we set the number of epochs as 2, the batch size per device as 12, the gradient accumulation steps as 4, and the learning rate as 2e-4. |
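Combining the reported fine-tuning hyperparameters with the 8-GPU hardware setup, a minimal sketch of the implied effective batch size (assuming data-parallel training across all 8 cards, which the paper does not state explicitly):

```python
# Reported fine-tuning setup: per-device batch size 12,
# gradient accumulation 4; hardware: 8x RTX 3090.
per_device_batch = 12
grad_accum_steps = 4
num_gpus = 8  # assumption: all 8 GPUs used in data parallel

# Effective global batch size per optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 384
```

Under this assumption, one optimizer step at learning rate 2e-4 covers 384 training examples.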