Can LLMs Solve Longer Math Word Problems Better?
Authors: Xin Xu, Tong Xiao, Zitong Chao, Zhenya Huang, Can Yang, Yang Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study pioneers the investigation of Context Length Generalizability (CoLeG), which refers to the ability of LLMs to solve MWPs with extended narratives. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs featuring lengthy narratives, and propose two novel metrics to evaluate the efficacy and resilience of LLMs in tackling these problems. Our analysis of existing zero-shot prompting techniques with proprietary LLMs along with open-source LLMs reveals a general deficiency in CoLeG. [...] Our comprehensive results demonstrate the effectiveness of our proposed methods, showing improved performance on E-GSM. |
| Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology 2University of Science and Technology of China EMAIL, EMAIL EMAIL, EMAIL |
| Pseudocode | No | The paper describes the Condition-Retrieving Instruction (CoRe) and the auxiliary task of extension in prose, but does not present them in a structured pseudocode or algorithm block format. For example, Section 3.1 describes CoRe, and Section 3.2 describes the extension auxiliary task without pseudocode. |
| Open Source Code | Yes | https://github.com/XinXU-USTC/CoLeG-Math |
| Open Datasets | Yes | We introduce Extended Grade-School Math (E-GSM), a collection of MWPs featuring lengthy narratives [...] E-GSM will be released under MIT License for future research. [...] The test set can be accessed through this link [footnote 1 for https://github.com/openai/grade-school-math]. [...] Our entire training set includes 64,929 CoT data [...] This dataset will be made available under MIT License. [...] MAWPS (Koncel-Kedziorski et al., 2016) is a benchmark of MWPs, incorporating 238 test examples. It is under MIT License and can be found at https://github.com/LYH-YF/MWPToolkit. SVAMP (Patel et al., 2021) includes 1,000 simple MWPs, which is available at https://github.com/LYH-YF/MWPToolkit. It is under MIT License. GSM-IC (Shi et al., 2023) is a variant of GSM8K, including MWPs with one irrelevant sentence. The dataset is available at https://github.com/google-research-datasets/GSM-IC. |
| Dataset Splits | Yes | The process commences with the GSM8K (Cobbe et al., 2021) test set. [...] During the r-th iteration (1 ≤ r ≤ R, where R is the total number of extension rounds), the i-th question from the preceding iteration (r−1), denoted as q_i^{r−1} ∈ Q^{r−1}, where Q^{r−1} denotes the set of extended variants after extension round r−1, is extended using 2-shot demonstrations with GPT-4-turbo. [...] We get D0, which incorporates 38,507 valid CoT data points (we also include the GSM8K training set in D0), and D1, which includes 26,422 CoT data points for extended questions. The entire training set, represented as D = D0 ∪ D1, incorporates 64,929 CoT data points. [...] the GSM8K test set is divided into three subsets of equal size based on question length and the accuracy is calculated over each subset. |
| Hardware Specification | Yes | Experiments for LLMs with sizes 7B and 13B are conducted on 4 H800 GPUs. For the larger 70B model, necessitating more computational power, the experiments are carried out on 8 H800 GPUs (80G). |
| Software Dependencies | No | The paper implicitly relies on Python to fine-tune models such as LLaMA-2 and Mistral-7B, but does not specify its version. It also mentions the 'AdamW optimizer' and the 'vLLM library' (with a link provided), but no specific version numbers are given for these software components. It also states 'we employ all-mpnet-base-v2 as the sentence embedding model', which is a model, not a software library with a version. |
| Experiment Setup | Yes | In our experiments with the LLaMA-2 backbone, the learning rates are chosen based on the model scale. For LLMs with 7B and 13B parameters, the learning rate is set to 0.00002. For 70B LLMs, the learning rate is adjusted to 0.00001. For the Mistral-7B base model, the learning rate is further reduced to 0.000005 to maintain training stability. [...] Batch sizes are also tailored to LLM parameter counts to maximize the utilization of computational resources. Specifically, for the 70B model, we select a batch size of 24 per device. For models with 7B and 13B parameters, a larger batch size of 36 per device is chosen. All models undergo training for 3 epochs with the AdamW optimizer with a 3% learning rate warmup. |
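The length-based evaluation described in the Dataset Splits row (dividing the GSM8K test set into three equal-size subsets by question length, then scoring each subset) can be sketched as below. This is an illustrative sketch, not the authors' released code; function names and the choice of character length as the length measure are assumptions:

```python
def split_by_length(questions, n_bins=3):
    """Rank questions by length and cut the ranking into equal-size bins.

    Returns a list of n_bins index lists, from shortest to longest
    questions; any remainder after integer division goes to the last bin.
    """
    ranked = sorted(range(len(questions)), key=lambda i: len(questions[i]))
    bin_size = len(ranked) // n_bins
    bins = [ranked[k * bin_size:(k + 1) * bin_size] for k in range(n_bins)]
    bins[-1].extend(ranked[n_bins * bin_size:])
    return bins


def accuracy_per_bin(questions, correct, n_bins=3):
    """correct[i] is True iff the model solved questions[i]."""
    return [sum(correct[i] for i in b) / len(b)
            for b in split_by_length(questions, n_bins)]
```

Comparing the per-bin accuracies then shows whether performance degrades as question length grows, which is the behavior the paper's CoLeG analysis probes.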
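The hyperparameters quoted in the Experiment Setup row can be collected into a small lookup table, which makes the per-scale choices easy to audit. A minimal sketch; the names are assumptions, and the Mistral-7B batch size of 36 is inferred from the paper's 7B rule rather than stated explicitly:

```python
# (backbone, scale) -> (learning_rate, per_device_batch_size)
FINETUNE_CONFIG = {
    ("llama-2", "7b"):  (2e-5, 36),
    ("llama-2", "13b"): (2e-5, 36),
    ("llama-2", "70b"): (1e-5, 24),
    ("mistral", "7b"):  (5e-6, 36),  # assumed: 7B batch-size rule applies
}
NUM_EPOCHS = 3
WARMUP_RATIO = 0.03  # 3% learning-rate warmup, AdamW optimizer


def get_config(backbone, scale):
    """Return the training configuration for one backbone/scale pair."""
    lr, batch_size = FINETUNE_CONFIG[(backbone, scale)]
    return {
        "learning_rate": lr,
        "per_device_batch_size": batch_size,
        "num_epochs": NUM_EPOCHS,
        "warmup_ratio": WARMUP_RATIO,
    }
```

Such a dictionary could be passed straight into a trainer configuration (e.g. the keyword arguments of a training-arguments object) so that scale-dependent settings live in one place.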