RMath: A Logic Reasoning-Focused Dataset Toward Mathematical Multistep Reasoning Tasks

Authors: Ziyi Hu, Jun Liu, Zhongzhi Liu, Yuzhong Liu, Zheng Xie, Yiping Song

AAAI 2025

Reproducibility Variable — Result — Evidence from LLM Response
Research Type — Experimental. Evidence: "Finally, we evaluate RMath on several popular LLMs and present the corresponding results." From the Experiments section (LLMs on RMath): "Experimental setting. We use RMath to train and test a range of large models with parameter sizes ranging from 7 billion, 8 billion, and 13 billion to 70 billion. ... Results. Table 2 shows the performance of various LLMs on our dataset RMath and other datasets related to mathematical problems, which assess the abilities of LLMs from different perspectives."
Researcher Affiliation — Academia. Evidence: Ziyi Hu1, Jun Liu2, Zhongzhi Liu1, Yuzhong Liu1, Zheng Xie1, Yiping Song1* — 1National University of Defense Technology, Changsha, China; 2Sun Yat-sen University, Zhuhai, China
Pseudocode — Yes. Evidence: "The flow chart of proposition connection and judgment is shown in Figure 3, and the process is divided into nine steps. Initially, input the three types of propositions (step 1). Then, starting with the propositions in Class C, assume their truth values and connect them with the propositions in Class A to check for contradictions. If there is a contradiction, revise the assumption for the Class C propositions; if not, based on this assumption and according to the problem's requirements on the number of true or false propositions, hypothesize the truth values of all Class B propositions one by one. Check whether the hypotheses for the Class B propositions contradict one another, then connect them with the propositions in Class A (Loop-A-Contradict-B: steps 4-6) and Class C and check for contradictions. If there is a contradiction, check whether all hypothesis combinations of truth values for the Class B propositions have been cycled; if not, re-assume the Class B propositions; otherwise, re-assume the Class C propositions (Loop-C-Tra-B: steps 4-8). Finally, output the correct answer: the proposition in Class C that is true and consistent with the problem's requirements. Step 1: Input propositions in Classes A, B, and C. Step 2: Assume a truth value for the propositions in Class C. ... Step 9: Output the true propositions in Class C as the correct answer."
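The nine-step loop quoted above amounts to a search over truth assignments with a consistency check. The sketch below is a hypothetical illustration of that search, not the authors' implementation: propositions are modeled as named booleans, and the `consistent` callback stands in for the paper's contradiction checks against Class A and Class C.

```python
from itertools import product

def solve(class_b_names, consistent, required_true):
    """Brute-force version of the proposition-connection loop:
    try each truth assignment for Class C (step 2) and every
    combination for Class B (steps 4-8), keeping assignments that
    pass the contradiction check and match the required count of
    true propositions (step 9 outputs the survivors)."""
    solutions = []
    for c_value in (True, False):  # step 2: assume Class C's truth value
        # steps 4-8: cycle through all Class B hypothesis combinations
        for b_values in product((True, False), repeat=len(class_b_names)):
            assignment = dict(zip(class_b_names, b_values), C=c_value)
            # steps 3-7: connect with Class A / C and check for contradictions
            if not consistent(assignment):
                continue
            # problem requirement on the number of true propositions
            if sum(assignment.values()) == required_true:
                solutions.append(assignment)
    return solutions  # step 9: output the consistent assignments
```

For example, with two hypothetical Class B propositions B1 = "C is true" and B2 = "B1 is false", `solve(["B1", "B2"], lambda a: a["B1"] == a["C"] and a["B2"] != a["B1"], 2)` returns the single assignment in which C and B1 are true and B2 is false.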
Open Source Code — Yes. Evidence: "Our dataset and code are available at: https://github.com/huziyi19/RMath"
Open Datasets — Yes. Evidence: "In this paper, we construct RMath, a dataset specifically for multistep reasoning tasks..." and "Our dataset and code are available at: https://github.com/huziyi19/RMath"
Dataset Splits — No. The paper mentions creating a training set, "RMath-train", from RMath for prompt tuning, but it does not specify the percentages, sample counts, or methodology for splitting RMath into training, validation, or test sets, either for its evaluations or for the construction of RMath-train.
Hardware Specification — No. The paper evaluates LLMs with parameter sizes from 7 billion to 70 billion but does not describe the hardware (e.g., GPU models, CPU types, memory) used for these evaluations or for training.
Software Dependencies — No. The paper lists the LLMs used (e.g., Llama2, Llama3, WizardMath, MetaMath, ToRA) and mentions prompt tuning, but it specifies no software dependencies with version numbers, such as the Python, PyTorch, or CUDA versions.
Experiment Setup — No. The paper describes the general experimental approach (prompt tuning on RMath-train) and the LLMs used, but it omits specific setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) that would be needed for reproduction.