Automated Creation of Reusable and Diverse Toolsets for Enhancing LLM Reasoning
Authors: Zhiyuan Ma, Zhenya Huang, Jiayu Liu, Minmao Wang, Hongke Zhao, Xin Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on 9 datasets across three challenging reasoning tasks: mathematics MATH (Hendrycks et al. 2021), table question-answering TabMWP (Lu et al. 2023), and science problems SCIBENCH (Wang et al. 2024a). We observe that KTCE consistently outperforms competitive baselines, improving reasoning accuracy by 6.23% to 18.49%. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3 College of Management and Economics, Tianjin University; 4 iFLYTEK AI Research. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: The KTCE Framework Input: Dataset D, toolset size k, max iter N Output: Optimized toolset T |
| Open Source Code | Yes | Code https://github.com/zhymma/KTCE |
| Open Datasets | Yes | Mathematical Reasoning: We use the MATH dataset (Hendrycks et al. 2021) to test LLMs' text-based numerical reasoning. Tabular Reasoning: The TabMWP dataset (Lu et al. 2023) is employed to assess LLMs' capability in processing structured tabular data and performing reasoning calculations. Scientific Reasoning: We utilize the SCIBENCH dataset (Wang et al. 2024a) to examine numerical reasoning abilities in complex scientific contexts. |
| Dataset Splits | Yes | Mathematical Reasoning: We use the MATH dataset (Hendrycks et al. 2021)... It has 7,500 training and 5,000 test problems... Tabular Reasoning: The TabMWP dataset (Lu et al. 2023)... with a test set of 1,000 problems. Scientific Reasoning: We utilize the SCIBENCH dataset (Wang et al. 2024a)... We randomly select 100 problems for testing and use the remainder for training. |
| Hardware Specification | No | The paper mentions using 'GPT-3.5-Turbo', 'GPT-4o Mini', and 'DeepSeek-Coder' for experiments, but does not provide specific hardware details like GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software components like 'GPT-3.5-Turbo', 'BGE-M3 model (Chen et al. 2024)', 'SymPy', 'Pandas', 'GPT-4o Mini', and 'DeepSeek-Coder', but does not provide specific version numbers for these tools or libraries. |
| Experiment Setup | Yes | In Section 3.2, we extract at most 3 knowledge triplets K per problem. For Section 3.3, we sample up to 100 problems per topic-concept pair, with at most 5 iterations. Early stopping is applied if L didn't decrease for 3 consecutive iterations. The desired toolset size k is 10. For KA (in Section 3.4) and all baselines, to better utilize the tools, we allow up to three sampling attempts until the solution code successfully compiles. |
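The Algorithm 1 signature quoted in the Pseudocode row (input: dataset D, toolset size k, max iterations N; output: optimized toolset T) combined with the early-stopping rule in the Experiment Setup row can be sketched as a minimal Python loop. This is an illustrative reconstruction only: `propose_tools` and `evaluate_loss` are hypothetical stand-ins for the paper's LLM-driven tool-creation and evaluation steps, not the authors' implementation.

```python
# Minimal sketch of the Algorithm 1 loop described in the table above.
# `propose_tools` and `evaluate_loss` are hypothetical placeholders for
# the paper's LLM-based steps; only the control flow mirrors the setup:
# at most `max_iter` iterations, toolset size k, and early stopping when
# the loss L fails to decrease for `patience` consecutive iterations.

def propose_tools(dataset, toolset):
    # Stand-in for LLM tool creation: one candidate "tool" per problem.
    return sorted(set(toolset) | {f"tool_for_{p}" for p in dataset})

def evaluate_loss(toolset, dataset):
    # Stand-in loss: number of problems not covered by any tool.
    return sum(1 for p in dataset if f"tool_for_{p}" not in toolset)

def ktce(dataset, k=10, max_iter=5, patience=3):
    """Return an optimized toolset T of size <= k for dataset D."""
    toolset, best_loss, stall = [], float("inf"), 0
    for _ in range(max_iter):
        candidates = propose_tools(dataset, toolset)
        toolset = candidates[:k]                 # keep at most k tools
        loss = evaluate_loss(toolset, dataset)
        if loss < best_loss:
            best_loss, stall = loss, 0
        else:
            stall += 1
            if stall >= patience:                # early stopping on stalled L
                break
    return toolset

tools = ktce(["p1", "p2"], k=10)
```

With the toy stubs above, the loop converges on the first iteration and then stops early once the loss has stalled for three iterations; the real stopping behaviour depends entirely on the paper's actual loss.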
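The "up to three sampling attempts until the solution code successfully compiles" protocol in the Experiment Setup row can likewise be sketched. Assumptions: `sample_solution` is a hypothetical stand-in for an LLM call, and "compiles" is read as a Python syntax check via the built-in `compile`; the paper may additionally require successful execution.

```python
# Sketch of the retry protocol from the setup above: resample the LLM
# solution up to `max_attempts` times until the code compiles.
# `sample_solution` is an illustrative stand-in for the model call.

def sample_until_compiles(sample_solution, max_attempts=3):
    """Return the first sampled code string that compiles, else None."""
    for _ in range(max_attempts):
        code = sample_solution()
        try:
            compile(code, "<llm-solution>", "exec")  # syntax check only
            return code
        except SyntaxError:
            continue  # compilation failed: spend another attempt
    return None

# Usage with a stand-in sampler that fails once, then succeeds.
attempts = iter(["def broken(:", "answer = 1 + 1"])
result = sample_until_compiles(lambda: next(attempts))
```

The budget of three attempts matters for fair comparison: the table notes it is applied to KA and to all baselines alike.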