Automated Creation of Reusable and Diverse Toolsets for Enhancing LLM Reasoning
Authors: Zhiyuan Ma, Zhenya Huang, Jiayu Liu, Minmao Wang, Hongke Zhao, Xin Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on 9 datasets across three challenging reasoning tasks: mathematics MATH (Hendrycks et al. 2021), table question-answering TabMWP (Lu et al. 2023), and science problems SCIBENCH (Wang et al. 2024a). We observe that KTCE consistently outperforms competitive baselines, improving reasoning accuracy by 6.23% to 18.49%. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3 College of Management and Economics, Tianjin University; 4 iFLYTEK AI Research. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: The KTCE Framework Input: Dataset D, toolset size k, max iter N Output: Optimized toolset T |
| Open Source Code | Yes | Code https://github.com/zhymma/KTCE |
| Open Datasets | Yes | Mathematical Reasoning: We use the MATH dataset (Hendrycks et al. 2021) to test LLMs' text-based numerical reasoning. Tabular Reasoning: The TabMWP dataset (Lu et al. 2023) is employed to assess LLMs' capability in processing structured tabular data and performing reasoning calculations. Scientific Reasoning: We utilize the SCIBENCH dataset (Wang et al. 2024a) to examine numerical reasoning abilities in complex scientific contexts. |
| Dataset Splits | Yes | Mathematical Reasoning: We use the MATH dataset (Hendrycks et al. 2021)... It has 7,500 training and 5,000 test problems... Tabular Reasoning: The TabMWP dataset (Lu et al. 2023)... with a test set of 1,000 problems. Scientific Reasoning: We utilize the SCIBENCH dataset (Wang et al. 2024a)... We randomly select 100 problems for testing and use the remainder for training. |
| Hardware Specification | No | The paper mentions using 'GPT-3.5-Turbo', 'GPT-4o Mini', and 'DeepSeek-Coder' for experiments, but does not provide specific hardware details like GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software components like 'GPT-3.5-Turbo', 'BGE-M3 model (Chen et al. 2024)', 'SymPy', 'Pandas', 'GPT-4o Mini', and 'DeepSeek-Coder', but does not provide specific version numbers for these tools or libraries. |
| Experiment Setup | Yes | In Section 3.2, we extract at most 3 knowledge triplets K per problem. For Section 3.3, we sample up to 100 problems per topic-concept pair, with at most 5 iterations. Early stopping is applied if L didn't decrease for 3 consecutive iterations. The desired toolset size k is 10. For KA (in Section 3.4) and all baselines, to better utilize the tools, we allow up to three sampling attempts until the solution code successfully compiles. |
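The Algorithm 1 signature quoted in the Pseudocode row (input: dataset D, toolset size k, max iterations N; output: optimized toolset T) combined with the early-stopping rule in the Experiment Setup row can be sketched as a minimal Python loop. This is an illustrative reconstruction only: `propose_tools` and `evaluate_loss` are hypothetical stand-ins for the paper's LLM-driven tool-creation and evaluation steps, not the authors' implementation.

```python
# Minimal sketch of the Algorithm 1 loop described in the table above.
# `propose_tools` and `evaluate_loss` are hypothetical placeholders for
# the paper's LLM-based steps; only the control flow mirrors the setup:
# at most `max_iter` iterations, toolset size k, and early stopping when
# the loss L fails to decrease for `patience` consecutive iterations.

def propose_tools(dataset, toolset):
    # Stand-in for LLM tool creation: one candidate "tool" per problem.
    return sorted(set(toolset) | {f"tool_for_{p}" for p in dataset})

def evaluate_loss(toolset, dataset):
    # Stand-in loss: number of problems not covered by any tool.
    return sum(1 for p in dataset if f"tool_for_{p}" not in toolset)

def ktce(dataset, k=10, max_iter=5, patience=3):
    """Return an optimized toolset T of size <= k for dataset D."""
    toolset, best_loss, stall = [], float("inf"), 0
    for _ in range(max_iter):
        candidates = propose_tools(dataset, toolset)
        toolset = candidates[:k]                 # keep at most k tools
        loss = evaluate_loss(toolset, dataset)
        if loss < best_loss:
            best_loss, stall = loss, 0
        else:
            stall += 1
            if stall >= patience:                # early stopping on stalled L
                break
    return toolset

tools = ktce(["p1", "p2"], k=10)
```

With the toy stubs above, the loop converges on the first iteration and then stops early once the loss has stalled for three iterations; the real stopping behaviour depends entirely on the paper's actual loss.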
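The "up to three sampling attempts until the solution code successfully compiles" protocol in the Experiment Setup row can likewise be sketched. Assumptions: `sample_solution` is a hypothetical stand-in for an LLM call, and "compiles" is read as a Python syntax check via the built-in `compile`; the paper may additionally require successful execution.

```python
# Sketch of the retry protocol from the setup above: resample the LLM
# solution up to `max_attempts` times until the code compiles.
# `sample_solution` is an illustrative stand-in for the model call.

def sample_until_compiles(sample_solution, max_attempts=3):
    """Return the first sampled code string that compiles, else None."""
    for _ in range(max_attempts):
        code = sample_solution()
        try:
            compile(code, "<llm-solution>", "exec")  # syntax check only
            return code
        except SyntaxError:
            continue  # compilation failed: spend another attempt
    return None

# Usage with a stand-in sampler that fails once, then succeeds.
attempts = iter(["def broken(:", "answer = 1 + 1"])
result = sample_until_compiles(lambda: next(attempts))
```

The budget of three attempts matters for fair comparison: the table notes it is applied to KA and to all baselines alike.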