Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
Authors: Yu Du, Fangyun Wei, Hongyang Zhang
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code is available at https://github.com/dyabel/AnyTool. |
| Researcher Affiliation | Collaboration | Yu Du*1, Fangyun Wei*2, Hongyang Zhang3; 1Tsinghua University, 2Microsoft Research Asia, 3University of Waterloo. |
| Pseudocode | No | The paper describes algorithms like DFSDT and CoT but does not present them in formal pseudocode blocks or explicitly labeled 'Algorithm' sections. |
| Open Source Code | Yes | Code is available at https://github.com/dyabel/AnyTool. |
| Open Datasets | Yes | We conduct experiments on two benchmarks: 1) ToolBench (Qin et al., 2023b); and 2) our own benchmark, termed AnyToolBench. ... To ensure that all queries in the benchmark, namely ToolBench (Qin et al., 2023b), are solvable using certain APIs from the API pool, we conduct a manual review of all queries. ... The process of creating AnyToolBench is detailed in Section A.8 of the appendix. |
| Dataset Splits | No | The paper does not explicitly provide training/validation dataset splits (e.g., percentages, sample counts, or specific predefined split citations) needed to reproduce data partitioning for their experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions software like GPT-4, GPT-3.5, and ChatGLM but does not specify their version numbers or other ancillary software dependencies with versions. |
| Experiment Setup | Yes | For the solver implementing DFSDT, we set the maximum number of API calls to 10. Additionally, for our AnyTool, we establish a limit of 200,000 tokens for efficiency. |
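The Experiment Setup row reports two concrete budgets: at most 10 API calls per solver run and a 200,000-token ceiling. A minimal sketch of how such budget enforcement could look is below; the `BudgetedSolver` class, its method names, and the word-count token proxy are illustrative assumptions, not AnyTool's actual implementation.

```python
# Sketch of the two budgets from the Experiment Setup row: a per-run cap on
# API calls (10 in the paper) and an overall token budget (200,000 in the
# paper). Class and method names are hypothetical, not from AnyTool's code.

MAX_API_CALLS = 10
TOKEN_BUDGET = 200_000


class BudgetExceeded(Exception):
    """Raised when either the call cap or the token budget is hit."""


class BudgetedSolver:
    def __init__(self, max_calls=MAX_API_CALLS, token_budget=TOKEN_BUDGET):
        self.max_calls = max_calls
        self.token_budget = token_budget
        self.calls = 0
        self.tokens = 0

    def call_api(self, api_fn, *args):
        """Invoke one tool API, enforcing both budgets before/after the call."""
        if self.calls >= self.max_calls:
            raise BudgetExceeded("max API calls reached")
        self.calls += 1
        response = api_fn(*args)
        # Crude token proxy: whitespace word count stands in for a tokenizer.
        self.tokens += len(str(response).split())
        if self.tokens > self.token_budget:
            raise BudgetExceeded("token budget exhausted")
        return response


# Usage: a tiny budget makes the cap observable.
solver = BudgetedSolver(max_calls=2, token_budget=50)
solver.call_api(lambda x: f"echo {x}", "hello")
solver.call_api(lambda x: f"echo {x}", "again")
try:
    solver.call_api(lambda x: x, "third call")
except BudgetExceeded as exc:
    print(exc)  # max API calls reached
```

Tracking both counters in one place mirrors the paper's framing: the DFS-style solver stops expanding a branch once either budget is exhausted, rather than failing only at the token level.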