Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
Authors: Yu Du, Fangyun Wei, Hongyang Zhang
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code is available at https://github.com/dyabel/AnyTool. |
| Researcher Affiliation | Collaboration | Yu Du*1, Fangyun Wei*2, Hongyang Zhang3; 1Tsinghua University, 2Microsoft Research Asia, 3University of Waterloo. |
| Pseudocode | No | The paper describes algorithms like DFSDT and CoT but does not present them in formal pseudocode blocks or explicitly labeled 'Algorithm' sections. |
| Open Source Code | Yes | Code is available at https://github.com/dyabel/AnyTool. |
| Open Datasets | Yes | We conduct experiments on two benchmarks: 1) ToolBench (Qin et al., 2023b); and 2) our own benchmark, termed AnyToolBench. ... To ensure that all queries in the benchmark, namely ToolBench (Qin et al., 2023b), are solvable using certain APIs from the API pool, we conduct a manual review of all queries. ... The process of creating AnyToolBench is detailed in Section A.8 of the appendix. |
| Dataset Splits | No | The paper does not explicitly provide training/validation dataset splits (e.g., percentages, sample counts, or specific predefined split citations) needed to reproduce data partitioning for their experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions software like GPT-4, GPT-3.5, and ChatGLM but does not specify their version numbers or other ancillary software dependencies with versions. |
| Experiment Setup | Yes | For the solver implementing DFSDT, we set the maximum number of API calls to 10. Additionally, for our AnyTool, we establish a limit of 200,000 tokens for efficiency. |
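The Experiment Setup row reports two concrete budgets: at most 10 API calls per solver run and a 200,000-token ceiling. A minimal sketch of how such budget enforcement could look is below; the `BudgetedSolver` class, its method names, and the word-count token proxy are illustrative assumptions, not AnyTool's actual implementation.

```python
# Sketch of the two budgets from the Experiment Setup row: a per-run cap on
# API calls (10 in the paper) and an overall token budget (200,000 in the
# paper). Class and method names are hypothetical, not from AnyTool's code.

MAX_API_CALLS = 10
TOKEN_BUDGET = 200_000


class BudgetExceeded(Exception):
    """Raised when either the call cap or the token budget is hit."""


class BudgetedSolver:
    def __init__(self, max_calls=MAX_API_CALLS, token_budget=TOKEN_BUDGET):
        self.max_calls = max_calls
        self.token_budget = token_budget
        self.calls = 0
        self.tokens = 0

    def call_api(self, api_fn, *args):
        """Invoke one tool API, enforcing both budgets before/after the call."""
        if self.calls >= self.max_calls:
            raise BudgetExceeded("max API calls reached")
        self.calls += 1
        response = api_fn(*args)
        # Crude token proxy: whitespace word count stands in for a tokenizer.
        self.tokens += len(str(response).split())
        if self.tokens > self.token_budget:
            raise BudgetExceeded("token budget exhausted")
        return response


# Usage: a tiny budget makes the cap observable.
solver = BudgetedSolver(max_calls=2, token_budget=50)
solver.call_api(lambda x: f"echo {x}", "hello")
solver.call_api(lambda x: f"echo {x}", "again")
try:
    solver.call_api(lambda x: x, "third call")
except BudgetExceeded as exc:
    print(exc)  # max API calls reached
```

Tracking both counters in one place mirrors the paper's framing: the DFS-style solver stops expanding a branch once either budget is exhausted, rather than failing only at the token level.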