Learning Evolving Tools for Large Language Models

Authors: Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Yesai Wu, Yankai Lin, Wenzheng Feng, Yasheng Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate the effectiveness and stability of our approach, highlighting the importance of adaptability to tool variability for effective tool learning."
Researcher Affiliation | Collaboration | Institute of Computing Technology, Chinese Academy of Sciences; Tsinghua University; Renmin University of China; Huawei Noah's Ark Lab
Pseudocode | Yes | "Algorithm 1 delineates our customized MCTS process."
Open Source Code | Yes | "Our code is available at https://github.com/Chen-GX/ToolEVO."
Open Datasets | Yes | "Furthermore, for research purposes, we construct a new benchmark ToolQA-D based on ToolQA (Zhuang et al., 2023b) to investigate the impact of tool variability. ... The ToolQA-D benchmark is provided at https://github.com/Chen-GX/ToolEVO."
Dataset Splits | Yes | "Ultimately, our ToolQA-D comprises 7 datasets and 3 sets of API usage (P_c, P_s^in and P_s^OOD), accompanied by a total of 6,234 and 5,884 training samples, 700 and 700 development samples, and 700 and 730 test samples for the Easy and Hard difficulty respectively."
Hardware Specification | Yes | "All experiments are conducted on Ubuntu 22.04 equipped with NVIDIA A100 GPUs."
Software Dependencies | Yes | "Our code mainly depends on Python 3.11 and PyTorch 2.3.0."
Experiment Setup | Yes | "For MCTS, we set c_puct to 1.25, consistent with Silver et al. (2016). We limit the maximum depth of each tree to 15, and set k to 5, which indicates that we will expand 5 child nodes during the expansion phase. ... For self-improved training, we configure a batch size of 512, a learning rate of 2e-5, and specify the training epoch of 8. ... We set the maximum sequence length to 1024 and use a cosine learning rate scheduler with a warm-up rate of 0.03."
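The reported MCTS hyperparameters (c_puct = 1.25, maximum tree depth of 15, k = 5 children per expansion) can be illustrated with a minimal, generic PUCT selection-and-expansion sketch. This is not the paper's Algorithm 1; the `Node` class, `expand`, and the example priors are hypothetical illustrations of how these three constants typically enter an MCTS loop.

```python
import math

C_PUCT = 1.25       # exploration constant, as reported (following Silver et al., 2016)
MAX_DEPTH = 15      # maximum tree depth, as reported
NUM_CHILDREN = 5    # k = 5 child nodes expanded per expansion phase

class Node:
    """A hypothetical MCTS tree node (not the paper's implementation)."""
    def __init__(self, prior, parent=None):
        self.prior = prior          # policy prior P(s, a) for reaching this node
        self.parent = parent
        self.children = []
        self.visit_count = 0
        self.value_sum = 0.0

    def value(self):
        # Mean action value Q(s, a); defined as 0 for unvisited nodes.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct_score(parent, child):
    # Standard PUCT rule: Q + c_puct * P * sqrt(N_parent) / (1 + N_child).
    exploration = (C_PUCT * child.prior
                   * math.sqrt(parent.visit_count) / (1 + child.visit_count))
    return child.value() + exploration

def select_child(node):
    # Selection phase: pick the child maximizing the PUCT score.
    return max(node.children, key=lambda c: puct_score(node, c))

def expand(node, priors):
    # Expansion phase: attach the top-k (k = 5) candidate actions as children.
    for p in sorted(priors, reverse=True)[:NUM_CHILDREN]:
        node.children.append(Node(prior=p, parent=node))

def select_path(root):
    # Walk down with PUCT until reaching a leaf or the depth limit of 15.
    node, depth, path = root, 0, [root]
    while node.children and depth < MAX_DEPTH:
        node = select_child(node)
        path.append(node)
        depth += 1
    return path
```

For example, expanding a visited root with six candidate priors keeps only the five largest, and with all children unvisited (Q = 0) the PUCT score reduces to the exploration term, so selection follows the highest prior.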