Agent Skill Acquisition for Large Language Models via CycleQD

Authors: So Kuroki, Taishi Nakamura, Takuya Akiba, Yujin Tang

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical results from AgentBench indicate that applying CycleQD to LLAMA3-8B-INSTRUCT based models not only enables them to surpass traditional fine-tuning methods in coding, operating systems, and database tasks, but also achieves performance on par with GPT-3.5-TURBO, which potentially contains many more parameters, across these domains. Crucially, this enhanced performance is achieved while retaining robust language capabilities, as evidenced by its performance on widely adopted language benchmark tasks. We highlight the key design choices in CycleQD, detailing how these contribute to its effectiveness. Furthermore, our method is general and can be applied to image segmentation models, highlighting its applicability across different domains.
Researcher Affiliation Industry So Kuroki, Taishi Nakamura, Takuya Akiba, Yujin Tang Sakana AI, Japan EMAIL
Pseudocode Yes Algorithm 1 CycleQD
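Based on the quoted setup (tasks take turns serving as the quality metric while the remaining tasks act as behavior characteristics, with model-merging-based crossover and SVD-based mutation), the loop of Algorithm 1 might be sketched as follows. This is a toy illustration, not the paper's implementation: a "model" here is just a dict of per-task scores in [0, 1], whereas real CycleQD operates on LLM parameters, and the stand-in crossover/mutation operators are hypothetical.

```python
import random

def cycle_qd(experts, tasks, generations=300, bins=15, rng=random.Random(0)):
    """Toy sketch of the CycleQD loop: one task is the quality metric each
    generation, the others are behavior characteristics (BCs); roles rotate."""
    def fitness(m, t):
        return m[t]

    def bc_key(m, bc_tasks):
        # Discretize BC scores into a fixed number of bins per axis.
        return tuple(min(int(m[t] * bins), bins - 1) for t in bc_tasks)

    def insert(archive, m, q_task, bc_tasks):
        # Keep at most one model per bin: the fittest on the quality task.
        key = bc_key(m, bc_tasks)
        if key not in archive or fitness(m, q_task) > fitness(archive[key], q_task):
            archive[key] = m

    def crossover(a, b):
        # Stand-in for model-merging-based crossover: average the scores.
        return {t: (a[t] + b[t]) / 2 for t in tasks}

    def mutate(m):
        # Stand-in for SVD-based mutation: small clamped Gaussian noise.
        return {t: min(max(m[t] + rng.gauss(0.0, 0.03), 0.0), 1.0) for t in tasks}

    archive = {}
    for m in experts:  # seed the archive with the expert models
        insert(archive, m, tasks[0], tasks[1:])

    for gen in range(generations):
        q_task = tasks[gen % len(tasks)]  # cycle the quality metric
        bc_tasks = [t for t in tasks if t != q_task]
        a = rng.choice(list(archive.values()))  # paper uses elite sampling here
        b = rng.choice(list(archive.values()))
        insert(archive, mutate(crossover(a, b)), q_task, bc_tasks)
    return archive
```

Note the simplification: this toy keeps a single archive, while uniform-random parent selection replaces the paper's elite sampling (with its α parameters).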
Open Source Code Yes https://github.com/SakanaAI/CycleQD
Open Datasets Yes We adopt the MBPP+ (Mostly Basic Python Programming) dataset from EvalPlus (Liu et al., 2023a) for coding skill development, and utilize the OS and DB datasets from AgentBench (Liu et al., 2024) for training. For the coding task, we optimize and report the pass@1 metric, whereas in the OS and DB tasks we use the success rate. Please refer to Appendix A.1.1 for extra and detailed task-related setups. Besides its applications to LLMs, CycleQD serves as a versatile method for integrating expert models across various data modalities beyond text. For example, we include a vision question answering (VQA) task in addition to the CS tasks and find CycleQD able to outperform the experts (see Section A.1.5). In this experiment, we go further and extend CycleQD to the fusion of multiple Segment Anything Models (SAM), which are state-of-the-art computer vision models designed for image segmentation tasks. Specifically, our objective is to merge pairs of SAM models, A and B, to create models whose capabilities encompass the skill sets of both A and B. For Camouflaged Object Segmentation, we use three datasets: COD10K (Fan et al., 2020a), CHAMELEON (Skurowski et al., 2018), and CAMO (Le et al., 2019). Following Fan et al. (2020a), we train on a combined dataset consisting of the 4040 training images from COD10K and CAMO for 20 epochs, randomly splitting 10% of the images from the training set for validation. The model is then tested on the 250 CAMO test images. For Polyp Segmentation, we use two datasets: Kvasir (Jha et al., 2019) and CVC-ClinicDB/CVC612 (Bernal et al., 2015). Following Fan et al. (2020b), we divide the images into a 9:1 ratio for training and testing, resulting in 1450 training images. We then randomly split 20% of the training set for validation. The model is trained for 30 epochs and tested on the 101 Kvasir test images. For Skin Lesion Segmentation, we use the ISIC 2017 dataset (Codella et al., 2018). We train the model on the 2000 training and 150 validation images for 30 epochs and evaluate it on the 600 test images. For Leaf Segmentation, we use the Leaf Disease Segmentation Dataset (Rath, 2023). We train the model on the 498 training images, using 80% for training and 20% for validation, for 30 epochs, and evaluate it on 90 test images.
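The pass@1 metric mentioned for the coding task is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021); whether the authors use this exact estimator is an assumption, but with a single sample per problem it reduces to the plain solve rate:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, passes. With k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0  # too few failures for k draws to miss every success
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 of 10 generations correct -> pass@1 = 0.5
```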
Dataset Splits Yes To ensure that the experts have the same task metrics across tasks, each dataset is split evenly into training and test splits. For OS, problems that could not be solved by either the expert models or the GPT models are excluded beforehand to reduce computation cost. The segmentation splits follow the setups quoted under Open Datasets above: 10% of the combined COD10K+CAMO training images are held out for validation, Polyp Segmentation uses a 9:1 train/test ratio with 20% of the training set for validation, Skin Lesion Segmentation uses the ISIC 2017 2000/150/600 split, and Leaf Segmentation uses an 80/20 train/validation split of the 498 training images.
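A minimal helper mirroring the quoted even train/test split for the LLM tasks might look like the following; the `even_split` name and the seeded shuffle are assumptions for illustration, not the authors' code:

```python
import random

def even_split(problems, seed=0):
    """Split a task's dataset evenly (50/50) into train and test halves,
    shuffling first so the halves are drawn at random but reproducibly."""
    items = list(problems)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]
```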
Hardware Specification Yes We used NVIDIA H100 GPUs for our experiments. The gradient fine-tuning method (model #6 in Table 1) took approximately 200 GPU hours, and our method (model #11 in Table 1) took about 410 GPU hours (excluding the expert models' training time).
Software Dependencies Yes We utilize llm-recipes (Fujii et al., 2024) (commit 606cdfb) for fine-tuning. We adopt the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9 and β2 = 0.95.
Experiment Setup Yes Hyper-parameters: We use the performances of the three experts to determine the lower and upper bounds of the BC dimension. Specifically, the lower bound is set at 85% of the performance achieved by the least proficient expert, while the upper bound is set at 115% of the performance achieved by the most proficient expert. All BCs are then evenly divided into 15 bins between the lower and upper bounds. We limit the number of models in each bin to one, and run CycleQD for 1200 generations. Since the quality and BCs are alternated in each generation, this is equivalent to optimizing for the three skills for 400 generations each. See more detail in Appendix A.1.3. We set αlow = 0.5 and αhigh = 0.8 in elite sampling, µ = 1.0 and σ = 0.03 in model-merging-based crossover, and wmax = 0.3 in our SVD-based mutations. We utilize llm-recipes (Fujii et al., 2024) (commit 606cdfb) for fine-tuning. We adopt the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9 and β2 = 0.95. A global batch size of 64 is used across all fine-tuning processes. We employ cosine learning rate scheduling with a range of [4×10⁻⁶, 2×10⁻⁵], starting with a linear warmup for the first 10% of the total training steps. The OS and DB experts are trained for 1 epoch, while the code model is trained for 3 epochs due to its larger training data size.
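The BC binning (lower bound 85% of the weakest expert, upper bound 115% of the strongest, 15 bins) and the cosine schedule with warmup described above can be sketched as follows. The function names are hypothetical, and the decay direction (warm up to 2×10⁻⁵, then decay toward 4×10⁻⁶) is an assumption, since the quote only states the range:

```python
import math

def bc_bin_edges(expert_scores, n_bins=15, low_frac=0.85, high_frac=1.15):
    """Bin edges for one BC dimension, per the quoted setup: 15 even bins
    between 85% of the worst expert and 115% of the best expert."""
    lo = low_frac * min(expert_scores)
    hi = high_frac * max(expert_scores)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins + 1)]  # n_bins + 1 edges

def lr_at(step, total_steps, lr_min=4e-6, lr_max=2e-5, warmup_frac=0.10):
    """Cosine schedule over [lr_min, lr_max] with linear warmup for the
    first 10% of training steps (assumed to warm up toward the peak)."""
    warmup = max(int(warmup_frac * total_steps), 1)
    if step < warmup:
        return lr_max * step / warmup  # linear warmup to the peak LR
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

A score is then mapped to its bin by finding the edge interval it falls into, e.g. with `bisect.bisect_right(edges, score) - 1`.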