Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models

Authors: Yao Shu, Wenyang Hu, See-Kiong Ng, Bryan Kian Hsiang Low, Fei Yu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the efficacy of Ferret, following the practice in FedKSeed (Qin et al., 2024). We primarily compare Ferret with other federated full-parameter tuning baselines, including both zeroth-order methods (e.g., FedZO (Fang et al., 2022) and FedKSeed (Qin et al., 2024)) and first-order methods (e.g., FedAvg (McMahan et al., 2017)). Our evaluations use DataJuicer-1.3B (Chen et al., 2023) and LLaMA-3B (Touvron et al., 2023a) on the Natural Instructions (Wang et al., 2022) and Dolly-15K (Conover et al., 2023) datasets, as well as larger models (i.e., LLaMA2-7B and LLaMA2-13B (Touvron et al., 2023b)) on Code Alpaca (Chaudhary, 2023) and GSM8K (Cobbe et al., 2021). ... Figure 1: Performance comparison of various federated full-parameter tuning algorithms on the Natural Instructions dataset with LLaMA-3B. Our Ferret shows significantly improved scalability, with a 6.0× reduction in computational cost and 3.3× fewer convergence rounds than FedKSeed, alongside a 10^6× reduction in communication overhead compared to FedAvg, while achieving a comparable test score. ... 5.4. Ablation Studies. Convergence and Generalization of Ferret under Varying K. In Fig. 3, we present the convergence and generalization of Ferret under varying K on the Natural Instructions dataset with DataJuicer-1.3B, using the same experimental setup as described in Appx. C.1.
Researcher Affiliation | Collaboration | Yao Shu*1, Wenyang Hu*2,3, See-Kiong Ng3, Bryan Kian Hsiang Low3, Fei Yu4. *Equal contribution. 1Hong Kong University of Science and Technology (Guangzhou), 2SAP, 3National University of Singapore, 4Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ). Correspondence to: Yao Shu <EMAIL>.
Pseudocode | Yes | To answer this question, we introduce Ferret, federated full-parameter tuning at scale for LLMs, in Algo. 1. We present an overview of the Ferret algorithm in Sec. 3.1, followed by a detailed explanation of its key techniques in Sec. 3.2.

Algorithm 1 Ferret
Input: w_0, N, R, T, K, η
 1: for each round r ∈ [R] do
 2:   for each client j ∈ [N] in parallel do
 3:     if r > 1 then  // Step 1: Global Aggregation
 4:       Receive {s^(i)}_{i=1}^N and {γ_k^(i)}_{i=1,k=1}^{N,K}
 5:       Generate bases {v_k^(i)}_{i=1,k=1}^{N,K} using {s^(i)}_{i=1}^N
 6:       w_{r-1} ← w_{r-2} − Σ_{i∈[N]} Σ_{k=1}^K γ_k^(i) v_k^(i) / N
 7:     w_{r,0}^(j) ← w_{r-1}
 8:     for t ∈ [T] do  // Step 2: Local Updates
 9:       w_{r,t}^(j) ← w_{r,t-1}^(j) − η ∇ℓ(w_{r,t-1}^(j); x_{r,t-1}^(j))
    // Step 3: Projected Updates
10:     Randomly set s^(j) and generate bases {v_k^(j)}_{k=1}^K
11:     Δ_r^(j) ← w_{r-1}^(j) − w_r^(j), compute {γ_k^(j)}_{k=1}^K with (6)
12:     Send s^(j) and {γ_k^(j)}_{k=1}^K to the central server
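The seed-and-coefficient exchange at the heart of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the function names are invented, and the plain inner-product coefficients with an averaged reconstruction stand in for the paper's Eq. (6).

```python
import numpy as np

def generate_bases(seed, K, d):
    # Client and server regenerate the same K random bases from a shared
    # integer seed, so only the seed is ever transmitted, never the bases.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((K, d))

def project_update(delta, bases):
    # Per-basis coefficients gamma_k = <delta, v_k>; a simplified stand-in
    # for Eq. (6) in the paper.
    return bases @ delta              # shape (K,)

def reconstruct_update(seed, gammas, K, d):
    # Server side: rebuild the bases from the seed and form the unbiased
    # estimate (1/K) * sum_k gamma_k v_k of the client's update.
    bases = generate_bases(seed, K, d)
    return (gammas @ bases) / K       # shape (d,)

# One client's contribution in a round: delta = w_old - w_new is compressed
# to (seed, gammas) -- K scalars instead of d full parameters.
d, K = 8, 50_000
rng = np.random.default_rng(0)
delta = rng.standard_normal(d)
seed = 123
gammas = project_update(delta, generate_bases(seed, K, d))
estimate = reconstruct_update(seed, gammas, K, d)
```

Because the bases are standard Gaussian, the averaged reconstruction is an unbiased estimate of `delta`, and its variance shrinks as K grows; the server only ever needs `(seed, gammas)` from each client.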
Open Source Code | Yes | Our implementation is available at https://github.com/allen4747/Ferret.
Open Datasets | Yes | Our evaluations use DataJuicer-1.3B (Chen et al., 2023) and LLaMA-3B (Touvron et al., 2023a) on the Natural Instructions (Wang et al., 2022) and Dolly-15K (Conover et al., 2023) datasets, as well as larger models (i.e., LLaMA2-7B and LLaMA2-13B (Touvron et al., 2023b)) on Code Alpaca (Chaudhary, 2023) and GSM8K (Cobbe et al., 2021).
Dataset Splits | Yes | For the NI dataset, we allocated 738 training tasks to individual clients for local updates and reserved 119 test tasks for global evaluation, reflecting a non-IID distribution. Meanwhile, for the Dolly-15K dataset, the final task was utilized for global evaluation, while the remaining tasks were distributed among 200 clients with varying levels of label distribution skew. ... To further demonstrate that Ferret can also improve the capability of larger LLMs for code generation and mathematical reasoning, we conducted more experiments using the Code Alpaca (Chaudhary, 2023) and GSM8K (Cobbe et al., 2021) datasets, following a similar federated setup. The Code Alpaca dataset (of around 8.0k samples) is a code dataset that consists of ten programming languages, including C, C#, C++, Go, Java, PHP, Pascal, Python, Scala, and X86-64 Assembly. We exclude the X86-64 Assembly data due to limited samples in the dataset. We uniformly randomly sampled 10% of instances from the original data as the hold-out test set for evaluation, split the remaining 90% of samples into nine subsets based on the programming language category, and assigned each subset to one client as its local training data. For GSM8K, its official train set is split into three subsets, where each client's dataset consists of grade school math questions randomly partitioned from the original dataset, forming an IID distribution. We use the official GSM8K test split as the evaluation dataset.
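The Code Alpaca partition described in the quote above (a uniform 10% hold-out test set, then one client per remaining programming language) can be sketched as follows. This is an illustrative reconstruction on synthetic records, not the authors' preprocessing code; the record schema and function name are assumptions.

```python
import random
from collections import defaultdict

# The nine languages kept after dropping X86-64 Assembly.
LANGUAGES = ["C", "C#", "C++", "Go", "Java", "PHP", "Pascal", "Python", "Scala"]

def split_code_alpaca(samples, test_frac=0.1, seed=0):
    # Hold out `test_frac` of instances uniformly at random for evaluation,
    # then group the rest by programming language, one client per language.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test_set, train = shuffled[:n_test], shuffled[n_test:]
    clients = defaultdict(list)
    for rec in train:
        clients[rec["language"]].append(rec)
    return test_set, dict(clients)

# Synthetic stand-in for the ~8.0k Code Alpaca records.
data = [{"language": LANGUAGES[i % 9], "text": f"sample {i}"} for i in range(8000)]
test_set, clients = split_code_alpaca(data)
```

Each client ends up holding a single-language shard, which is what makes this split non-IID across clients even though the test set is drawn uniformly.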
Hardware Specification | No | The paper mentions 'GPU memory footprint' in Section 5.3 and 'GPU memory constraints' in Section C.4, but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments. It refers only to memory usage rather than the hardware itself.
Software Dependencies | No | The paper mentions using specific models such as 'DataJuicer-1.3B' and 'LLaMA-3B', and provides Hugging Face model paths, but it does not specify any software dependencies (e.g., programming languages, libraries, frameworks) with version numbers that would be needed for replication.
Experiment Setup | Yes | FL Settings. In each round of federated learning, 5% of clients were randomly selected to participate. Following the same practice as FedKSeed (Qin et al., 2024), we set the total number of communication rounds to 40 for the NI dataset and 60 for Dolly-15K for all baselines. Due to the compelling efficiency of our method, we set the total number of communication rounds to 12 for the NI dataset and 20 for Dolly-15K for Ferret. However, for more complex tasks such as Code Alpaca and GSM8K, we run all algorithms, including our Ferret, for 20 rounds to ensure a fair comparison. First-order baselines trained locally for one epoch, and FedKSeed trained for 200 steps, while our Ferret algorithm trained for 10 iterations (i.e., T = 10 in Algo. 1). The K value was set to 4096 for FedKSeed. All approaches perform local updates with a batch size of 1 to reduce memory consumption. For each local update iteration in Ferret, we accumulate the gradients from 4 samples. C.1.1. Hyper-parameters. For Ferret, the local update learning rate η for each client is set to 1×10^-4, searched from [2×10^-4, 1×10^-4, 5×10^-5]. The global aggregation learning rates on Natural Instructions and Dolly-15K are set to 10.0 and 3.0, respectively, searched from [10.0, 5.0, 1.0]. C.1.2. Hyper-parameters. For FedZO and FedKSeed, the local update learning rate is set to 3×10^-7 for all models. For FedAvg on both LLaMA2-7B and LLaMA2-13B, the local update learning rate η for each client is set to 3×10^-4, and the global aggregation learning rate is set to 1.0. For Ferret on LLaMA2-7B, the local update learning rate η is set to 3×10^-4 and the global aggregation learning rate is set to 5.0. For Ferret on LLaMA2-13B, the local update learning rate η is set to 5×10^-4 and the global aggregation learning rate is set to 10.0. The selected learning rate is searched from [5×10^-4, 3×10^-4, 1×10^-4] and the selected global aggregation learning rates are searched from [10.0, 5.0, 1.0].
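The per-round schedule quoted above (5% client sampling; T = 10 local iterations with batch size 1 and gradients accumulated over 4 samples) can be mocked up in a few lines. This is a toy sketch, not the paper's training code: the scalar model, `per_sample_grad` callback, and function names are all hypothetical placeholders.

```python
import random

def sample_clients(num_clients, frac=0.05, seed=None):
    # Each round, 5% of clients participate, chosen uniformly at random.
    rng = random.Random(seed)
    k = max(1, int(num_clients * frac))
    return rng.sample(range(num_clients), k)

def local_update(w, per_sample_grad, data, lr=1e-4, iters=10, accum=4):
    # Ferret-style local loop: `iters` = T gradient steps; each step
    # averages gradients accumulated over `accum` samples processed one
    # at a time (batch size 1).
    idx = 0
    for _ in range(iters):
        g = 0.0
        for _ in range(accum):
            g += per_sample_grad(w, data[idx % len(data)])
            idx += 1
        w = w - lr * (g / accum)
    return w

# Toy check on a 1-D quadratic loss l(w; x) = (w - x)^2.
grad = lambda w, x: 2.0 * (w - x)
participants = sample_clients(200, seed=0)        # 10 of 200 clients
w_final = local_update(1.0, grad, [0.0], lr=0.1)  # shrinks w toward 0
```

Gradient accumulation here trades memory for wall-clock time: the effective batch of 4 never has to fit in memory at once, which matches the paper's stated motivation of reducing memory consumption with batch size 1.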