Task Arithmetic Through The Lens Of One-Shot Federated Learning
Authors: Zhixu Tao, Ian Mason, Sanjeev Kulkarni, Xavier Boix
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally Show That Federated Learning Algorithms Often Improve Task Arithmetic: We conduct experiments on vision-language model CLIP (Radford et al., 2021) for image classification tasks in Section 6, and on large language model Llama2 (Touvron et al., 2023) for instruction following, mathematical reasoning and code generation in Section 7. Our experiments confirm that adapting Federated Learning algorithms often improves the merged model's performance compared to the original Task Arithmetic approach. |
| Researcher Affiliation | Collaboration | Zhixu Silvia Tao (Operations Research and Financial Engineering, Princeton University); Ian Mason (Fujitsu Research of America) |
| Pseudocode | No | The paper describes algorithms like FedNova, FedGMA, Median, and CCLIP, but it presents their modifications to Task Arithmetic using mathematical equations and descriptive text, rather than structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/SilviaTao/one_shot_fl_task_arithmetic |
| Open Datasets | Yes | We use CLIP-ViT-B-32 (Radford et al., 2021) as the pre-trained model and eight datasets: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011) |
| Dataset Splits | Yes | In order to conduct hyperparameter search, we randomly split 5% of GSM8K, MATH, HumanEval and AlpacaEval into validation datasets. To determine the optimal scaling coefficient λ, we searched the range [0.05, 0.1, 0.15, . . . , 1.95, 2.0], selecting the value that produces the highest average normalized accuracy in the validation datasets. |
| Hardware Specification | Yes | All experiments on CLIP were conducted on a single NVIDIA V100 GPU. All experiments in this part were conducted on four NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper refers to using specific models like CLIP-ViT-B-32 and Llama2, and mentions frameworks like Hugging Face, but does not explicitly list the specific versions of programming languages or libraries (e.g., Python, PyTorch, TensorFlow) used in their implementation. |
| Experiment Setup | Yes | For each dataset, we fine-tuned ViT-B-32 using three different learning rates {1e-4, 1e-5, 1e-6} and four different numbers of epochs, resulting in a total of 12 distinct hyperparameter configurations per dataset. The selected numbers of epochs were chosen to roughly correspond to training for 1000, 2000, 3000 and 4000 iterations, assuming a batch size of 128 for each dataset. To determine the optimal scaling coefficient λ, we searched the range [0.05, 0.1, 0.15, . . . , 1.95, 2.0], selecting the value that produces the highest average normalized accuracy in the validation data sets. |
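The merging procedure the table describes (standard Task Arithmetic plus the paper's grid search over the scaling coefficient λ) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `task_arithmetic`, `select_lambda`, and the toy `eval_fn` are hypothetical names, and real use would evaluate validation accuracy of a full model rather than a scalar score on weight vectors.

```python
import numpy as np

def task_arithmetic(pretrained, finetuned, lam):
    """Merge fine-tuned weights: theta = theta_0 + lam * sum_i (theta_i - theta_0)."""
    task_vectors = [theta - pretrained for theta in finetuned]
    return pretrained + lam * np.sum(task_vectors, axis=0)

def select_lambda(pretrained, finetuned, eval_fn):
    """Grid-search lam over [0.05, 0.10, ..., 2.0] (the range reported in the
    paper) and keep the value with the best validation score."""
    grid = np.arange(0.05, 2.0 + 1e-9, 0.05)
    scores = {lam: eval_fn(task_arithmetic(pretrained, finetuned, lam))
              for lam in grid}
    return max(scores, key=scores.get)

# Toy usage: two "fine-tuned" models whose merged weights should land at 1.0,
# so the best scaling coefficient is 0.5.
theta0 = np.zeros(4)
models = [np.ones(4), np.ones(4)]
score = lambda theta: -float(np.sum((theta - 1.0) ** 2))  # stand-in for val accuracy
best = select_lambda(theta0, models, score)
print(round(best, 2))  # 0.5
```

In practice `eval_fn` would load the merged state dict into the model and return average normalized accuracy over the validation splits, which is what the λ search in the table optimizes.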
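The Pseudocode row notes that the paper adapts one-shot Federated Learning aggregators (FedNova, FedGMA, Median, CCLIP) to Task Arithmetic via equations rather than algorithm blocks. As a hedged sketch of the general idea only, here is what a robust coordinate-wise Median aggregation of task vectors could look like; `median_task_arithmetic` is a hypothetical name, and the paper's exact formulations are given by its equations, not reproduced here.

```python
import numpy as np

def median_task_arithmetic(pretrained, finetuned, lam):
    """Illustrative sketch: aggregate task vectors with a coordinate-wise
    median (a robust one-shot FL aggregator) instead of a plain sum."""
    task_vectors = np.stack([theta - pretrained for theta in finetuned])
    return pretrained + lam * np.median(task_vectors, axis=0)

# With one outlier "client", the median ignores it where a sum would not.
theta0 = np.zeros(3)
models = [np.ones(3), np.ones(3), 100.0 * np.ones(3)]
merged = median_task_arithmetic(theta0, models, lam=1.0)
print(merged)  # [1. 1. 1.]
```

The design point is robustness: a single badly fine-tuned model shifts a summed merge by its full task vector, while a coordinate-wise median bounds its influence.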