Task Arithmetic Through The Lens Of One-Shot Federated Learning

Authors: Zhixu Tao, Ian Mason, Sanjeev Kulkarni, Xavier Boix

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally Show That Federated Learning Algorithms Often Improve Task Arithmetic: We conduct experiments on the vision-language model CLIP (Radford et al., 2021) for image classification tasks in Section 6, and on the large language model Llama2 (Touvron et al., 2023) for instruction following, mathematical reasoning and code generation in Section 7. Our experiments confirm that adapting Federated Learning algorithms often improves the merged model's performance compared to the original Task Arithmetic approach.
Researcher Affiliation | Collaboration | Zhixu Silvia Tao (EMAIL), Operations Research and Financial Engineering, Princeton University; Ian Mason (EMAIL), Fujitsu Research of America
Pseudocode | No | The paper describes algorithms like FedNova, FedGMA, Median, and CCLIP, but it presents their modifications to Task Arithmetic using mathematical equations and descriptive text rather than structured pseudocode or algorithm blocks.
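Since the paper gives no pseudocode, the following is a minimal sketch of Task Arithmetic itself and one way a FedNova-style step-count normalization could modify it. The function names, the flat NumPy weight vectors, and the simplified normalization are assumptions of this sketch, not the paper's actual equations.

```python
import numpy as np

def task_vectors(pretrained, finetuned_models):
    """One task vector per task: fine-tuned weights minus pre-trained weights."""
    return [ft - pretrained for ft in finetuned_models]

def task_arithmetic(pretrained, task_vecs, lam):
    """Original Task Arithmetic: add the lambda-scaled sum of task vectors."""
    return pretrained + lam * np.sum(task_vecs, axis=0)

def fednova_task_arithmetic(pretrained, task_vecs, local_steps, lam):
    """FedNova-flavoured merge (simplified sketch): normalize each task
    vector by its number of local fine-tuning steps, then rescale by the
    average step count, so tasks fine-tuned longer do not dominate."""
    normalized = [tv / s for tv, s in zip(task_vecs, local_steps)]
    return pretrained + lam * np.mean(local_steps) * np.sum(normalized, axis=0)
```

With equal local step counts the normalized variant reduces to plain Task Arithmetic; any difference between the two appears only when tasks are fine-tuned for different numbers of steps.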
Open Source Code | Yes | Our code is available at: https://github.com/SilviaTao/one_shot_fl_task_arithmetic
Open Datasets | Yes | We use CLIP-ViT-B-32 (Radford et al., 2021) as the pre-trained model and eight datasets: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011)
Dataset Splits | Yes | To conduct a hyperparameter search, we randomly split 5% of GSM8K, MATH, HumanEval and AlpacaEval into validation datasets. To determine the optimal scaling coefficient λ, we searched the range [0.05, 0.1, 0.15, ..., 1.95, 2.0], selecting the value that produces the highest average normalized accuracy on the validation datasets.
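The 5% hold-out described above can be sketched as a seeded random split. The function name and the fixed seed are assumptions of this sketch, not details taken from the paper.

```python
import random

def split_validation(examples, frac=0.05, seed=0):
    """Randomly hold out a fraction of a dataset as validation, as done
    with 5% of GSM8K, MATH, HumanEval and AlpacaEval."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_val = max(1, int(len(examples) * frac))
    val = [examples[i] for i in idx[:n_val]]
    train = [examples[i] for i in idx[n_val:]]
    return train, val
```

Seeding keeps the split reproducible across runs, which matters when the same validation set is reused for every λ candidate.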
Hardware Specification | Yes | All experiments on CLIP were conducted on a single NVIDIA V100 GPU. All experiments on Llama2 were conducted on four NVIDIA V100 GPUs.
Software Dependencies | No | The paper refers to specific models like CLIP-ViT-B-32 and Llama2 and mentions frameworks like Hugging Face, but it does not explicitly list the versions of the programming languages or libraries (e.g., Python, PyTorch, TensorFlow) used in the implementation.
Experiment Setup | Yes | For each dataset, we fine-tuned ViT-B-32 using three different learning rates {1e-4, 1e-5, 1e-6} and four different numbers of epochs, resulting in a total of 12 distinct hyperparameter configurations per dataset. The selected numbers of epochs were chosen to roughly correspond to training for 1000, 2000, 3000 and 4000 iterations, assuming a batch size of 128 for each dataset. To determine the optimal scaling coefficient λ, we searched the range [0.05, 0.1, 0.15, ..., 1.95, 2.0], selecting the value that produces the highest average normalized accuracy on the validation datasets.
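The λ search described above amounts to a one-dimensional grid search. A minimal sketch, assuming hypothetical `merge_fn` (builds the merged model for a given λ) and `eval_fn` (returns per-task normalized validation accuracies) callables:

```python
import numpy as np

def select_lambda(merge_fn, eval_fn, lam_grid):
    """Grid-search the scaling coefficient lambda, keeping the value with
    the highest average normalized validation accuracy."""
    best_lam, best_acc = None, float("-inf")
    for lam in lam_grid:
        avg = float(np.mean(eval_fn(merge_fn(lam))))
        if avg > best_acc:
            best_lam, best_acc = lam, avg
    return best_lam, best_acc

# The paper's search grid: 0.05 to 2.0 in steps of 0.05 (40 candidates).
lam_grid = np.arange(0.05, 2.0 + 1e-9, 0.05)
```

Because the merged model must be re-evaluated on every task for each candidate, the cost of this search scales with the grid size times the number of validation sets.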