Task Arithmetic Through The Lens Of One-Shot Federated Learning

Authors: Zhixu Tao, Ian Mason, Sanjeev Kulkarni, Xavier Boix

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally Show That Federated Learning Algorithms Often Improve Task Arithmetic: We conduct experiments on the vision-language model CLIP (Radford et al., 2021) for image classification tasks in Section 6, and on the large language model Llama2 (Touvron et al., 2023) for instruction following, mathematical reasoning and code generation in Section 7. Our experiments confirm that adapting Federated Learning algorithms often improves the merged model's performance compared to the original Task Arithmetic approach.
Researcher Affiliation | Collaboration | Zhixu Silvia Tao (EMAIL), Operations Research and Financial Engineering, Princeton University; Ian Mason (EMAIL), Fujitsu Research of America
Pseudocode | No | The paper describes algorithms like FedNova, FedGMA, Median, and CCLIP, but it presents their modifications to Task Arithmetic using mathematical equations and descriptive text rather than structured pseudocode or algorithm blocks.
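Since the paper gives no pseudocode, the following is a minimal sketch of Task Arithmetic itself and one way a FedNova-style step-count normalization could modify it. The function names, the flat NumPy weight vectors, and the simplified normalization are assumptions of this sketch, not the paper's actual equations.

```python
import numpy as np

def task_vectors(pretrained, finetuned_models):
    """One task vector per task: fine-tuned weights minus pre-trained weights."""
    return [ft - pretrained for ft in finetuned_models]

def task_arithmetic(pretrained, task_vecs, lam):
    """Original Task Arithmetic: add the lambda-scaled sum of task vectors."""
    return pretrained + lam * np.sum(task_vecs, axis=0)

def fednova_task_arithmetic(pretrained, task_vecs, local_steps, lam):
    """FedNova-flavoured merge (simplified sketch): normalize each task
    vector by its number of local fine-tuning steps, then rescale by the
    average step count, so tasks fine-tuned longer do not dominate."""
    normalized = [tv / s for tv, s in zip(task_vecs, local_steps)]
    return pretrained + lam * np.mean(local_steps) * np.sum(normalized, axis=0)
```

With equal local step counts the normalized variant reduces to plain Task Arithmetic; any difference between the two appears only when tasks are fine-tuned for different numbers of steps.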
Open Source Code | Yes | Our code is available at: https://github.com/SilviaTao/one_shot_fl_task_arithmetic
Open Datasets | Yes | We use CLIP-ViT-B-32 (Radford et al., 2021) as the pre-trained model and eight datasets: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011)
Dataset Splits | Yes | To conduct a hyperparameter search, we randomly split 5% of GSM8K, MATH, HumanEval and AlpacaEval into validation datasets. To determine the optimal scaling coefficient λ, we searched the range [0.05, 0.1, 0.15, ..., 1.95, 2.0], selecting the value that produces the highest average normalized accuracy on the validation datasets.
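The 5% hold-out described above can be sketched as a seeded random split. The function name and the fixed seed are assumptions of this sketch, not details taken from the paper.

```python
import random

def split_validation(examples, frac=0.05, seed=0):
    """Randomly hold out a fraction of a dataset as validation, as done
    with 5% of GSM8K, MATH, HumanEval and AlpacaEval."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_val = max(1, int(len(examples) * frac))
    val = [examples[i] for i in idx[:n_val]]
    train = [examples[i] for i in idx[n_val:]]
    return train, val
```

Seeding keeps the split reproducible across runs, which matters when the same validation set is reused for every λ candidate.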
Hardware Specification | Yes | All experiments on CLIP were conducted on a single NVIDIA V100 GPU. All experiments on Llama2 were conducted on four NVIDIA V100 GPUs.
Software Dependencies | No | The paper refers to specific models like CLIP-ViT-B-32 and Llama2 and mentions frameworks like Hugging Face, but it does not explicitly list the versions of the programming languages or libraries (e.g., Python, PyTorch, TensorFlow) used in the implementation.
Experiment Setup | Yes | For each dataset, we fine-tuned ViT-B-32 using three different learning rates {1e-4, 1e-5, 1e-6} and four different numbers of epochs, resulting in a total of 12 distinct hyperparameter configurations per dataset. The selected numbers of epochs were chosen to roughly correspond to training for 1000, 2000, 3000 and 4000 iterations, assuming a batch size of 128 for each dataset. To determine the optimal scaling coefficient λ, we searched the range [0.05, 0.1, 0.15, ..., 1.95, 2.0], selecting the value that produces the highest average normalized accuracy on the validation datasets.
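The λ search described above amounts to a one-dimensional grid search. A minimal sketch, assuming hypothetical `merge_fn` (builds the merged model for a given λ) and `eval_fn` (returns per-task normalized validation accuracies) callables:

```python
import numpy as np

def select_lambda(merge_fn, eval_fn, lam_grid):
    """Grid-search the scaling coefficient lambda, keeping the value with
    the highest average normalized validation accuracy."""
    best_lam, best_acc = None, float("-inf")
    for lam in lam_grid:
        avg = float(np.mean(eval_fn(merge_fn(lam))))
        if avg > best_acc:
            best_lam, best_acc = lam, avg
    return best_lam, best_acc

# The paper's search grid: 0.05 to 2.0 in steps of 0.05 (40 candidates).
lam_grid = np.arange(0.05, 2.0 + 1e-9, 0.05)
```

Because the merged model must be re-evaluated on every task for each candidate, the cost of this search scales with the grid size times the number of validation sets.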