Can Textual Gradient Work in Federated Learning?

Authors: Minghui Chen, Ruinan Jin, Wenlong Deng, Yuanyuan Chen, Zhi Huang, Han Yu, Xiaoxiao Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our contributions are fourfold. Firstly, we introduce a novel FL paradigm, Federated Textual Gradient (FedTextGrad)... Secondly, building on this design, we conduct extensive experiments to explore the feasibility of federated textual gradients. Our findings highlight the importance of properly tuning key factors (e.g., local steps) in FL training to effectively integrate textual gradients. Thirdly, we highlight a major challenge... Last but not least, in response to this issue, we improve the vanilla variant of FedTextGrad... Through this principled study, we enable the adoption of textual gradients in FL for optimizing LLMs, identify important issues, and pinpoint future directions. Our code is available at https://github.com/ubc-tea/FedTextGrad. (Section 3, Experimental Investigation)
Researcher Affiliation | Academia | 1. The University of British Columbia; 2. Vector Institute; 3. Nanyang Technological University; 4. University of Pennsylvania
Pseudocode | Yes | Algorithm 1: Algorithm of FedTextGrad. Input: N clients indexed by i; B: local minibatch size; C: client sampling rate; T: number of rounds. Output: updated prompts P
Open Source Code | Yes | Our code is available at https://github.com/ubc-tea/FedTextGrad.
Open Datasets | Yes | We evaluate FedTextGrad on prompt optimization across three key tasks from the BBH benchmark (Srivastava et al., 2022): 1) BBH Object Counting, 2) BBH Multistep Arithmetic, and 3) GSM8k Math Problem (Cobbe et al., 2021). They are well-suited for assessing the effectiveness of prompt optimization in complex reasoning scenarios. For each dataset, we split it into training, validation, and test sets. We adopt the dataset preprocessing methodology outlined in (Yuksekgonul et al., 2024).
Dataset Splits | No | For each dataset, we split it into training, validation, and test sets. We adopt the dataset preprocessing methodology outlined in (Yuksekgonul et al., 2024). The training set is used for prompt optimization. The validation set is used for prompt selection and hyper-parameter tuning. The test set is used for reporting the final performance, thereby ensuring fair and rigorous evaluation. Under homogeneous FL settings, each dataset is randomly split into 3 clients, each having an equal number of training and validation samples.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. It mentions the LLM models used (e.g., Llama-3.1-8B, GPT-4) but not the specific computing resources or hardware specifications on which these models were run or accessed.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | Unless otherwise specified, we use a default batch size of 3 with 3 local steps for tuned hyper-parameters, with batches sampled randomly with replacement. After each iteration, the same batch is evaluated in a loop. The prompt is updated only if the performance does not drop compared to the previous non-updated version. Under homogeneous FL settings, each dataset is randomly split into 3 clients, each having an equal number of training and validation samples.
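The local-update rule quoted above (a few textual-refinement steps per round, where a candidate prompt is accepted only if performance does not drop on the same batch, followed by server-side prompt aggregation) can be sketched as below. Note this is a minimal illustration, not the paper's implementation: `evaluate` and `refine` are toy stand-ins for LLM-based evaluation and textual-gradient refinement, and the deduplicating concatenation is one assumed aggregation choice.

```python
def fed_textgrad_round(prompts, client_batches, evaluate, refine, local_steps=3):
    """One illustrative communication round of a FedTextGrad-style loop.

    prompts        : current prompt per client
    client_batches : list (per client) of minibatches sampled with replacement
    evaluate(p, b) : stand-in scorer for prompt p on batch b (higher is better)
    refine(p, b)   : stand-in for a textual-gradient prompt update
    """
    updated = []
    for prompt, batches in zip(prompts, client_batches):
        for batch in batches[:local_steps]:
            candidate = refine(prompt, batch)
            # Accept the candidate only if performance does not drop
            # on the same batch (the accept-if-no-drop rule above).
            if evaluate(candidate, batch) >= evaluate(prompt, batch):
                prompt = candidate
        updated.append(prompt)
    # Server aggregation: naive deduplicating concatenation of client
    # prompts (an assumption; summarization-based aggregation is another
    # option discussed in the paper).
    return " ".join(dict.fromkeys(updated))
```

With toy stand-ins (e.g., `refine` appending a token and `evaluate` rewarding longer prompts), each client accepts all `local_steps` refinements and the server merges the identical results.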
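The homogeneous FL setting quoted above (each dataset randomly split across 3 clients with equal sample counts) can be sketched as follows; the function name and seeded shuffle are illustrative assumptions, not the paper's code.

```python
import random

def homogeneous_client_split(dataset, num_clients=3, seed=0):
    """Randomly partition a dataset into equal-sized client shards,
    mirroring the homogeneous 3-client setting described above.
    Any remainder samples beyond an even split are dropped."""
    data = list(dataset)
    random.Random(seed).shuffle(data)  # seeded for reproducibility
    shard = len(data) // num_clients
    return [data[i * shard:(i + 1) * shard] for i in range(num_clients)]
```

The same helper would be applied separately to the training and validation sets so that every client holds an equal number of each.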