Can Textual Gradient Work in Federated Learning?

Authors: Minghui Chen, Ruinan Jin, Wenlong Deng, Yuanyuan Chen, Zhi Huang, Han Yu, Xiaoxiao Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our contributions are fourfold. Firstly, we introduce a novel FL paradigm, Federated Textual Gradient (FedTextGrad)... Secondly, building on this design, we conduct extensive experiments to explore the feasibility of federated textual gradients. Our findings highlight the importance of properly tuning key factors (e.g., local steps) in FL training to effectively integrate textual gradients. Thirdly, we highlight a major challenge... Last but not least, in response to this issue, we improve the vanilla variant of FedTextGrad... Through this principled study, we enable the adoption of textual gradients in FL for optimizing LLMs, identify important issues, and pinpoint future directions. Our code is available at https://github.com/ubc-tea/FedTextGrad. (Section 3, Experimental Investigation)
Researcher Affiliation | Academia | 1. The University of British Columbia; 2. Vector Institute; 3. Nanyang Technological University; 4. University of Pennsylvania
Pseudocode | Yes | Algorithm 1: Algorithm of FedTextGrad. Input: N clients indexed by i; B: local minibatch size; C: client sampling rate; T: number of rounds. Output: updated prompts P
Open Source Code | Yes | Our code is available at https://github.com/ubc-tea/FedTextGrad.
Open Datasets | Yes | We evaluate FedTextGrad on prompt optimization across three key tasks from the BBH benchmark (Srivastava et al., 2022): 1) BBH Object Counting, 2) BBH Multistep Arithmetic, and 3) GSM8k Math Problem (Cobbe et al., 2021). They are well-suited for assessing the effectiveness of prompt optimization in complex reasoning scenarios. For each dataset, we split it into training, validation, and test sets. We adopt the dataset preprocessing methodology outlined in (Yuksekgonul et al., 2024).
Dataset Splits | No | For each dataset, we split it into training, validation, and test sets. We adopt the dataset preprocessing methodology outlined in (Yuksekgonul et al., 2024). The training set is used for prompt optimization. The validation set is used for prompt selection and hyper-parameter tuning. The test set is used for reporting the final performance, thereby ensuring fair and rigorous evaluation. Under homogeneous FL settings, each dataset is randomly split into 3 clients, each having an equal number of training and validation samples.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. It mentions the LLM models used (e.g., Llama-3.1-8B, GPT-4) but not the specific computing resources or hardware specifications on which these models were run or accessed.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | Unless otherwise specified, we use a default batch size of 3 with 3 local steps for tuned hyper-parameters, with batches sampled randomly with replacement. After each iteration, the same batch is evaluated in a loop. The prompt is updated only if the performance does not drop compared to the previous non-updated version. Under homogeneous FL settings, each dataset is randomly split into 3 clients, each having an equal number of training and validation samples.
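The local-update rule quoted above (a few textual-refinement steps per round, where a candidate prompt is accepted only if performance does not drop on the same batch, followed by server-side prompt aggregation) can be sketched as below. Note this is a minimal illustration, not the paper's implementation: `evaluate` and `refine` are toy stand-ins for LLM-based evaluation and textual-gradient refinement, and the deduplicating concatenation is one assumed aggregation choice.

```python
def fed_textgrad_round(prompts, client_batches, evaluate, refine, local_steps=3):
    """One illustrative communication round of a FedTextGrad-style loop.

    prompts        : current prompt per client
    client_batches : list (per client) of minibatches sampled with replacement
    evaluate(p, b) : stand-in scorer for prompt p on batch b (higher is better)
    refine(p, b)   : stand-in for a textual-gradient prompt update
    """
    updated = []
    for prompt, batches in zip(prompts, client_batches):
        for batch in batches[:local_steps]:
            candidate = refine(prompt, batch)
            # Accept the candidate only if performance does not drop
            # on the same batch (the accept-if-no-drop rule above).
            if evaluate(candidate, batch) >= evaluate(prompt, batch):
                prompt = candidate
        updated.append(prompt)
    # Server aggregation: naive deduplicating concatenation of client
    # prompts (an assumption; summarization-based aggregation is another
    # option discussed in the paper).
    return " ".join(dict.fromkeys(updated))
```

With toy stand-ins (e.g., `refine` appending a token and `evaluate` rewarding longer prompts), each client accepts all `local_steps` refinements and the server merges the identical results.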
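The homogeneous FL setting quoted above (each dataset randomly split across 3 clients with equal sample counts) can be sketched as follows; the function name and seeded shuffle are illustrative assumptions, not the paper's code.

```python
import random

def homogeneous_client_split(dataset, num_clients=3, seed=0):
    """Randomly partition a dataset into equal-sized client shards,
    mirroring the homogeneous 3-client setting described above.
    Any remainder samples beyond an even split are dropped."""
    data = list(dataset)
    random.Random(seed).shuffle(data)  # seeded for reproducibility
    shard = len(data) // num_clients
    return [data[i * shard:(i + 1) * shard] for i in range(num_clients)]
```

The same helper would be applied separately to the training and validation sets so that every client holds an equal number of each.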