FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning
Authors: Ganyu Wang, Jinjie Fang, Maxwell Juncheng Yin, Bin Gu, Xi Chen, Boyu Wang, Yi Chang, Charles Ling
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results. |
| Researcher Affiliation | Academia | 1Western University, London, Ontario, Canada 2Jilin University, Changchun, Jilin, China 3McGill University, Montreal, Quebec, Canada 4Vector Institute, Toronto, Ontario, Canada. Correspondence to: Bin Gu <EMAIL>, Charles Ling <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 outlines the Fed-BDPL framework, which integrates federated averaging with local client training via Gumbel-Softmax-BDPL (GS-BDPL). |
| Open Source Code | Yes | The implementation is available at: https://github.com/GanyuWang/FedOne-BDPL. |
| Open Datasets | Yes | To illustrate the intuition behind FedOne, we began with a toy experiment examining the trade-off between query efficiency and the number of activated clients K in a federated learning setting using the MNIST dataset (LeCun et al., 2010). The dataset is evenly distributed across 100 clients. For our main experiments, we used the GLUE benchmark (Wang et al., 2018), which covers a wide range of tasks: MNLI (Williams et al., 2018), QQP (Iyer et al., 2017), SST-2 (Socher et al., 2013), MRPC (Dolan & Brockett, 2005), CoLA (Warstadt et al., 2019), QNLI (Wang et al., 2018), and RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). |
| Dataset Splits | Yes | Table 5: statistics and metrics of the seven GLUE datasets (\|L\|: number of classes for classification tasks). MNLI: \|L\|=3, train 393K, dev 9.8K, test 9.8K, NLI, acc., fiction/reports. QQP: \|L\|=2, train 364K, dev 40K, test 391K, paraphrase, F1, Quora. SST-2: \|L\|=2, train 6.7K, dev 872, test 1.8K, sentiment, acc., movie reviews. MRPC: \|L\|=2, train 3.7K, dev 408, test 1.7K, paraphrase, F1, news. CoLA: \|L\|=2, train 8.6K, dev 1K, test 1K, acceptability, Matthews corr., books/articles. QNLI: \|L\|=2, train 105K, dev 5.5K, test 5.5K, NLI, acc., Wikipedia. RTE: \|L\|=2, train 2.5K, dev 277, test 3K, NLI, acc., news/Wikipedia. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. It only refers to 'GPUs' in a general discussion about computational resources. |
| Software Dependencies | No | The paper mentions 'RoBERTa-large' as the model architecture, 'AdamW' as the optimization algorithm, and the 'OpenAI API' for GPT-3.5 Turbo. However, it does not provide specific version numbers for any programming languages, libraries, or other software components. |
| Experiment Setup | Yes | We train for 2 epochs with a learning rate of 0.01 and a batch size of 32, varying the number of active clients K ∈ {1, 5, 10, 20, 40}. The model for each client is a Multilayer Perceptron (MLP): a flattening input layer, a fully connected layer with 512 neurons and ReLU activation, a dropout layer with a 0.2 dropout rate, and a final fully connected layer that outputs to 10 classes via a Softmax function. For the training procedure, we conducted a hyperparameter tuning phase using grid search over learning rates of [3e-4, 1e-4, 3e-5, 1e-5]. The batch size was set at 32, and the optimization algorithm employed was AdamW (Loshchilov & Hutter, 2017). For every client, the population size of CMA-ES is set to 20, and the dimension of the low-dimensional vector is set to 500, as recommended by (Sun et al., 2022). |
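The two building blocks named in the Pseudocode row, federated averaging and Gumbel-Softmax sampling, can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the uniform aggregation weights, and the single-vector parameter layout are assumptions made for the sketch.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Relaxed (differentiable) categorical sample via the Gumbel-Softmax trick."""
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    # Softmax of the perturbed logits; lower tau -> closer to one-hot
    e = np.exp(y - y.max())
    return e / e.sum()

def fedavg(client_params, weights=None):
    """FedAvg aggregation: weighted average of client parameter vectors."""
    params = np.stack(client_params)
    if weights is None:
        weights = np.full(len(client_params), 1.0 / len(client_params))
    return weights @ params
```

In a Fed-BDPL-style round, each activated client would update its local prompt-token logits with GS-BDPL and the server would aggregate them with `fedavg`.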
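The toy-experiment client model in the Experiment Setup row (flatten, FC-512 with ReLU, dropout 0.2, FC-10 with softmax) admits a short numpy sketch; the weight initialization and the inverted-dropout scaling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Weights for a 784 -> 512 -> 10 MLP (MNIST images are 28x28 = 784 pixels)
w1 = rng.normal(0.0, 0.01, size=(784, 512)); b1 = np.zeros(512)
w2 = rng.normal(0.0, 0.01, size=(512, 10)); b2 = np.zeros(10)

def mlp_forward(x, train=False, dropout_rate=0.2):
    """Forward pass: flatten -> FC(512) + ReLU -> dropout -> FC(10) + softmax."""
    h = np.maximum(x.reshape(x.shape[0], -1) @ w1 + b1, 0.0)
    if train:
        # Inverted dropout: zero 20% of activations, rescale the survivors
        mask = rng.uniform(size=h.shape) >= dropout_rate
        h = h * mask / (1.0 - dropout_rate)
    z = h @ w2 + b2
    # Row-wise softmax over the 10 classes
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Each of the 100 clients would train a copy of this model locally for 2 epochs at batch size 32 before aggregation.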