Collaborative Discrete-Continuous Black-Box Prompt Learning for Language Models
Authors: Hualin Zhang, Haozhen Zhang, Zhekai Liu, Bin Gu, Yi Chang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments on different datasets demonstrate significant improvements in various tasks compared to the baselines. Through extensive experiments on various datasets, we demonstrate that ZO-PoG significantly improves the performance of PTMs. |
| Researcher Affiliation | Academia | (1) Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence; (2) School of Artificial Intelligence, Jilin University; (3) School of Mathematics, Jilin University; (4) International Center of Future Science, Jilin University; (5) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE |
| Pseudocode | Yes | Algorithm 1 Black-Box Prompt Learning via Zeroth-Order and Policy Gradient Method |
| Open Source Code | Yes | Our code is available at: https://github.com/zhanghualin0/ZO-PoG. |
| Open Datasets | Yes | For performance evaluation, we chose 5 commonly utilized datasets from the GLUE benchmark (Wang et al., 2018): CoLA (Warstadt et al., 2018), MNLI (Williams et al., 2017), QNLI (Wang et al., 2019), SNLI (Bowman et al., 2015), and WNLI (Levesque et al., 2012). These datasets encompass various typical language understanding tasks such as natural language inference. ... we conducted experiments on a challenging mathematical problem-solving dataset GSM8K (Cobbe et al., 2021) in the 64-shot setting. |
| Dataset Splits | Yes | All experiments are performed under the few-shot learning setting. We assemble the training and development sets by randomly selecting m instances for each class from the original training data. ... 16-shot (per class) setting ... 64-shot setting. |
| Hardware Specification | Yes | The experiments are executed on a cluster of NVIDIA A40 GPUs. |
| Software Dependencies | No | We employ RoBERTa-large (Liu et al., 2019), GPT2-XL (Radford et al., 2019), and Llama3 (AI@Meta, 2024) as our backbone models, and all pre-trained weights are sourced directly from Hugging Face. |
| Experiment Setup | Yes | Comprehensive details of the input templates and hyperparameters used in our experiments can be found in Appendix B. ... Table 6: Main hyperparameters used in our algorithms. |
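The pseudocode entry above names a zeroth-order and policy gradient method for black-box prompt learning. As a rough illustration of the zeroth-order component only (not the paper's ZO-PoG algorithm), the sketch below shows a standard two-point zeroth-order gradient estimate applied to a toy black-box objective; the quadratic `loss_fn` is a stand-in assumption for a language model's loss, which the black-box setting cannot backpropagate through.

```python
import numpy as np

def zo_gradient(loss_fn, z, mu=1e-2, rng=None):
    """Two-point zeroth-order gradient estimate of a black-box loss.

    The objective is queried at z + mu*u and z - mu*u along a random
    direction u; no gradient access to the model is required.
    """
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(z.shape)
    return (loss_fn(z + mu * u) - loss_fn(z - mu * u)) / (2 * mu) * u

# Toy stand-in for the inaccessible model loss (assumption for illustration).
def loss_fn(z):
    return float(np.sum((z - 1.0) ** 2))

z = np.zeros(8)              # continuous prompt vector, initialized at zero
rng = np.random.default_rng(0)
for _ in range(400):
    z -= 0.02 * zo_gradient(loss_fn, z, rng=rng)
print(loss_fn(z))  # loss shrinks toward 0
```

Because only function evaluations are used, the same loop works when `loss_fn` wraps a remote model API, which is the point of the black-box setting the paper studies.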