Black-Box Prompt Learning for Pre-trained Language Models

Authors: Shizhe Diao, Zhichao Huang, Ruijia Xu, Xuechun Li, LIN Yong, Xiao Zhou, Tong Zhang

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on RoBERTa and GPT-3 demonstrate that the proposed algorithm achieves significant improvement on eight benchmarks in a cloud-device collaboration manner." "Experimental results on two kinds of datasets, i.e., datasets without domain-shift and datasets with domain-shift, demonstrate the effectiveness of the proposed black-box discrete prompt learning, which significantly improves the performance over a generic pre-trained model and outperforms all baseline models on eleven datasets."
Researcher Affiliation | Collaboration | Shizhe Diao (The Hong Kong University of Science and Technology); Xuechun Li (University of California, San Diego); joint with Google Research.
Pseudocode | Yes | Algorithm 1: the black-box discrete optimization procedure.
Require: input batch S, label batch Y, categorical distribution parameters p_1, ..., p_n, prediction model G, loss function L.
1: for k = 1 to I do
2:   sample j_1^(k) ~ Cat(p_1), ..., j_n^(k) ~ Cat(p_n)
3:   T^(k) = [t_1^(k), ..., t_n^(k)] = [V[j_1^(k)], ..., V[j_n^(k)]]
4: end for
5: L_avg = (1/I) * sum_{k=1}^{I} L(G[T^(k), S], Y)
6: for i = 1 to n do
7:   g^vr_{p_i} = (1/(I-1)) * sum_{k=1}^{I} grad_{p_i} log P(t_i^(k)) * (L(G[T^(k), S], Y) - L_avg)
8:   p_i <- proj_C(p_i - eta * g^vr_{p_i})
9: end for
10: return p_1, ..., p_n
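The procedure above can be sketched in a few lines of NumPy: sample I prompts from per-position categorical distributions, score them with the black-box loss, and apply the variance-reduced policy-gradient update. This is a minimal illustration, not the paper's code: the model call G is replaced by a toy loss (sum of sampled token indices), the simplex projection is approximated by a clip-and-renormalize step, and n, N, I, eta are small illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, I, eta = 4, 10, 8, 0.05      # prompt length, candidate size, samples, step size
p = np.full((n, N), 1.0 / N)       # categorical parameters p_1 .. p_n (uniform init)

def loss_fn(indices):
    # Stand-in for the black-box loss L(G[T, S], Y):
    # lower when the prompt uses small token indices.
    return float(indices.sum())

for _ in range(200):
    # Lines 1-4: sample I prompts, one token index per position.
    samples = np.array([[rng.choice(N, p=p[i]) for i in range(n)]
                        for _ in range(I)])
    losses = np.array([loss_fn(s) for s in samples])
    L_avg = losses.mean()                          # line 5
    for i in range(n):
        grad = np.zeros(N)
        for k in range(I):
            j = samples[k, i]
            # grad_{p_i} log P(t_i^(k)) for a categorical is e_j / p_i[j]
            grad[j] += (losses[k] - L_avg) / p[i, j]
        grad /= I - 1                              # line 7: variance-reduced estimate
        p[i] -= eta * grad                         # line 8: gradient step
        # proj_C approximated by clipping to the simplex and renormalizing
        p[i] = np.clip(p[i], 1e-3, None)
        p[i] /= p[i].sum()
```

After the loop, each distribution p_i should put most of its mass on small indices, since those minimize the toy loss.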
Open Source Code | Yes | The code is available at https://github.com/shizhediao/Black-Box-Prompt-Learning.
Open Datasets | Yes | In order to examine the model's ability in generic classification tasks as well as domain-specific classification tasks, we include seven datasets from the GLUE benchmark (Wang et al., 2019): MNLI (Williams et al., 2018), QQP (Iyer et al., 2017), SST-2 (Socher et al., 2013), MRPC (Dolan & Brockett, 2005), CoLA (Warstadt et al., 2019), QNLI (Wang et al., 2019), RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and four domain-specific datasets: Citation Intent (Jurgens et al., 2018), SciERC (Luan et al., 2018), RCT (Dernoncourt & Lee, 2017), HyperPartisan (Kiesel et al., 2019) from specific domains including computer science, biomedical science, and news, following Gururangan et al. (2020) and Diao et al. (2021).
Dataset Splits | Yes | We follow Perez et al. (2021) to simulate a true k-shot learning setting. We randomly sample k examples from the original training set for each class to construct the training set, and another, different k examples to construct the validation set. The original validation set is used as the test set. Because the QQP and RCT validation sets are too large, we randomly sample 1K examples from each to save costs.
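The k-shot split described above can be sketched as follows: per class, draw k examples for training and k different examples for validation from the original training set, leaving the original validation set as the test set. Function and field names here are illustrative, not taken from the released code.

```python
import random
from collections import defaultdict

def k_shot_split(train_set, k, seed=42):
    """train_set: iterable of (text, label) pairs.
    Returns (few_train, few_val), each with k examples per class,
    disjoint from one another."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in train_set:
        by_label[label].append((text, label))
    few_train, few_val = [], []
    for label in sorted(by_label):
        examples = by_label[label]
        rng.shuffle(examples)
        few_train.extend(examples[:k])        # k shots for training
        few_val.extend(examples[k:2 * k])     # k different shots for validation
    return few_train, few_val
```

A fixed seed makes the sampled split reproducible across runs, which matters when reported few-shot numbers are averaged over seeds.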
Hardware Specification | Yes | For experiments on GPT-3, we directly call its APIs without any GPU for computation. Experiments on RoBERTa are conducted with NVIDIA 2080Ti GPUs with 11GB memory.
Software Dependencies | No | For RoBERTa-large experiments, we initialize it with pre-trained weights from Huggingface's Transformers library. The paper mentions Huggingface's Transformers library and AdamW, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For BDPL, we optimize the prompts with AdamW (Loshchilov & Hutter, 2019) for 30 epochs with a learning rate of 1e-4. The prompt length is 50, and the size of the candidate prompt list N is 100. Other hyper-parameters are detailed in Appendix A.3.
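The quoted setup can be summarized as a configuration fragment. The key names below are hypothetical and do not necessarily match the repository's actual command-line arguments; the values are the ones reported above.

```python
# Hypothetical BDPL configuration mirroring the reported setup.
bdpl_config = {
    "optimizer": "AdamW",        # Loshchilov & Hutter, 2019
    "epochs": 30,
    "learning_rate": 1e-4,
    "prompt_length": 50,         # number of discrete prompt tokens n
    "candidate_list_size": 100,  # size N of the candidate prompt list
}
```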