Black-Box Prompt Learning for Pre-trained Language Models

Authors: Shizhe Diao, Zhichao Huang, Ruijia Xu, Xuechun Li, LIN Yong, Xiao Zhou, Tong Zhang

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on RoBERTa and GPT-3 demonstrate that the proposed algorithm achieves significant improvement on eight benchmarks in a cloud-device collaboration manner." "Experimental results on two kinds of datasets, i.e., datasets without domain-shift and datasets with domain-shift, demonstrate the effectiveness of the proposed black-box discrete prompt learning, which significantly improves the performance over a generic pre-trained model and outperforms all baseline models on eleven datasets."
Researcher Affiliation | Collaboration | Shizhe Diao (The Hong Kong University of Science and Technology); Xuechun Li (University of California, San Diego); joint with Google Research.
Pseudocode | Yes | Algorithm 1: the black-box discrete optimization procedure.
Require: input batch S, label batch Y, categorical distribution parameters p_1, ..., p_n, prediction model G, loss function L.
1: for k = 1 to I do
2:   sample j_1^(k) ~ Cat(p_1), ..., j_n^(k) ~ Cat(p_n)
3:   T^(k) = [t_1^(k), ..., t_n^(k)] = [V[j_1^(k)], ..., V[j_n^(k)]]
4: end for
5: L_avg = (1/I) * sum_{k=1}^{I} L(G[T^(k), S], Y)
6: for i = 1 to n do
7:   g^vr_{p_i} = (1/(I-1)) * sum_{k=1}^{I} grad_{p_i} log P(t_i^(k)) * (L(G[T^(k), S], Y) - L_avg)
8:   p_i <- proj_C(p_i - eta * g^vr_{p_i})
9: end for
10: return p_1, ..., p_n
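The procedure above can be sketched in a few lines of NumPy: sample I prompts from per-position categorical distributions, score them with the black-box loss, and apply the variance-reduced policy-gradient update. This is a minimal illustration, not the paper's code: the model call G is replaced by a toy loss (sum of sampled token indices), the simplex projection is approximated by a clip-and-renormalize step, and n, N, I, eta are small illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, I, eta = 4, 10, 8, 0.05      # prompt length, candidate size, samples, step size
p = np.full((n, N), 1.0 / N)       # categorical parameters p_1 .. p_n (uniform init)

def loss_fn(indices):
    # Stand-in for the black-box loss L(G[T, S], Y):
    # lower when the prompt uses small token indices.
    return float(indices.sum())

for _ in range(200):
    # Lines 1-4: sample I prompts, one token index per position.
    samples = np.array([[rng.choice(N, p=p[i]) for i in range(n)]
                        for _ in range(I)])
    losses = np.array([loss_fn(s) for s in samples])
    L_avg = losses.mean()                          # line 5
    for i in range(n):
        grad = np.zeros(N)
        for k in range(I):
            j = samples[k, i]
            # grad_{p_i} log P(t_i^(k)) for a categorical is e_j / p_i[j]
            grad[j] += (losses[k] - L_avg) / p[i, j]
        grad /= I - 1                              # line 7: variance-reduced estimate
        p[i] -= eta * grad                         # line 8: gradient step
        # proj_C approximated by clipping to the simplex and renormalizing
        p[i] = np.clip(p[i], 1e-3, None)
        p[i] /= p[i].sum()
```

After the loop, each distribution p_i should put most of its mass on small indices, since those minimize the toy loss.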
Open Source Code | Yes | The code is available at https://github.com/shizhediao/Black-Box-Prompt-Learning.
Open Datasets | Yes | In order to examine the model's ability in generic classification tasks as well as domain-specific classification tasks, we include seven datasets from the GLUE benchmark (Wang et al., 2019): MNLI (Williams et al., 2018), QQP (Iyer et al., 2017), SST-2 (Socher et al., 2013), MRPC (Dolan & Brockett, 2005), CoLA (Warstadt et al., 2019), QNLI (Wang et al., 2019), RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and four domain-specific datasets: Citation Intent (Jurgens et al., 2018), SciERC (Luan et al., 2018), RCT (Dernoncourt & Lee, 2017), HyperPartisan (Kiesel et al., 2019) from specific domains including computer science, biomedical science, and news, following Gururangan et al. (2020) and Diao et al. (2021).
Dataset Splits | Yes | We follow Perez et al. (2021) to simulate a true k-shot learning setting. We randomly sample k examples from the original training set for each class to construct the training set, and another, different k examples to construct the validation set. The original validation set is used as the test set. Because the QQP and RCT validation sets are too large, we randomly sample 1K examples from each to save costs.
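The k-shot split described above can be sketched as follows: per class, draw k examples for training and k different examples for validation from the original training set, leaving the original validation set as the test set. Function and field names here are illustrative, not taken from the released code.

```python
import random
from collections import defaultdict

def k_shot_split(train_set, k, seed=42):
    """train_set: iterable of (text, label) pairs.
    Returns (few_train, few_val), each with k examples per class,
    disjoint from one another."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in train_set:
        by_label[label].append((text, label))
    few_train, few_val = [], []
    for label in sorted(by_label):
        examples = by_label[label]
        rng.shuffle(examples)
        few_train.extend(examples[:k])        # k shots for training
        few_val.extend(examples[k:2 * k])     # k different shots for validation
    return few_train, few_val
```

A fixed seed makes the sampled split reproducible across runs, which matters when reported few-shot numbers are averaged over seeds.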
Hardware Specification | Yes | For experiments on GPT-3, we directly call its APIs without any GPU for computation. Experiments on RoBERTa are conducted with NVIDIA 2080Ti GPUs with 11GB memory.
Software Dependencies | No | For RoBERTa-large experiments, we initialize it with pre-trained weights from Huggingface's Transformers library. The paper mentions Huggingface's Transformers library and AdamW, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For BDPL, we optimize the prompts with AdamW (Loshchilov & Hutter, 2019) for 30 epochs with a learning rate of 1e-4. The prompt length is 50, and the size of the candidate prompt list N is 100. Other hyper-parameters are detailed in Appendix A.3.
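The quoted setup can be summarized as a configuration fragment. The key names below are hypothetical and do not necessarily match the repository's actual command-line arguments; the values are the ones reported above.

```python
# Hypothetical BDPL configuration mirroring the reported setup.
bdpl_config = {
    "optimizer": "AdamW",        # Loshchilov & Hutter, 2019
    "epochs": 30,
    "learning_rate": 1e-4,
    "prompt_length": 50,         # number of discrete prompt tokens n
    "candidate_list_size": 100,  # size N of the candidate prompt list
}
```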