QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning
Authors: Yilun Kong, Hangyu Mao, Qi Zhao, Bin Zhang, Jingqing Ruan, Li Shen, Yongzhe Chang, Xueqian Wang, Rui Zhao, Dacheng Tao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios. We evaluate our method across different LLM scales on various natural language understanding and math reasoning tasks using both zero-shot and few-shot settings. |
| Researcher Affiliation | Collaboration | 1Tsinghua University; 2SenseTime Research; 3Institute of Automation, Chinese Academy of Sciences; 4School of Artificial Intelligence, University of Chinese Academy of Sciences; 5Sun Yat-sen University; 6Nanyang Technological University. Corresponding authors: Hangyu Mao <EMAIL>; Xueqian Wang <EMAIL> |
| Pseudocode | Yes | Algorithm 1 Query-Dependent Prompt Optimization. Require: Initial dataset D0, a collection set for task queries Q, the number of loops T, a pre-trained language model π0. ORL(·) denotes the offline RL fine-tuning process; R(·) represents evaluating the new query-prompt pairs on the target LLM and calculating the reward. |
| Open Source Code | Yes | Our code, as well as the offline datasets, is now publicly available at here. |
| Open Datasets | Yes | We perform experiments on 6 language understanding tasks and 2 math reasoning tasks to validate our method, including topic classification (AG's News (Zhang et al., 2015)), natural language inference (BoolQ (Clark et al., 2019)), sentiment classification (IMDB (Maas et al., 2011), TweetEval Emotion (Mohammad et al., 2018)), multi-choice QA (Cosmos QA (Huang et al., 2019), HellaSwag (Zellers et al., 2019)), and math reasoning (GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021)). |
| Dataset Splits | Yes | For the task datasets with a default testing or development set, we use their original split to obtain our testing set. If there is no official training/development/testing split, we randomly sample a reasonably large set for stable evaluation and testing. Additionally, for all tasks, we split off 10% of the training samples as a collection set for initial query collection and subsequent query augmentation, ensuring that the collected queries used to train the policy model do not appear among the in-context examples during few-shot evaluation, and simultaneously simulating a scenario where very few questions are available. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA V100 32GB GPU. |
| Software Dependencies | No | No specific versions of ancillary software libraries (e.g., Python, PyTorch, CUDA) are provided. The paper mentions specific LLM models used, but these are the subjects or primary tools of the research, not general software dependencies with version numbers for replication. |
| Experiment Setup | Yes | The parameters for the experiments are shown in Table 9: Loops 4, Batch size 128, Learning Rate 1e-3 for the 1st loop and 1e-4 for others, Train Epochs 100 for the 1st loop and 20 for others, Optimizer AdamW, Weight Decay 1e-4, Balancing Parameter λ 0.1. Also, 'Both training and testing are conducted on 3 seeds. We set the maximum expected reward as 100... For data augmentation, we adopt sampling generation with a top-k of 2 and top-p of 0.9, while for evaluation, we adopt greedy generation.' |
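The multi-loop structure described in Algorithm 1 can be sketched as follows. This is a minimal toy illustration of the control flow only, not the authors' implementation: `orl_finetune`, `reward`, and the prompt-generation step are hypothetical stubs standing in for the paper's ORL(·) fine-tuning, R(·) evaluation on the target LLM, and policy-model generation.

```python
# Toy sketch of Algorithm 1's multi-loop offline RL structure.
# All function bodies are illustrative stubs, not the paper's code.

def orl_finetune(policy, dataset):
    """Stub for ORL(.): offline RL fine-tuning on the current dataset.
    Here the 'policy' just records how much data it was trained on."""
    return policy + [len(dataset)]

def reward(query, prompt):
    """Stub for R(.): evaluating a query-prompt pair on the target LLM
    and returning a scalar reward (deterministic placeholder)."""
    return float(len(prompt) % 5)

def qpo_loop(d0, queries, T, policy0):
    """Run T loops: fine-tune on the offline dataset, generate new
    query-dependent prompts, score them, and augment the dataset."""
    dataset, policy = list(d0), policy0
    for _ in range(T):
        policy = orl_finetune(policy, dataset)                  # pi_t = ORL(D)
        new_pairs = [(q, f"prompt-for-{q}") for q in queries]   # generate prompts
        scored = [(q, p, reward(q, p)) for q, p in new_pairs]   # evaluate R(q, p)
        dataset.extend(scored)                                  # augment dataset
    return policy, dataset

policy, data = qpo_loop([], ["q1", "q2"], T=4, policy0=[])
```

With T=4 loops (the value from Table 9) and two queries, the offline dataset grows by one scored pair per query per loop, which is the data-accumulation pattern the algorithm relies on.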
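The splitting procedure in the Dataset Splits row (holding out 10% of training samples as the collection set, disjoint from the pool used for in-context examples) can be sketched like this. The 10% fraction comes from the paper; the function, seed handling, and interface are our own illustrative assumptions:

```python
import random

def make_collection_split(train_ids, frac=0.10, seed=0):
    """Hold out `frac` of the training samples as the collection set for
    query collection/augmentation; the remainder stays available as the
    in-context example pool, so the two never overlap."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    ids = list(train_ids)
    rng.shuffle(ids)
    n = max(1, int(len(ids) * frac))
    return ids[:n], ids[n:]     # (collection set, remaining training set)

collection, remaining = make_collection_split(range(1000))
```

Keeping the two partitions disjoint is what guarantees that queries used to train the policy model cannot leak into the few-shot in-context examples at evaluation time.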