QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning
Authors: Yilun Kong, Hangyu Mao, Qi Zhao, Bin Zhang, Jingqing Ruan, Li Shen, Yongzhe Chang, Xueqian Wang, Rui Zhao, Dacheng Tao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios. We evaluate our method across different LLM scales on various natural language understanding and math reasoning tasks using both zero-shot and few-shot settings. |
| Researcher Affiliation | Collaboration | 1Tsinghua University; 2SenseTime Research; 3Institute of Automation, Chinese Academy of Sciences; 4School of Artificial Intelligence, University of Chinese Academy of Sciences; 5Sun Yat-sen University; 6Nanyang Technological University. Corresponding authors: Hangyu Mao <EMAIL>; Xueqian Wang <EMAIL> |
| Pseudocode | Yes | Algorithm 1 Query-Dependent Prompt Optimization. Require: Initial dataset D0, a collection set for task queries Q, the number of loops T, a pre-trained language model π0. ORL(·) denotes the offline RL fine-tuning process; R(·) represents evaluating the new query-prompt pairs on the target LLM and calculating the reward. |
| Open Source Code | Yes | Our code, as well as the offline datasets, is now publicly available at here. |
| Open Datasets | Yes | We perform experiments on 6 language understanding tasks and 2 math reasoning tasks to validate our method, including topic classification (AG's News (Zhang et al., 2015)), natural language inference (BoolQ (Clark et al., 2019)), sentiment classification (IMDB (Maas et al., 2011), TweetEval Emotion (Mohammad et al., 2018)), multi-choice QA (Cosmos QA (Huang et al., 2019), HellaSwag (Zellers et al., 2019)), and math reasoning (GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021)). |
| Dataset Splits | Yes | For the task datasets with a default testing or development set, we use their original split to obtain our testing set. If there is no official training/development/testing split, we randomly sample a reasonably large set for stable evaluation and testing. Additionally, for all tasks, we split off 10% of the training samples as a collection set for initial query collection and subsequent query augmentation, ensuring that the collected queries used to train the policy model do not appear among the in-context examples during few-shot evaluation, and simultaneously simulating a scenario where very few questions are available. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA V100 32GB GPU. |
| Software Dependencies | No | No specific versions of ancillary software libraries (e.g., Python, PyTorch, CUDA) are provided. The paper mentions specific LLM models used, but these are the subjects or primary tools of the research, not general software dependencies with version numbers for replication. |
| Experiment Setup | Yes | The parameters for the experiments are shown in Table 9: Loops 4, Batch size 128, Learning Rate 1e-3 for the 1st loop and 1e-4 for others, Train Epochs 100 for the 1st loop and 20 for others, Optimizer AdamW, Weight Decay 1e-4, Balancing Parameter λ 0.1. Also, 'Both training and testing are conducted on 3 seeds. We set the maximum expected reward as 100... For data augmentation, we adopt sampling generation with a top-k of 2 and top-p of 0.9, while for evaluation, we adopt greedy generation.' |
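The multi-loop structure described in Algorithm 1 can be sketched as follows. This is a minimal toy illustration of the control flow only, not the authors' implementation: `orl_finetune`, `reward`, and the prompt-generation step are hypothetical stubs standing in for the paper's ORL(·) fine-tuning, R(·) evaluation on the target LLM, and policy-model generation.

```python
# Toy sketch of Algorithm 1's multi-loop offline RL structure.
# All function bodies are illustrative stubs, not the paper's code.

def orl_finetune(policy, dataset):
    """Stub for ORL(.): offline RL fine-tuning on the current dataset.
    Here the 'policy' just records how much data it was trained on."""
    return policy + [len(dataset)]

def reward(query, prompt):
    """Stub for R(.): evaluating a query-prompt pair on the target LLM
    and returning a scalar reward (deterministic placeholder)."""
    return float(len(prompt) % 5)

def qpo_loop(d0, queries, T, policy0):
    """Run T loops: fine-tune on the offline dataset, generate new
    query-dependent prompts, score them, and augment the dataset."""
    dataset, policy = list(d0), policy0
    for _ in range(T):
        policy = orl_finetune(policy, dataset)                  # pi_t = ORL(D)
        new_pairs = [(q, f"prompt-for-{q}") for q in queries]   # generate prompts
        scored = [(q, p, reward(q, p)) for q, p in new_pairs]   # evaluate R(q, p)
        dataset.extend(scored)                                  # augment dataset
    return policy, dataset

policy, data = qpo_loop([], ["q1", "q2"], T=4, policy0=[])
```

With T=4 loops (the value from Table 9) and two queries, the offline dataset grows by one scored pair per query per loop, which is the data-accumulation pattern the algorithm relies on.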
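The splitting procedure in the Dataset Splits row (holding out 10% of training samples as the collection set, disjoint from the pool used for in-context examples) can be sketched like this. The 10% fraction comes from the paper; the function, seed handling, and interface are our own illustrative assumptions:

```python
import random

def make_collection_split(train_ids, frac=0.10, seed=0):
    """Hold out `frac` of the training samples as the collection set for
    query collection/augmentation; the remainder stays available as the
    in-context example pool, so the two never overlap."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    ids = list(train_ids)
    rng.shuffle(ids)
    n = max(1, int(len(ids) * frac))
    return ids[:n], ids[n:]     # (collection set, remaining training set)

collection, remaining = make_collection_split(range(1000))
```

Keeping the two partitions disjoint is what guarantees that queries used to train the policy model cannot leak into the few-shot in-context examples at evaluation time.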