Pareto Prompt Optimization
Authors: Guang Zhao, Byung-Jun Yoon, Gilchan Park, Shantenu Jha, Shinjae Yoo, Xiaoning Qian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that ParetoPrompt consistently outperforms existing algorithms that use specific objective values. ParetoPrompt also yields robust performances when the objective metrics differ between training and testing. [...] We conduct experiments on single-sentence classification across various datasets using token infilling with a BERT model (Brown, 2020). [...] Experimental Results We use Hypervolume (HV) to evaluate the multi-objective performance of classification Accuracy and the prompt CoLA score, with the reference point set at (0, 0). |
| Researcher Affiliation | Academia | 1Brookhaven National Laboratory, 2Texas A&M University, 3Princeton Plasma Physics Laboratory 4Rutgers University New Brunswick, 5Princeton University |
| Pseudocode | Yes | A.1 PSEUDO-CODE FOR PARETOPROMPT Pseudo-code for ParetoPrompt is summarized in Algorithm 1. Algorithm 1 ParetoPrompt Require: Training dataset X, Reference model πref, Loss hyperparameters 1: Initialize policy model πθ ← πref 2: for epoch in range(num_epochs) do 3: for x in X do 4: if z1 ≻ z2 then loss = ld(z1, z2; x) 5: else if z2 ≻ z1 then loss = ld(z2, z1; x) 6: else loss = lnd(z1, z2; x) 7: end if 8: Update πθ with gradient descent on loss 9: end for 10: if (epoch % update_period) == 0 then πref ← πθ 11: end if 12: end for |
| Open Source Code | Yes | The code for our implementation is made available at https://github.com/guangzhao27/Pareto_Prompt. |
| Open Datasets | Yes | We conduct experiments on a diverse set of popular few-shot classification tasks, including MR (Pang & Lee, 2005), SST-5 (Socher et al., 2013), Yelp-5 and Yahoo (Zhang et al., 2015). [...] We conduct the task using the Yelp sentiment dataset (Shen et al., 2017) to convert Yelp negative reviews into positive ones while maintaining the content similarity. |
| Dataset Splits | Yes | For all datasets, we randomly sample 16 samples per class for both the training and validation sets. The final performance is evaluated using a sufficiently large test set. [...] We randomly select 50 negative reviews for training, 50 for evaluation, and a separate set of 100 for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions the use of specific models like BERT, RoBERTa-large, DistilGPT2, LLaMa 2 (7B), and GPT-2 XL, but not the hardware they ran on. |
| Software Dependencies | No | The paper does not specify particular software dependencies (e.g., library or solver names with version numbers like PyTorch 1.9 or CUDA 11.1). It refers to language models like RoBERTa-large, DistilGPT2, and GPT-2 XL, which are models, not specific software versions. |
| Experiment Setup | Yes | Hyperparameters for the loss functions in equations (1), (3) and (4) are set as β = 0.5, τ = 0.5, λ = 1 and ϵ = 0.1. Each RL algorithm runs for 6K iterations for training. For all RL-based algorithms (excluding InstOptima), 16 prompts are sampled for each iteration to calculate reward functions. Algorithms using dominance relationships (R-IPO and PP-DPO/IPO) employ 8 prompt comparison pairs for reward function calculation. [...] Each algorithm runs for 10K iterations for training, resulting in a total number of language model queries equal to 128 × 8 × 10,000. |
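The control flow of Algorithm 1 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `objectives`, `l_d`, `l_nd`, and `policy.step` are hypothetical placeholders standing in for the paper's objective evaluation, dominance/non-dominance losses, and gradient update, and objective vectors are assumed to be tuples of scores to maximize.

```python
import copy

def dominates(z1, z2):
    """z1 Pareto-dominates z2: no worse on every objective, strictly
    better on at least one (objectives assumed to be maximized)."""
    return (all(a >= b for a, b in zip(z1, z2))
            and any(a > b for a, b in zip(z1, z2)))

def train(policy, ref, X, objectives, l_d, l_nd,
          num_epochs=3, update_period=2):
    """Skeleton of Algorithm 1: pick the loss by dominance relation,
    update the policy, and periodically refresh the reference model.
    All callables here are hypothetical stand-ins."""
    for epoch in range(num_epochs):
        for x in X:
            # Two sampled prompts' objective vectors for input x
            z1, z2 = objectives(policy, x), objectives(policy, x)
            if dominates(z1, z2):
                loss = l_d(z1, z2, x)       # z1 preferred over z2
            elif dominates(z2, z1):
                loss = l_d(z2, z1, x)       # z2 preferred over z1
            else:
                loss = l_nd(z1, z2, x)      # non-dominated pair
            policy.step(loss)               # gradient-descent update
        if epoch % update_period == 0:
            ref = copy.deepcopy(policy)     # refresh reference model
    return policy, ref
```

The dominance test is the branching criterion the pseudocode's `≻` symbol denotes; the non-dominated branch is what lets the method learn from pairs where neither prompt is strictly better.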
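The Hypervolume (HV) metric quoted above, with reference point (0, 0), is the area of objective space (Accuracy × CoLA score) dominated by the evaluated points. A minimal 2-D sketch, assuming both objectives are maximized and scores exceed the reference point:

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` (maximization) relative to `ref`."""
    # Keep only points that strictly dominate the reference point
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Pareto filter: scan by x descending, keep strictly improving y
    pts.sort(key=lambda p: (-p[0], -p[1]))
    front, best_y = [], float("-inf")
    for x, y in pts:
        if y > best_y:
            front.append((x, y))
            best_y = y
    # Integrate vertical strips left to right (x ascending, y descending)
    front.reverse()
    hv, prev_x = 0.0, ref[0]
    for x, y in front:
        hv += (x - prev_x) * (y - ref[1])
        prev_x = x
    return hv
```

For example, `hypervolume_2d([(1, 3), (2, 2), (3, 1)])` returns 6.0, and adding a dominated point such as (1, 1) leaves the value unchanged.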