Pareto Prompt Optimization

Authors: Guang Zhao, Byung-Jun Yoon, Gilchan Park, Shantenu Jha, Shinjae Yoo, Xiaoning Qian

ICLR 2025

Reproducibility assessment: each variable is listed with its result and the supporting LLM response.
Research Type: Experimental
"Our experimental results show that ParetoPrompt consistently outperforms existing algorithms that use specific objective values. ParetoPrompt also yields robust performances when the objective metrics differ between training and testing. [...] We conduct experiments on single-sentence classification across various datasets using token infilling with a BERT model (Brown, 2020). [...] We use Hypervolume (HV) to evaluate the multi-objective performance of classification Accuracy and the prompt CoLA score, with the reference point set at (0, 0)."
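The quoted hypervolume metric with reference point (0, 0) reduces, in the two-objective case, to the area dominated by the Pareto front. A minimal sketch of that standard 2-D sweep is below; the routine and the example points are illustrative, not taken from the paper's evaluation code.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` relative to `ref` for a 2-D
    maximization problem (e.g. accuracy vs. CoLA score)."""
    # Keep only points that strictly improve on the reference point.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep by the first objective, descending, accumulating new area.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area

# Two non-dominated prompts scored as (accuracy, CoLA score):
print(hypervolume_2d([(0.8, 0.5), (0.6, 0.9)]))  # ≈ 0.64
```

Dominated points contribute no new area in the sweep, so the result depends only on the Pareto front.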
Researcher Affiliation: Academia
"1 Brookhaven National Laboratory, 2 Texas A&M University, 3 Princeton Plasma Physics Laboratory, 4 Rutgers University New Brunswick, 5 Princeton University"
Pseudocode: Yes
"A.1 PSEUDO-CODE FOR PARETOPROMPT. Pseudo-code for ParetoPrompt is summarized in Algorithm 1."

Algorithm 1 ParetoPrompt
Require: Training dataset X, reference model πref, loss hyperparameters
 1: Initialize policy model πθ ← πref
 2: for epoch in range(num_epochs) do
 3:   for x in X do
 4:     if z1 ≻ z2 then loss = ld(z1, z2; x)
 5:     else if z2 ≻ z1 then loss = ld(z2, z1; x)
 6:     else loss = lnd(z1, z2; x)
 7:     end if
 8:     Update πθ with gradient descent on loss
 9:   end for
10:   if (epoch % update_period) == 0 then πref ← πθ
11:   end if
12: end for
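The control flow of Algorithm 1 can be sketched in a few lines of Python. Everything model-specific is supplied by the caller, and the names (`train_pareto_prompt`, `ToyPolicy`, `l_d`, `l_nd`, `dominates`) are illustrative stand-ins, not from the released code; the real method trains a prompt-generating language model.

```python
import copy
import random

def dominates(z1, z2):
    """True if objective vector z1 Pareto-dominates z2 (maximization):
    no worse in every objective and strictly better in at least one."""
    return all(a >= b for a, b in zip(z1, z2)) and any(a > b for a, b in zip(z1, z2))

def train_pareto_prompt(X, pi_ref, objectives, l_d, l_nd, update_step,
                        num_epochs=10, update_period=5):
    """Mirrors Algorithm 1: pick a loss by dominance relation, take a
    gradient step, and periodically refresh the reference model."""
    pi_theta = copy.deepcopy(pi_ref)          # line 1: initialize from reference
    for epoch in range(num_epochs):
        for x in X:
            # Sample two candidate prompts and score their objective vectors.
            y1, y2 = pi_theta.sample(x), pi_theta.sample(x)
            z1, z2 = objectives(x, y1), objectives(x, y2)
            if dominates(z1, z2):             # line 4: z1 ≻ z2
                loss = l_d(z1, z2, x)
            elif dominates(z2, z1):           # line 5: z2 ≻ z1
                loss = l_d(z2, z1, x)
            else:                             # line 6: non-dominated pair
                loss = l_nd(z1, z2, x)
            update_step(pi_theta, loss)       # line 8: gradient step on loss
        if epoch % update_period == 0:        # line 10: refresh reference model
            pi_ref = copy.deepcopy(pi_theta)
    return pi_theta

class ToyPolicy:
    """Stand-in "policy" that emits a random score instead of a prompt."""
    def sample(self, x):
        return random.uniform(0.0, 1.0)
```

A usage call would pass a dataset, a `ToyPolicy`, an objectives function, the two loss functions, and an update callback; the dominance test alone decides which loss is applied to each sampled pair.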
Open Source Code: Yes
"The code for our implementation is made available at https://github.com/guangzhao27/Pareto_Prompt."
Open Datasets: Yes
"We conduct experiments on a diverse set of popular few-shot classification tasks, including MR (Pang & Lee, 2005), SST-5 (Socher et al., 2013), Yelp-5 and Yahoo (Zhang et al., 2015). [...] We conduct the task using the Yelp sentiment dataset (Shen et al., 2017) to convert Yelp negative reviews into positive ones while maintaining the content similarity."
Dataset Splits: Yes
"For all datasets, we randomly sample 16 samples per class for both the training and validation sets. The final performance is evaluated using a sufficiently large test set. [...] We randomly select 50 negative reviews for training, 50 for evaluation, and a separate set of 100 for testing."
Hardware Specification: No
The paper does not provide hardware details (exact GPU/CPU models, processor speeds, or memory amounts) for its experiments. It names the models used (BERT, RoBERTa-large, DistilGPT2, LLaMA 2 (7B), GPT-2 XL) but not the hardware they ran on.
Software Dependencies: No
The paper does not specify software dependencies with version numbers (e.g., PyTorch 1.9 or CUDA 11.1). It refers to language models such as RoBERTa-large, DistilGPT2, and GPT-2 XL, which are models, not versioned software packages.
Experiment Setup: Yes
"Hyperparameters for the loss functions in equations (1), (3) and (4) are set as β = 0.5, τ = 0.5, λ = 1 and ϵ = 0.1. Each RL algorithm runs for 6K iterations for training. For all RL-based algorithms (excluding InstOptima), 16 prompts are sampled for each iteration to calculate reward functions. Algorithms using dominance relationships (R-IPO and PP-DPO/IPO) employ 8 prompt comparison pairs for reward function calculation. [...] Each algorithm runs for 10K iterations for training, resulting in a total number of language model queries equal to 128 × 8 × 10,000."
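The quoted budgets are easy to cross-check with quick arithmetic. The snippet below merely restates the reported settings (the dictionary keys are illustrative labels, not names from the paper's code) and evaluates the stated query count for the 10K-iteration run.

```python
# Loss-function hyperparameters as quoted from the paper.
hparams = {"beta": 0.5, "tau": 0.5, "lambda": 1.0, "epsilon": 0.1}

# Reported budget: 10K training iterations with 128 × 8
# language-model queries per iteration.
queries_per_iter = 128 * 8
total_queries = queries_per_iter * 10_000
print(total_queries)  # 10,240,000 queries in total
```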