Efficient Prompt Optimization Through the Lens of Best Arm Identification

Authors: Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, Cong Shen

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments on multiple well-adopted tasks using various LLMs demonstrate the remarkable performance improvement of TRIPLE over baselines while satisfying the limited budget constraints." |
| Researcher Affiliation | Academia | Chengshuai Shi (University of Virginia, EMAIL), Kun Yang (University of Virginia, EMAIL), Zihan Chen (University of Virginia, EMAIL), Jundong Li (University of Virginia, EMAIL), Jing Yang (The Pennsylvania State University, EMAIL), Cong Shen (University of Virginia, EMAIL) |
| Pseudocode | Yes | "Their complete descriptions are provided in Algs. 2 and 3 of Appendix C." Algorithm 1 TRIPLE-CLST; Algorithm 3 TRIPLE-CR; Algorithm 4 TRIPLE-GSE; Algorithm 5 TRIPLE-CSAR; Algorithm 6 TRIPLE-SAR |
| Open Source Code | Yes | "The experimental codes can be found at https://github.com/ShenGroup/TRIPLE." |
| Open Datasets | Yes | "Extensive experimental results are reported to evaluate the efficiency of TRIPLE across diverse prompting tasks from two standard datasets: Instruction-Induction [30] and Big Bench [69]." |
| Dataset Splits | Yes | "Furthermore, to avoid overfitting and convergence issues, we adopt the standard approach by dividing our interaction data into training (80%) and validation (20%) sets." |
| Hardware Specification | Yes | "We use a workstation with two Nvidia A6000 Ada GPUs for all experiments using white-box LLMs (i.e., Llama2, Mistral, and Gemma)." |
| Software Dependencies | No | The paper mentions specific LLM models (GPT-3.5: gpt-3.5-turbo-1106; Llama2: Llama2-7b; Gemma: Gemma-7b; Mistral: Mistral-7B-v0.2) and OpenAI components (cl100k_base tokenizer, text-embedding-ada-002 model). While these are specific tools, the paper does not list broader software dependencies with explicit version numbers (e.g., Python, PyTorch/TensorFlow, CUDA, or other general libraries) that would be needed to replicate the entire experimental environment. |
| Experiment Setup | Yes | "In experiments with TRIPLE-CLST, the number of clusters is set as L = √\|P\| and a third of our total budget is allocated for the initial phase, i.e., N1 = N/3... For the APO framework... we set {num_feedback} to 2 and {num_prompts} to 5... in the implementation of TRIPLE-GSE, we first employ a projection to 64 dimensions... we set this error threshold at 0.1 in our experiments." |
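The quoted setup choices (L = √|P| clusters, N1 = N/3 phase-1 budget, 80/20 train/validation split) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (which lives in the linked repository); the function names and the example pool/budget sizes are hypothetical.

```python
import math
import random

def triple_clst_config(num_prompts: int, total_budget: int):
    """Hyperparameters quoted for TRIPLE-CLST:
    L = sqrt(|P|) clusters, N1 = N/3 budget for the initial phase."""
    num_clusters = max(1, round(math.sqrt(num_prompts)))
    phase1_budget = total_budget // 3
    return num_clusters, phase1_budget

def train_val_split(data, train_frac=0.8, seed=0):
    """Standard 80%/20% train/validation split described in the paper."""
    items = list(data)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

# Hypothetical example: a pool of 30 candidate prompts, budget of 150 calls.
L, N1 = triple_clst_config(num_prompts=30, total_budget=150)
train, val = train_val_split(range(100))
```

With |P| = 30 and N = 150 this yields L = 5 clusters and N1 = 50 phase-1 evaluations, and the split leaves 80 training and 20 validation examples.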