Efficient Prompt Optimization Through the Lens of Best Arm Identification
Authors: Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, Cong Shen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple well-adopted tasks using various LLMs demonstrate the remarkable performance improvement of TRIPLE over baselines while satisfying the limited budget constraints. |
| Researcher Affiliation | Academia | Chengshuai Shi (University of Virginia), Kun Yang (University of Virginia), Zihan Chen (University of Virginia), Jundong Li (University of Virginia), Jing Yang (The Pennsylvania State University), Cong Shen (University of Virginia) |
| Pseudocode | Yes | Their complete descriptions are provided in Algs. 2 and 3 of Appendix C. Algorithm 1 TRIPLE-CLST... Algorithm 3 TRIPLE-CR... Algorithm 4 TRIPLE-GSE... Algorithm 5 TRIPLE-CSAR... Algorithm 6 TRIPLE-SAR |
| Open Source Code | Yes | The experimental codes can be found at https://github.com/ShenGroup/TRIPLE. |
| Open Datasets | Yes | Extensive experimental results are reported to evaluate the efficiency of TRIPLE across diverse prompting tasks from two standard datasets: Instruction-Induction [30] and Big Bench [69]. |
| Dataset Splits | Yes | Furthermore, to avoid overfitting and convergence issues, we adopt the standard approach by dividing our interaction data into training (80%) and validation (20%) sets. |
| Hardware Specification | Yes | We use a workstation with two Nvidia-A6000 Ada GPUs for all experiments using white-box LLMs (i.e., Llama2, Mistral, and Gemma). |
| Software Dependencies | No | The paper mentions specific LLM models (GPT-3.5: gpt-3.5-turbo-1106, Llama2: Llama2-7b, Gemma: Gemma-7b, Mistral: Mistral-7B-v0.2) and OpenAI components (cl100k_base tokenizer, text-embedding-ada-002 model). While these are specific tools, the paper does not list broader software dependencies with explicit version numbers (e.g., Python version, PyTorch/TensorFlow version, CUDA version, or other general libraries) that would be needed to replicate the entire experimental environment. |
| Experiment Setup | Yes | In experiments with TRIPLE-CLST, the number of clusters is set as $L = \sqrt{\lvert \mathcal{P} \rvert}$ and a third of our total budget is allocated for the initial phase, i.e., $N_1 = N/3$... For the APO framework... we set {num_feedback} to 2 and {num_prompts} to 5... in the implementation of TRIPLE-GSE, we first employ a projection to 64 dimensions... we set this error threshold at 0.1 in our experiments. |
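
The setup values quoted in the table can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `clst_setup` and `train_val_split` are hypothetical, the cluster count is taken to be the square root of the prompt-pool size, and rounding to the nearest integer, integer division of the budget, and shuffling before the split are all assumptions that the quoted text does not specify.

```python
import math
import random


def clst_setup(num_prompts: int, total_budget: int) -> dict:
    """Hypothetical helper: derive TRIPLE-CLST style setup values.

    Assumes L = sqrt(|P|) rounded to the nearest integer and
    N1 = N / 3 rounded down; both rounding choices are assumptions.
    """
    num_clusters = max(1, round(math.sqrt(num_prompts)))
    phase_one_budget = total_budget // 3
    return {
        "num_clusters": num_clusters,
        "phase_one_budget": phase_one_budget,
        "phase_two_budget": total_budget - phase_one_budget,
    }


def train_val_split(records, train_frac: float = 0.8, seed: int = 0):
    """Hypothetical helper: shuffle records and split them 80/20.

    The shuffle and the fixed seed are assumptions for reproducibility.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Example: a pool of 30 candidate prompts and a budget of 150 evaluations.
    print(clst_setup(num_prompts=30, total_budget=150))
    train, val = train_val_split(range(10))
    print(len(train), len(val))
```

With a pool of 30 prompts and a budget of 150, this gives 5 clusters and 50 evaluations for the initial phase, leaving 100 for the second phase.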