Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective
Authors: Jianyu Wang, Zhiqiang Hu, Lidong Bing
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our search framework through large-scale experiments across tasks and LLMs. Our results show that pruning a low-shot ICL prompt could perform comparably to state-of-the-art methods on a range of tasks, while maintaining competitive runtime efficiency compared to prior optimization methods (e.g., Table 10 in Appendix). |
| Researcher Affiliation | Collaboration | 1 DAMO Academy, Alibaba Group; 2 Hupan Lab; 3 MiroMind. Correspondence to: Jianyu Wang <EMAIL>. |
| Pseudocode | Yes | We present the pseudocode of our TAPruning algorithm in Algorithm 1. [...] Algorithm 3 Genetic Prompt-Quine (PROMPTQUINE) Framework for Prompt Subsequence Search. [...] Algorithm 4 PROMPTQUINE's Generational GA (GGA) implementation for Prompt Subsequence Search. [...] Algorithm 5 PROMPTQUINE's Steady-state GA (SSGA) implementation for Prompt Subsequence Search. |
| Open Source Code | Yes | We also release some direct predictions of our pruned prompts on AdvBench in the GitHub repository1, ensuring rigor given the different data-separation schemes used across prior studies (Jiang et al., 2024; Paulus et al., 2024). 1github.com/jianyu-cs/PromptQuine/examples/jailbreaking/ |
| Open Datasets | Yes | We evaluate sentiment analysis (SST-2 (Socher et al., 2013), Yelp-5 (Asghar, 2016)), subjectivity classification (Subj; Pang & Lee, 2004), topic classification (AG's News (Zhang et al., 2015) and Yahoo (Labrou & Finin, 1999)), and natural language inference (SNLI (Bowman et al., 2015)). |
| Dataset Splits | Yes | During the search stage, each prompt's quality is evaluated on 200 samples from the official validation split (as our held-out set), or the training split if the validation split is unavailable (e.g., PIQA). We report performance on the official test set (i.e., validation set for PIQA). For jailbreaking... We split the original 520 samples into 100 for validation and 420 for testing, using samples from the validation set for fitness estimation and prompt selection. |
| Hardware Specification | Yes | We produce these experiments on one NVIDIA A100 GPU, following their default configurations on Meta-Llama-3-8B-Instruct. |
| Software Dependencies | No | In our current implementation, we mainly use batching along with efficient LLM serving tools, such as vLLM (Kwon et al., 2023)... For LLMLingua, we follow its default setup (Jiang et al., 2023c)... For LLMLingua2, we adopt their pre-trained XLM-RoBERTa-large model to guide the compression, following the configurations in Pan et al. (2024). This text mentions software components and models but does not provide specific version numbers for them. |
| Experiment Setup | Yes | We present our shared hyperparameters for both SSGA and GGA under 1-shot ICL pruning in Table 9. [...] We set λ1 and λ2 as 180 and 200, following (Deng et al., 2022). |
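To make the pseudocode row concrete, below is a minimal sketch of a generational GA for prompt subsequence search in the spirit of the paper's Algorithm 4. This is not the authors' implementation: candidates are encoded as binary keep-masks over prompt tokens, and `fitness_fn` stands in for the paper's held-out fitness estimate (e.g., accuracy on the ~200 validation samples mentioned above). All function names and hyperparameter values here are illustrative assumptions.

```python
import random


def random_mask(n, keep_prob=0.9):
    """Random binary keep-mask over n prompt tokens (1 = keep token)."""
    return [1 if random.random() < keep_prob else 0 for _ in range(n)]


def apply_mask(tokens, mask):
    """Prune the prompt: keep only tokens whose mask bit is 1."""
    return [t for t, m in zip(tokens, mask) if m]


def mutate(mask, flip_prob=0.05):
    """Flip each keep/drop bit independently with small probability."""
    return [1 - m if random.random() < flip_prob else m for m in mask]


def crossover(a, b):
    """One-point crossover between two parent masks."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]


def generational_ga(tokens, fitness_fn, pop_size=20, generations=10):
    """Toy generational GA over token-keep masks.

    fitness_fn maps a pruned token list to a score; in the paper's
    setting this would be task performance of the pruned ICL prompt
    on held-out samples. Here it is caller-supplied and abstract.
    """
    pop = [random_mask(len(tokens)) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank the population by fitness of the pruned prompts.
        ranked = sorted(pop, key=lambda m: fitness_fn(apply_mask(tokens, m)),
                        reverse=True)
        elites = ranked[: max(2, pop_size // 4)]
        # Replace the rest of the generation with mutated offspring.
        children = []
        while len(children) < pop_size - len(elites):
            p1, p2 = random.sample(elites, 2)
            children.append(mutate(crossover(p1, p2)))
        pop = elites + children
    best = max(pop, key=lambda m: fitness_fn(apply_mask(tokens, m)))
    return apply_mask(tokens, best)
```

In the paper's setup the fitness call would run the pruned prompt through the target LLM and score its predictions; a steady-state variant (Algorithm 5) would instead replace one or a few individuals per step rather than the whole generation.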