Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective
Authors: Jianyu Wang, Zhiqiang Hu, Lidong Bing
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our search framework through large-scale experiments across tasks and LLMs. Our results show that pruning a low-shot ICL prompt could perform comparably to state-of-the-art methods on a range of tasks, while maintaining competitive runtime efficiency compared to prior optimization methods (e.g., Table 10 in Appendix). |
| Researcher Affiliation | Collaboration | 1 DAMO Academy, Alibaba Group; 2 Hupan Lab; 3 MiroMind. Correspondence to: Jianyu Wang <EMAIL>. |
| Pseudocode | Yes | We present the pseudocode of our TAPruning algorithm in Algorithm 1. [...] Algorithm 3 Genetic Prompt-Quine (PROMPTQUINE) Framework for Prompt Subsequence Search. [...] Algorithm 4 PROMPTQUINE's Generational GA (GGA) implementation for Prompt Subsequence Search. [...] Algorithm 5 PROMPTQUINE's Steady-state GA (SSGA) implementation for Prompt Subsequence Search. |
| Open Source Code | Yes | We also release some direct predictions of our pruned prompts on AdvBench in the GitHub repository1, ensuring rigor given the different data-separation schemes used across prior studies (Jiang et al., 2024; Paulus et al., 2024). 1github.com/jianyu-cs/PromptQuine/examples/jailbreaking/ |
| Open Datasets | Yes | We evaluate sentiment analysis (SST-2 (Socher et al., 2013), Yelp-5 (Asghar, 2016)), subjectivity classification (Subj; Pang & Lee, 2004), topic classification (AG's News (Zhang et al., 2015) and Yahoo (Labrou & Finin, 1999)), and natural language inference (SNLI (Bowman et al., 2015)). |
| Dataset Splits | Yes | During the search stage, each prompt's quality is evaluated on 200 samples from the official validation split (as our held-out set), or the training split if the validation split is unavailable (e.g., PIQA). We report performance on the official test set (i.e., validation set for PIQA). For jailbreaking... We split the original 520 samples into 100 for validation and 420 for testing, using samples from the validation set for fitness estimation and prompt selection. |
| Hardware Specification | Yes | We produce these experiments on one NVIDIA A100 GPU, following their default configurations on Meta-Llama-3-8B-Instruct. |
| Software Dependencies | No | In our current implementation, we mainly use batching along with efficient LLM serving tools, such as vLLM (Kwon et al., 2023)... For LLMLingua, we follow its default setup (Jiang et al., 2023c)... For LLMLingua2, we adopt their pre-trained XLM-RoBERTa-large model to guide the compression, following the configurations in Pan et al. (2024). This text mentions software components and models but does not provide specific version numbers for them. |
| Experiment Setup | Yes | We present our shared hyperparameters for both SSGA and GGA under 1-shot ICL pruning in Table 9. [...] We set λ1 and λ2 as 180 and 200, following (Deng et al., 2022). |
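To make the pseudocode row concrete, below is a minimal sketch of a generational GA for prompt subsequence search in the spirit of the paper's Algorithm 4. This is not the authors' implementation: candidates are encoded as binary keep-masks over prompt tokens, and `fitness_fn` stands in for the paper's held-out fitness estimate (e.g., accuracy on the ~200 validation samples mentioned above). All function names and hyperparameter values here are illustrative assumptions.

```python
import random


def random_mask(n, keep_prob=0.9):
    """Random binary keep-mask over n prompt tokens (1 = keep token)."""
    return [1 if random.random() < keep_prob else 0 for _ in range(n)]


def apply_mask(tokens, mask):
    """Prune the prompt: keep only tokens whose mask bit is 1."""
    return [t for t, m in zip(tokens, mask) if m]


def mutate(mask, flip_prob=0.05):
    """Flip each keep/drop bit independently with small probability."""
    return [1 - m if random.random() < flip_prob else m for m in mask]


def crossover(a, b):
    """One-point crossover between two parent masks."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]


def generational_ga(tokens, fitness_fn, pop_size=20, generations=10):
    """Toy generational GA over token-keep masks.

    fitness_fn maps a pruned token list to a score; in the paper's
    setting this would be task performance of the pruned ICL prompt
    on held-out samples. Here it is caller-supplied and abstract.
    """
    pop = [random_mask(len(tokens)) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank the population by fitness of the pruned prompts.
        ranked = sorted(pop, key=lambda m: fitness_fn(apply_mask(tokens, m)),
                        reverse=True)
        elites = ranked[: max(2, pop_size // 4)]
        # Replace the rest of the generation with mutated offspring.
        children = []
        while len(children) < pop_size - len(elites):
            p1, p2 = random.sample(elites, 2)
            children.append(mutate(crossover(p1, p2)))
        pop = elites + children
    best = max(pop, key=lambda m: fitness_fn(apply_mask(tokens, m)))
    return apply_mask(tokens, best)
```

In the paper's setup the fitness call would run the pruned prompt through the target LLM and score its predictions; a steady-state variant (Algorithm 5) would instead replace one or a few individuals per step rather than the whole generation.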