CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation
Authors: Han He, Qianchu Liu, Lei Xu, Chaitanya Shivade, Yi Zhang, Sundararajan Srinivasan, Katrin Kirchhoff
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CriSPO on 4 state-of-the-art Large Language Models (LLMs) across 4 summarization and 5 Question Answering (QA) datasets. Extensive experiments show 3-4% ROUGE score improvement on summarization and substantial improvement of various metrics on QA. |
| Researcher Affiliation | Industry | Amazon AWS AI Labs |
| Pseudocode | No | The paper describes the CriSPO workflow and its components in text and with a diagram (Figure 1), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/amazon-science/CriSPO |
| Open Datasets | Yes | We select a diverse range of 4 summarization tasks including conventional document summarization tasks such as CNN/DailyMail (Hermann et al. 2015) (news headline summarization), and also conversation summarization tasks such as SAMSum (Gliwa et al. 2019) and MeetingBank (Hu et al. 2023). In addition, we test on a medical-domain clinical note summarization task, ACI-Bench (Yim et al. 2023). We benchmark CriSPO on 5 commonly used QA datasets, including 1) Wikipedia-based QA: Natural Questions (Kwiatkowski et al. 2019), TriviaQA (Joshi et al. 2017), SQuAD (Rajpurkar et al. 2016); 2) story-based abstractive reading comprehension: NarrativeQA (Kočiský et al. 2018); and 3) medical-domain multiple-choice QA: MedMCQA (Pal, Umapathi, and Sankarasubbu 2022). |
| Dataset Splits | No | The paper defines training, development, and test sets only conceptually: "Let Dtrn = {(xi, yi)}i=1...n be the training set, with a development set Ddev and a test set Dtst." It also states: "For efficiency, we only used a small fraction of the train and dev set for the experiments. The specific data settings are listed in Appendix C." However, specific split percentages or sample counts are not provided in the main text. |
| Hardware Specification | No | The paper mentions using Large Language Models (LLMs) such as Claude, Mistral 7B, and Llama3 8B. It does not provide any specific hardware details like GPU models, CPU types, or memory specifications used for running the experiments or training these models. |
| Software Dependencies | No | The paper mentions using specific LLMs (Claude, Mistral, Llama3) and metrics (ROUGE, AlignScore, BERTScore) but does not provide specific version numbers for any software, libraries, or programming languages used in their implementation. |
| Experiment Setup | No | The paper states, "Specific hyperparameters with ablations are detailed in Appendix D." However, the specific hyperparameter values and detailed training configurations are not provided within the main body of the paper. |