CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation

Authors: Han He, Qianchu Liu, Lei Xu, Chaitanya Shivade, Yi Zhang, Sundararajan Srinivasan, Katrin Kirchhoff

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CriSPO on 4 state-of-the-art Large Language Models (LLMs) across 4 summarization and 5 Question Answering (QA) datasets. Extensive experiments show a 3-4% ROUGE score improvement on summarization and substantial improvements on various QA metrics.
Researcher Affiliation | Industry | Amazon AWS AI Labs
Pseudocode | No | The paper describes the CriSPO workflow and its components in text and with a diagram (Figure 1), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/amazon-science/CriSPO
Open Datasets | Yes | We select a diverse range of 4 summarization tasks, including conventional document summarization tasks such as CNN/DailyMail (Hermann et al. 2015) (news headline summarization), and also conversation summarization tasks such as SAMSum (Gliwa et al. 2019) and MeetingBank (Hu et al. 2023). In addition, we test on a medical-domain clinical note summarization task, ACI-Bench (Yim et al. 2023). We benchmark CriSPO on 5 commonly used QA datasets: 1) Wikipedia-based QA: Natural Questions (Kwiatkowski et al. 2019), TriviaQA (Joshi et al. 2017), and SQuAD (Rajpurkar et al. 2016); 2) story-based abstractive reading comprehension: NarrativeQA (Kočiský et al. 2018); and 3) medical-domain multiple-choice QA: MedMCQA (Pal, Umapathi, and Sankarasubbu 2022).
Dataset Splits | No | The paper defines training, development, and test sets conceptually: "Let D_trn = {(x_i, y_i)}_{i=1...n} be the training set, with a development set D_dev and a test set D_tst." It also states: "For efficiency, we only used a small fraction of the train and dev set for the experiments. The specific data settings are listed in Appendix C." However, specific percentages or sample counts for these splits are not provided in the main text.
Hardware Specification | No | The paper mentions using Large Language Models (LLMs) such as Claude, Mistral 7B, and Llama3 8B. It does not provide any specific hardware details, such as GPU models, CPU types, or memory specifications, used to run the experiments or train these models.
Software Dependencies | No | The paper mentions specific LLMs (Claude, Mistral, Llama3) and metrics (ROUGE, AlignScore, BERTScore) but does not provide version numbers for any software, libraries, or programming languages used in the implementation.
Experiment Setup | No | The paper states, "Specific hyperparameters with ablations are detailed in Appendix D." However, these hyperparameter values and detailed training configurations are not provided within the main body of the paper.