CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation

Authors: Han He, Qianchu Liu, Lei Xu, Chaitanya Shivade, Yi Zhang, Sundararajan Srinivasan, Katrin Kirchhoff

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CriSPO on 4 state-of-the-art Large Language Models (LLMs) across 4 summarization and 5 Question Answering (QA) datasets. Extensive experiments show a 3-4% ROUGE score improvement on summarization and substantial improvements on various QA metrics.
Researcher Affiliation | Industry | Amazon AWS AI Labs
Pseudocode | No | The paper describes the CriSPO workflow and its components in text and with a diagram (Figure 1), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/amazon-science/CriSPO
Open Datasets | Yes | We select a diverse range of 4 summarization tasks, including conventional document summarization tasks such as CNN/DailyMail (Hermann et al. 2015) (news headline summarization), and also conversation summarization tasks such as SAMSum (Gliwa et al. 2019) and MeetingBank (Hu et al. 2023). In addition, we test on a medical-domain clinical note summarization task, ACI-Bench (Yim et al. 2023). We benchmark CriSPO on 5 commonly used QA datasets: 1) Wikipedia-based QA: Natural Questions (Kwiatkowski et al. 2019), TriviaQA (Joshi et al. 2017), and SQuAD (Rajpurkar et al. 2016); 2) story-based abstractive reading comprehension: NarrativeQA (Kočiský et al. 2018); and 3) medical-domain multiple-choice QA: MedMCQA (Pal, Umapathi, and Sankarasubbu 2022).
Dataset Splits | No | The paper defines training, development, and test sets conceptually: "Let D_trn = {(x_i, y_i)}_{i=1...n} be the training set, with a development set D_dev and a test set D_tst." It also states: "For efficiency, we only used a small fraction of the train and dev set for the experiments. The specific data settings are listed in Appendix C." However, specific percentages or sample counts for these splits are not provided in the main text.
Hardware Specification | No | The paper mentions using Large Language Models (LLMs) such as Claude, Mistral 7B, and Llama3 8B. It does not provide any specific hardware details, such as GPU models, CPU types, or memory specifications, used to run the experiments or train these models.
Software Dependencies | No | The paper mentions specific LLMs (Claude, Mistral, Llama3) and metrics (ROUGE, AlignScore, BERTScore) but does not provide version numbers for any software, libraries, or programming languages used in the implementation.
Experiment Setup | No | The paper states, "Specific hyperparameters with ablations are detailed in Appendix D." However, these hyperparameter values and detailed training configurations are not provided within the main body of the paper.