SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Authors: Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench, analyzing their distinctive safety refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner. |
| Researcher Affiliation | Academia | 1Princeton University 2Virginia Tech 3Stanford University 4UC Berkeley 5University of Illinois at Urbana-Champaign 6University of Chicago |
| Pseudocode | No | The paper describes methodologies and procedures in narrative text and figures, but it does not contain any clearly labeled pseudocode blocks, algorithms, or code-like formatted steps. |
| Open Source Code | Yes | Benchmark demo, data, code, and models are available through https://sorry-bench.github.io. |
| Open Datasets | Yes | Benchmark demo, data, code, and models are available through https://sorry-bench.github.io. ... Plus, we are hosting our datasets and evaluation code on public platforms (Hugging Face and GitHub). |
| Dataset Splits | Yes | These human annotations are further split into a train split of 440 * (3 ID + 3 OOD) = 2,640 records (used to directly train evaluators), and the remaining 4,400 as the test split. |
| Hardware Specification | Yes | All our experiments are conducted on our university's internal cluster, where each computing node is equipped with 4 Nvidia A100 GPUs (80GB). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software, libraries, or programming languages used in its implementation. |
| Experiment Setup | Yes | For most of the 56 LLMs benchmarked in Fig 4, we sample their responses once with no system prompt, at a temperature of 0.78, Top-P of 1.0, and limit the maximum generated tokens to 1,024 (see Appendix K for details). |
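
The decoding parameters and split arithmetic quoted above can be captured in a small sketch. This is not the authors' code; the parameter names follow the Hugging Face `generate` convention, which the paper does not specify, and only the numeric values are taken from the table rows:

```python
# Decoding configuration reported for most of the 56 benchmarked LLMs
# (values from the paper; key names are an assumed HF-style convention).
GENERATION_CONFIG = {
    "do_sample": True,          # responses are sampled, not greedy
    "temperature": 0.78,
    "top_p": 1.0,
    "max_new_tokens": 1024,
    "num_return_sequences": 1,  # each response is sampled once
}

# Dataset split arithmetic from the "Dataset Splits" row:
# 440 prompts, each with 3 in-distribution + 3 out-of-distribution
# human annotations, go to the train split; 4,400 records remain for test.
TRAIN_RECORDS = 440 * (3 + 3)   # = 2,640
TEST_RECORDS = 4400

print(GENERATION_CONFIG["temperature"], TRAIN_RECORDS, TEST_RECORDS)
```

The sampled (non-greedy) decoding means repeated runs can yield different refusal decisions, which is why the paper reports its one-sample setup explicitly.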