SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Authors: Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench, analyzing their distinctive safety refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner." |
| Researcher Affiliation | Academia | Princeton University; Virginia Tech; Stanford University; UC Berkeley; University of Illinois at Urbana-Champaign; University of Chicago |
| Pseudocode | No | The paper describes methodologies and procedures in narrative text and figures, but it does not contain any clearly labeled pseudocode blocks, algorithms, or code-like formatted steps. |
| Open Source Code | Yes | "Benchmark demo, data, code, and models are available through https://sorry-bench.github.io." |
| Open Datasets | Yes | "Benchmark demo, data, code, and models are available through https://sorry-bench.github.io. ... Plus, we are hosting our datasets and evaluation code on public platforms (Hugging Face and Github)." |
| Dataset Splits | Yes | "These human annotations are further split into a train split of 440 * (3 ID + 3 OOD) = 2,640 records (used to directly train evaluators), and the remaining 4,400 as the test split." |
| Hardware Specification | Yes | "All our experiments are conducted on our university's internal cluster, where each computing node is equipped with 4 Nvidia A100 GPUs (80GB)." |
| Software Dependencies | No | The paper does not provide specific version numbers for any software, libraries, or programming languages used in its implementation. |
| Experiment Setup | Yes | "For most of the 56 LLMs benchmarked in Fig 4, we sample their responses once with no system prompt, at a temperature of 0.78, Top-P of 1.0, and limit the max generated tokens by 1024 (see Appendix K for details)." |
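The decoding setup quoted above (one sample per prompt, no system prompt, temperature 0.78, Top-P 1.0, max 1024 generated tokens) can be captured as a request payload. This is a minimal sketch assuming an OpenAI-compatible chat-completions API; the model name, example prompt, and the `build_request` helper are illustrative placeholders, not part of the paper's released code.

```python
# Generation settings as reported in the paper's experiment setup.
GENERATION_CONFIG = {
    "temperature": 0.78,  # reported sampling temperature
    "top_p": 1.0,         # reported nucleus-sampling (Top-P) threshold
    "max_tokens": 1024,   # reported cap on generated tokens
    "n": 1,               # responses were sampled once per prompt
}


def build_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completion payload matching the stated setup:
    a single user turn and deliberately no system message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **GENERATION_CONFIG,
    }


req = build_request("placeholder-model", "example benchmark prompt")
```

The payload could then be sent to any OpenAI-compatible endpoint; per-model deviations from these defaults are described in the paper's Appendix K.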