SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Authors: Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench, analyzing their distinctive safety refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner. |
| Researcher Affiliation | Academia | 1Princeton University 2Virginia Tech 3Stanford University 4UC Berkeley 5University of Illinois at Urbana-Champaign 6University of Chicago |
| Pseudocode | No | The paper describes methodologies and procedures in narrative text and figures, but it does not contain any clearly labeled pseudocode blocks, algorithms, or code-like formatted steps. |
| Open Source Code | Yes | Benchmark demo, data, code, and models are available through https://sorry-bench.github.io. |
| Open Datasets | Yes | Benchmark demo, data, code, and models are available through https://sorry-bench.github.io. ... Plus, we are hosting our datasets and evaluation code on public platforms (Hugging Face and GitHub). |
| Dataset Splits | Yes | These human annotations are further split into a train split of 440 * (3 ID + 3 OOD) = 2,640 records (used to directly train evaluators), and the remaining 4,400 as the test split. |
| Hardware Specification | Yes | All our experiments are conducted on our university's internal cluster, where each computing node is equipped with 4 Nvidia A100 GPUs (80GB). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software, libraries, or programming languages used in its implementation. |
| Experiment Setup | Yes | For most of the 56 LLMs benchmarked in Fig 4, we sample their responses once with no system prompt, at a temperature of 0.78, Top-P of 1.0, and limit the maximum generated tokens to 1,024 (see Appendix K for details). |
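
The decoding parameters and split arithmetic quoted above can be captured in a small sketch. This is not the authors' code; the parameter names follow the Hugging Face `generate` convention, which the paper does not specify, and only the numeric values are taken from the table rows:

```python
# Decoding configuration reported for most of the 56 benchmarked LLMs
# (values from the paper; key names are an assumed HF-style convention).
GENERATION_CONFIG = {
    "do_sample": True,          # responses are sampled, not greedy
    "temperature": 0.78,
    "top_p": 1.0,
    "max_new_tokens": 1024,
    "num_return_sequences": 1,  # each response is sampled once
}

# Dataset split arithmetic from the "Dataset Splits" row:
# 440 prompts, each with 3 in-distribution + 3 out-of-distribution
# human annotations, go to the train split; 4,400 records remain for test.
TRAIN_RECORDS = 440 * (3 + 3)   # = 2,640
TEST_RECORDS = 4400

print(GENERATION_CONFIG["temperature"], TRAIN_RECORDS, TEST_RECORDS)
```

The sampled (non-greedy) decoding means repeated runs can yield different refusal decisions, which is why the paper reports its one-sample setup explicitly.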