CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models

Authors: Guangzhi Sun, Xiao Zhan, Shutong Feng, Phil Woodland, Jose Such

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments (p < 0.0001 from a z-test), underscoring the necessity of context in safety evaluations. ... Comprehensive analyses of LLM safety judgments and comparisons across a wide range of popular LLMs were conducted using CASE-Bench.
Researcher Affiliation | Academia | 1 Trinity College, University of Cambridge, Cambridge, United Kingdom 2 Department of Informatics, London, United Kingdom 3 Institut für Informatik, Heinrich-Heine-Universität Düsseldorf, Germany 4 VRAIN, Universitat Politècnica de València, Spain. Correspondence to: Guangzhi Sun <EMAIL>.
Pseudocode | No | The paper describes methodologies such as the data-creation pipeline (Fig. 2) and the application of CI theory in prose, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data used in the paper are available at https://github.com/BriansIDP/CASEBench.
Open Datasets | Yes | Code and data used in the paper are available at https://github.com/BriansIDP/CASEBench. ... CASE-Bench adopts the queries from SORRY-Bench (Xie et al., 2024)... Our dataset includes queries from SORRY-Bench (Xie et al., 2024), and access to these queries must comply with the researchers' agreement and requires granted access on Hugging Face.
Dataset Splits | No | The paper evaluates selected LLMs on CASE-Bench using several methods (binary classification, direct score, normalized token probabilities) over 900 query-context pairs. However, it does not provide explicit training/validation/test splits, since it evaluates pre-trained models rather than training a new model. The 'between-subjects design' it describes applies to human annotators, not to ML dataset splits.
Hardware Specification | Yes | Our experiments used 2 Nvidia A100 GPUs to perform inference for open-source LLMs.
Software Dependencies | Yes | Specifically, the power analysis was conducted using G*Power 3.1 (Erdfelder et al., 1996).
Experiment Setup | Yes | Specifically, the power analysis was conducted using G*Power 3.1 (Erdfelder et al., 1996). We assumed an effect size of f = 0.4... We set the alpha level (Type I error rate) at α = 0.05... we aimed for a power of 0.8 (80%)... increased the sample size to 21 annotators per task. ... The following three methods were examined to obtain the judgment from each model as well as the degree of harmlessness: binary classification, direct score, and normalized token probabilities.
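The significance result quoted under Research Type (p < 0.0001 from a z-test for the influence of context on human judgments) can be illustrated with a standard two-proportion z-test. This is a minimal sketch of the test mechanics only; the annotation counts below are hypothetical and are not taken from the paper.

```python
import math

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 180 of 210 annotators judge a query safe when the
# context is shown, versus 120 of 210 without context.
z, p = two_proportion_ztest(180, 210, 120, 210)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With proportions this far apart, the p-value falls well below 0.0001, matching the order of significance the paper reports for the effect of context.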
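Of the three judgment methods listed under Experiment Setup, "normalized token probabilities" is the least self-explanatory. A common reading is sketched below: renormalize the model's probability mass over the two candidate answer tokens so they sum to one, yielding a continuous harmlessness score. The token names and log-probability values here are illustrative assumptions, not the paper's exact implementation.

```python
import math

def normalized_safe_probability(logp_safe: float, logp_unsafe: float) -> float:
    """Renormalize probability mass over the two answer tokens so that
    p(safe) + p(unsafe) = 1, giving a degree-of-harmlessness score."""
    p_safe, p_unsafe = math.exp(logp_safe), math.exp(logp_unsafe)
    return p_safe / (p_safe + p_unsafe)

# Hypothetical first-token log-probabilities for the answers "safe" and
# "unsafe": 0.30 / (0.30 + 0.10) = 0.75.
score = normalized_safe_probability(math.log(0.30), math.log(0.10))
print(f"degree of harmlessness: {score:.2f}")
```

The renormalization step matters because the two answer tokens rarely absorb all of the model's probability mass, so the raw probabilities alone are not directly comparable across queries.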