CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models

Authors: Guangzhi Sun, Xiao Zhan, Shutong Feng, Phil Woodland, Jose Such

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments (p < 0.0001 from a z-test), underscoring the necessity of context in safety evaluations. ... Comprehensive analyses of LLM safety judgments and comparisons across a wide range of popular LLMs were conducted using CASE-Bench.
Researcher Affiliation | Academia | 1 Trinity College, University of Cambridge, Cambridge, United Kingdom 2 Department of Informatics, London, United Kingdom 3 Institut für Informatik, Heinrich-Heine-Universität Düsseldorf, Germany 4 VRAIN, Universitat Politècnica de València, Spain. Correspondence to: Guangzhi Sun <EMAIL>.
Pseudocode | No | The paper describes methodologies such as the data-creation pipeline (Fig. 2) and the application of CI theory in prose, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data used in the paper are available at https://github.com/BriansIDP/CASEBench.
Open Datasets | Yes | Code and data used in the paper are available at https://github.com/BriansIDP/CASEBench. ... CASE-Bench adopts the queries from SORRY-Bench (Xie et al., 2024)... Our dataset includes queries from SORRY-Bench (Xie et al., 2024), and access to these queries must comply with the researchers' agreement and requires granted access on Hugging Face.
Dataset Splits | No | The paper evaluates selected LLMs on CASE-Bench using several methods (binary classification, direct score, normalized token probabilities) over 900 query-context pairs. However, it does not provide explicit training/validation/test splits, since it evaluates pre-trained models rather than training a new model. The 'between-subjects design' it describes applies to human annotators, not to ML dataset splits.
Hardware Specification | Yes | Our experiments used 2 Nvidia A100 GPUs to perform inference for open-source LLMs.
Software Dependencies | Yes | Specifically, the power analysis was conducted using G*Power 3.1 (Erdfelder et al., 1996).
Experiment Setup | Yes | Specifically, the power analysis was conducted using G*Power 3.1 (Erdfelder et al., 1996). We assumed an effect size of f = 0.4... We set the alpha level (Type I error rate) at α = 0.05... we aimed for a power of 0.8 (80%)... increased the sample size to 21 annotators per task. ... The following three methods were examined to obtain the judgment from each model as well as the degree of harmlessness: binary classification, direct score, and normalized token probabilities.
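The significance result quoted under Research Type (p < 0.0001 from a z-test for the influence of context on human judgments) can be illustrated with a standard two-proportion z-test. This is a minimal sketch of the test mechanics only; the annotation counts below are hypothetical and are not taken from the paper.

```python
import math

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 180 of 210 annotators judge a query safe when the
# context is shown, versus 120 of 210 without context.
z, p = two_proportion_ztest(180, 210, 120, 210)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With proportions this far apart, the p-value falls well below 0.0001, matching the order of significance the paper reports for the effect of context.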
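Of the three judgment methods listed under Experiment Setup, "normalized token probabilities" is the least self-explanatory. A common reading is sketched below: renormalize the model's probability mass over the two candidate answer tokens so they sum to one, yielding a continuous harmlessness score. The token names and log-probability values here are illustrative assumptions, not the paper's exact implementation.

```python
import math

def normalized_safe_probability(logp_safe: float, logp_unsafe: float) -> float:
    """Renormalize probability mass over the two answer tokens so that
    p(safe) + p(unsafe) = 1, giving a degree-of-harmlessness score."""
    p_safe, p_unsafe = math.exp(logp_safe), math.exp(logp_unsafe)
    return p_safe / (p_safe + p_unsafe)

# Hypothetical first-token log-probabilities for the answers "safe" and
# "unsafe": 0.30 / (0.30 + 0.10) = 0.75.
score = normalized_safe_probability(math.log(0.30), math.log(0.10))
print(f"degree of harmlessness: {score:.2f}")
```

The renormalization step matters because the two answer tokens rarely absorb all of the model's probability mass, so the raw probabilities alone are not directly comparable across queries.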