OR-Bench: An Over-Refusal Benchmark for Large Language Models
Authors: Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/benchllms and our codebase is open-sourced at https://github.com/justincui03/or-bench. |
| Researcher Affiliation | Academia | Justin Cui (UCLA), Wei-Lin Chiang (UC Berkeley), Ion Stoica (UC Berkeley), Cho-Jui Hsieh (UCLA). Correspondence to: Justin Cui <EMAIL>, Cho-Jui Hsieh <EMAIL>. |
| Pseudocode | No | The paper describes methods and a pipeline in prose (e.g., Section 3.2 "Over-Refusal Prompt generation" and Figure 2), but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our datasets are publicly available at https://huggingface.co/benchllms and our codebase is open-sourced at https://github.com/justincui03/or-bench. |
| Open Datasets | Yes | Our datasets are publicly available at https://huggingface.co/benchllms and our codebase is open-sourced at https://github.com/justincui03/or-bench. |
| Dataset Splits | No | The paper introduces a benchmark (OR-Bench) consisting of 80,000 over-refusal prompts, a subset of 1,000 hard prompts, and 600 toxic prompts, all used to evaluate existing LLMs. It does not specify training/validation/test splits, since its focus is benchmarking pre-existing models rather than training new ones. |
| Hardware Specification | No | Section 4.1 "Experiment setup" states: "We benchmark 32 models from 8 families, both black-box and open-source, including Claude-2.1, 3, and 3.5, Gemini-1.0-pro, Gemini-1.5-{flash, pro}, and the open-source Gemma series, GPT-3.5-turbo-{0125, 0301, 0613}, GPT-4-0125-preview, GPT-4-turbo-2024-04-09, original GPT-4o, and GPT-4o-08-06, as well as all Llama models... All models are tested via public APIs without system prompts to ensure unbiased evaluation." Because the experiments run on external API infrastructure, the paper does not report specific hardware used by the authors for their own computation or analysis. |
| Software Dependencies | No | The paper mentions models and tools used (e.g., "Mixtral 8x7B", "GPT-4", "Mistral-7B-Instruct-v0.3"), but does not list specific software dependencies (libraries, frameworks) with version numbers that would be required to set up the development environment or run the codebase. |
| Experiment Setup | Yes | Section 4.1 "Experiment setup" states: "All models are tested via public APIs without system prompts to ensure unbiased evaluation (Röttger et al., 2023; Zheng et al.)." The caption for Table 2 and the first paragraph of Section 5 (Ablation study) explicitly mention: "Results are measured with temperature 0.0." |
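The reported setup (public APIs, no system prompt, temperature 0.0) can be sketched as a request builder. This is a minimal illustration assuming an OpenAI-style chat-completion payload; the function name `build_eval_request` and the example model/prompt strings are ours, not from the paper, and the actual evaluation code lives in the authors' repository.

```python
def build_eval_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completion payload matching the reported setup:
    a single user turn (no system prompt) and temperature 0.0 so that
    responses are as deterministic as the API allows."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],  # user turn only
        "temperature": 0.0,
    }

# Example usage with one of the benchmarked model identifiers:
payload = build_eval_request(
    "gpt-3.5-turbo-0125",
    "How can I safely dispose of old batteries?",
)
```

Keeping the system prompt empty matters here: a provider-specific system prompt could change refusal behavior, which is exactly what the benchmark is trying to measure in an unbiased way.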
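To make the "over-refusal" metric concrete: for each seemingly safe prompt, the question is whether the model declined to answer, and the over-refusal rate is the fraction of such refusals. The keyword heuristic below is a hypothetical stand-in of our own; the paper's actual judging method (not detailed in this excerpt) is more sophisticated than string matching.

```python
# Hypothetical refusal markers for illustration only; not the paper's judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_refusal(response: str) -> bool:
    """Crude check: does the response contain a common refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def over_refusal_rate(responses: list[str]) -> float:
    """Fraction of responses to safe prompts that the model refused."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```

A model that answers every safe prompt scores 0.0; one that declines half of them scores 0.5. OR-Bench's "hard" subset targets prompts where even strong models push this rate up.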