OR-Bench: An Over-Refusal Benchmark for Large Language Models
Authors: Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/benchllms and our codebase is open-sourced at https://github.com/justincui03/or-bench. |
| Researcher Affiliation | Academia | Justin Cui (UCLA), Wei-Lin Chiang (UC Berkeley), Ion Stoica (UC Berkeley), Cho-Jui Hsieh (UCLA). Correspondence to: Justin Cui <EMAIL>, Cho-Jui Hsieh <EMAIL>. |
| Pseudocode | No | The paper describes methods and a pipeline in prose (e.g., Section 3.2 "Over-Refusal Prompt generation" and Figure 2), but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our datasets are publicly available at https://huggingface.co/benchllms and our codebase is open-sourced at https://github.com/justincui03/or-bench. |
| Open Datasets | Yes | Our datasets are publicly available at https://huggingface.co/benchllms and our codebase is open-sourced at https://github.com/justincui03/or-bench. |
| Dataset Splits | No | The paper introduces a benchmark (OR-Bench) consisting of 80,000 over-refusal prompts, a subset of 1,000 hard prompts, and 600 toxic prompts, all used to evaluate existing LLMs. It does not specify training/validation/test splits, since its focus is benchmarking pre-existing models rather than training new ones. |
| Hardware Specification | No | Section 4.1 "Experiment setup" states: "We benchmark 32 models from 8 families, both black-box and open-source, including Claude-2.1, 3, and 3.5, Gemini-1.0-pro, Gemini-1.5-{flash, pro}, and the open-source Gemma series, GPT-3.5-turbo-{0125, 0301, 0613}, GPT-4-0125-preview, GPT-4-turbo-2024-04-09, original GPT-4o, and GPT-4o-08-06, as well as all Llama models... All models are tested via public APIs without system prompts to ensure unbiased evaluation." Because the experiments run on external API infrastructure, the paper does not report specific hardware used by the authors for their own computation or analysis. |
| Software Dependencies | No | The paper mentions models and tools used (e.g., "Mixtral 8x7B", "GPT-4", "Mistral-7B-Instruct-v0.3"), but does not list specific software dependencies (libraries, frameworks) with version numbers that would be required to set up the development environment or run the codebase. |
| Experiment Setup | Yes | Section 4.1 "Experiment setup" states: "All models are tested via public APIs without system prompts to ensure unbiased evaluation (Röttger et al., 2023; Zheng et al.)." The caption for Table 2 and the first paragraph of Section 5 (Ablation study) explicitly mention: "Results are measured with temperature 0.0." |
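The reported setup (public APIs, no system prompt, temperature 0.0) can be sketched as a request builder. This is a minimal illustration assuming an OpenAI-style chat-completion payload; the function name `build_eval_request` and the example model/prompt strings are ours, not from the paper, and the actual evaluation code lives in the authors' repository.

```python
def build_eval_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completion payload matching the reported setup:
    a single user turn (no system prompt) and temperature 0.0 so that
    responses are as deterministic as the API allows."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],  # user turn only
        "temperature": 0.0,
    }

# Example usage with one of the benchmarked model identifiers:
payload = build_eval_request(
    "gpt-3.5-turbo-0125",
    "How can I safely dispose of old batteries?",
)
```

Keeping the system prompt empty matters here: a provider-specific system prompt could change refusal behavior, which is exactly what the benchmark is trying to measure in an unbiased way.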
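To make the "over-refusal" metric concrete: for each seemingly safe prompt, the question is whether the model declined to answer, and the over-refusal rate is the fraction of such refusals. The keyword heuristic below is a hypothetical stand-in of our own; the paper's actual judging method (not detailed in this excerpt) is more sophisticated than string matching.

```python
# Hypothetical refusal markers for illustration only; not the paper's judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_refusal(response: str) -> bool:
    """Crude check: does the response contain a common refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def over_refusal_rate(responses: list[str]) -> float:
    """Fraction of responses to safe prompts that the model refused."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```

A model that answers every safe prompt scores 0.0; one that declines half of them scores 0.5. OR-Bench's "hard" subset targets prompts where even strong models push this rate up.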