DarkBench: Benchmarking Dark Patterns in Large Language Models

Authors: Esben Kran, Hieu Minh Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Jurewicz

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We introduce DarkBench, a comprehensive benchmark for detecting dark design patterns... We evaluate models from five leading companies (OpenAI, Anthropic, Meta, Mistral, Google) and find that some LLMs are explicitly designed to favor their developers' products and exhibit untruthful communication, among other manipulative behaviors."
Researcher Affiliation | Industry | Esben Kran (Apart Research), Jord Nguyen (Apart Research), Akash Kundu (Apart Research), Sami Jawhar (METR), Jinsuk Park (Independent), Mateusz Jurewicz (Independent)
Pseudocode | No | The paper describes methodologies in prose and uses figures (e.g., Figures 1, 2, and 3) to illustrate concepts and benchmark construction, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | "The code used in this paper can be found here. The steps to reproduce the paper are:
1. Clone the repo.
2. Open the repo in Cursor or VS Code and run "Reopen in Container". Make sure you have the Dev Containers extension and Docker installed.
3. If you wish not to use Docker, run poetry install.
4. Run dvc pull to pull all the data."
The link for "here" is missing, making access ambiguous.
Open Datasets | Yes | The DarkBench benchmark is available at huggingface.co/datasets/anonymous152311/darkbench.
Dataset Splits | No | "The DarkBench benchmark comprises 660 prompts across six categories... We test 14 proprietary and open source models on the DarkBench benchmark." The paper describes the creation and use of a single benchmark for evaluation, but does not specify any training/validation/test splits within this benchmark.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments or evaluations.
Software Dependencies | Yes | "The cosine similarity of embeddings using text-embedding-3-large (OpenAI, 2024b)... The annotator models we use are Claude 3.5 Sonnet (Anthropic, 2024), Gemini 1.5 Pro (Reid et al., 2024), and GPT-4o (OpenAI, 2024a)."
Experiment Setup | Yes | "Model temperatures were all set at 0 for reproducibility. We took one response per question."
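The software-dependencies row references cosine similarity over text-embedding-3-large embeddings. A minimal sketch of that measure, using toy vectors as stand-ins for real embedding outputs (no embedding API is called here):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model embeddings:
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]
print(round(cosine_similarity(v1, v2), 4))  # parallel vectors -> 1.0
```

In practice the vectors would come from the embedding model; the similarity itself is independent of which embedding API produced them.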
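The experiment setup row (temperature 0, one response per question) can be sketched as the following evaluation loop. `collect_responses` and `EchoClient` are hypothetical names for illustration, not the authors' code; a real run would substitute an actual LLM API client:

```python
def collect_responses(client, prompts, temperature=0.0):
    """Query each prompt exactly once at a fixed temperature for reproducibility."""
    return [client.complete(prompt, temperature=temperature) for prompt in prompts]

class EchoClient:
    """Hypothetical stand-in for a real LLM API client."""
    def complete(self, prompt, temperature):
        # Deterministic decoding, matching the paper's temperature-0 setup.
        assert temperature == 0.0
        return f"response to: {prompt}"

responses = collect_responses(EchoClient(), ["p1", "p2"])
print(len(responses))  # one response per prompt -> 2
```

Fixing the temperature at 0 makes decoding (near-)deterministic, which is why a single response per prompt suffices for the benchmark.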