Potemkin Understanding in Large Language Models
Authors: Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations. |
| Researcher Affiliation | Academia | ¹Massachusetts Institute of Technology, ²University of Chicago, ³Harvard University, ⁴Massachusetts Institute of Technology. Correspondence to: Marina Mancoridis <EMAIL>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes methodologies in narrative text and figures. |
| Open Source Code | Yes | All collected data, annotations, and analysis are made publicly available at the Potemkin Benchmark Repository.3 (footnote 3: https://github.com/MarinaMancoridis/PotemkinBenchmark.git) |
| Open Datasets | Yes | All collected data, annotations, and analysis are made publicly available at the Potemkin Benchmark Repository.3 (footnote 3: https://github.com/MarinaMancoridis/PotemkinBenchmark.git). We collect a benchmark dataset across three domains: literary techniques, game theory, and psychological biases, collecting 3,159 labeled data points. |
| Dataset Splits | No | The paper describes the creation and collection of a benchmark dataset for evaluation but does not specify training, validation, or test splits for models, as it focuses on evaluating pre-existing LLMs rather than training new ones. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list software dependencies with version numbers. It mentions using APIs from OpenAI, Together.AI, Anthropic, and Google, but without specific versions. |
| Experiment Setup | No | The paper describes the experimental setup for evaluating large language models on a custom benchmark, detailing how prompts were constructed and responses annotated across different tasks (definition, classification, generation, editing). However, it does not specify hyperparameters, optimizer settings, or other training configurations, since the study evaluates pre-existing LLMs rather than training new models. |
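The benchmark procedure summarized above scores a model on a "keystone" task (stating a concept's definition) and then on use tasks (classification, generation, editing); a potemkin is a case where the keystone succeeds but use fails. A minimal sketch of that scoring idea, with hypothetical field names and toy data not taken from the benchmark repository:

```python
from dataclasses import dataclass

@dataclass
class Item:
    keystone_correct: bool  # did the model define the concept correctly?
    use_correct: bool       # did it then apply the concept correctly?

def potemkin_rate(items):
    """Fraction of keystone-correct items where the use task fails.

    This is an illustrative lower-bound-style tally, not the paper's
    exact estimator.
    """
    keystone = [it for it in items if it.keystone_correct]
    if not keystone:
        return 0.0
    return sum(1 for it in keystone if not it.use_correct) / len(keystone)

# Toy data: one failure among three keystone-correct items.
sample = [
    Item(True, True),
    Item(True, False),
    Item(False, False),
    Item(True, True),
]
print(potemkin_rate(sample))
```

Conditioning on keystone success is the point of the design: it isolates cases where the model can recite the concept yet cannot deploy it.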