Potemkin Understanding in Large Language Models
Authors: Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations. |
| Researcher Affiliation | Academia | ¹Massachusetts Institute of Technology, ²University of Chicago, ³Harvard University, ⁴Massachusetts Institute of Technology. Correspondence to: Marina Mancoridis <EMAIL>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes methodologies in narrative text and figures. |
| Open Source Code | Yes | All collected data, annotations, and analysis are made publicly available at the Potemkin Benchmark Repository.3 (footnote 3: https://github.com/MarinaMancoridis/PotemkinBenchmark.git) |
| Open Datasets | Yes | All collected data, annotations, and analysis are made publicly available at the Potemkin Benchmark Repository.3 (footnote 3: https://github.com/MarinaMancoridis/PotemkinBenchmark.git). We collect a benchmark dataset across three domains: literary techniques, game theory, and psychological biases, collecting 3,159 labeled data points. |
| Dataset Splits | No | The paper describes the creation and collection of a benchmark dataset for evaluation but does not specify training, validation, or test splits for models, as it focuses on evaluating pre-existing LLMs rather than training new ones. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list software dependencies with version numbers. It mentions using APIs from OpenAI, Together.AI, Anthropic, and Google, but without specific versions. |
| Experiment Setup | No | The paper describes the experimental setup for evaluating large language models on a custom benchmark, detailing how prompts were constructed and responses annotated across different tasks (definition, classification, generation, editing). However, it does not specify hyperparameters, optimizer settings, or other training configurations, since the study evaluates pre-existing LLMs rather than training new models. |
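The benchmark procedure summarized above scores a model on a "keystone" task (stating a concept's definition) and then on use tasks (classification, generation, editing); a potemkin is a case where the keystone succeeds but use fails. A minimal sketch of that scoring idea, with hypothetical field names and toy data not taken from the benchmark repository:

```python
from dataclasses import dataclass

@dataclass
class Item:
    keystone_correct: bool  # did the model define the concept correctly?
    use_correct: bool       # did it then apply the concept correctly?

def potemkin_rate(items):
    """Fraction of keystone-correct items where the use task fails.

    This is an illustrative lower-bound-style tally, not the paper's
    exact estimator.
    """
    keystone = [it for it in items if it.keystone_correct]
    if not keystone:
        return 0.0
    return sum(1 for it in keystone if not it.use_correct) / len(keystone)

# Toy data: one failure among three keystone-correct items.
sample = [
    Item(True, True),
    Item(True, False),
    Item(False, False),
    Item(True, True),
]
print(potemkin_rate(sample))
```

Conditioning on keystone success is the point of the design: it isolates cases where the model can recite the concept yet cannot deploy it.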