Reflection-Bench: Evaluating Epistemic Agency in Large Language Models

Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a comprehensive evaluation of 16 models using three prompting strategies, we identify a clear three-tier performance hierarchy and significant limitations of current LLMs, particularly in meta-reflection capabilities. We conducted comprehensive evaluations using entry-level configurations across leading large reasoning models, mainstream LLMs, and the Qwen-2.5 family with varying sizes. Three prompting strategies were employed, including direct generation, free output, and Chain of Thought (CoT) (Wei et al., 2024).
Researcher Affiliation | Academia | 1 Shanghai Artificial Intelligence Laboratory, China; 2 Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, China. Correspondence to: Yan Teng <EMAIL>, Chunbo Li <EMAIL>.
Pseudocode | No | The paper describes the methodology in prose, detailing the cognitive tests and their adaptations (Sections 3.2 and 3.3), but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/AI45Lab/Reflection-Bench.
Open Datasets | Yes | Our code and data are available at https://github.com/AI45Lab/Reflection-Bench.
Dataset Splits | No | The paper describes the parameters and number of trials for each cognitive test, such as 'x trials' for WCST with rules changing every 'x/6 trials' and '20 blocks of n trials each' for MBT. However, it does not specify traditional training/test/validation dataset splits for static datasets, as the evaluation involves interactive cognitive tasks where models learn in-context.
Hardware Specification | No | All evaluations are conducted through respective model APIs.
Software Dependencies | No | The paper states 'All evaluations are conducted through respective model APIs' and mentions 'OpenAI's text-embedding-3-large' for automated scoring, but it does not provide specific version numbers for software libraries or dependencies used to run the experiments.
Experiment Setup | Yes | We first evaluate 16 LLMs on the Reflection-Bench: ... Three prompting strategies were employed, including direct generation, free output, and Chain of Thought (CoT) (Wei et al., 2024). The results reveal a clear three-tier hierarchy... Table 1 (Easy) details the number of trials and parameter settings for each task.

Table 1. Experiment settings
Task | Parameters (Easy) | Parameters (Hard)
WPT | p=0.9 | p=0.8
WCST | x=72 | x=90
Oddball | NA | NA
N-back | n=2 | n=4
DC-IGT | Ploss = {.5, .1, .5, .1} | Ploss = {.5, .2, .5, .2}
PRLT | p=0.8 | p=0.7
MBT | n=2 | n=4
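To make Table 1's settings easier to reuse, here is a minimal sketch that re-expresses the Easy/Hard parameters as a Python configuration dict. The dict layout and key names (e.g. EXPERIMENT_SETTINGS, P_loss) are assumptions chosen for illustration; only the parameter values and the trial-structure comments come from the paper.

```python
# Table 1's Easy/Hard settings as a configuration dict (sketch, not the authors' code).
EXPERIMENT_SETTINGS = {
    "Easy": {
        "WPT": {"p": 0.9},
        "WCST": {"x": 72},        # rules change every x/6 trials
        "Oddball": {},            # no tunable parameter reported
        "N-back": {"n": 2},
        "DC-IGT": {"P_loss": [0.5, 0.1, 0.5, 0.1]},
        "PRLT": {"p": 0.8},
        "MBT": {"n": 2},          # 20 blocks of n trials each
    },
    "Hard": {
        "WPT": {"p": 0.8},
        "WCST": {"x": 90},
        "Oddball": {},
        "N-back": {"n": 4},
        "DC-IGT": {"P_loss": [0.5, 0.2, 0.5, 0.2]},
        "PRLT": {"p": 0.7},
        "MBT": {"n": 4},
    },
}
```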
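The Experiment Setup row names three prompting strategies (direct generation, free output, and CoT), and the Hardware Specification row notes that all evaluations ran through model APIs. The sketch below shows how such a harness could look with the OpenAI chat-completions client; the prompt suffixes, model name, and the query_model helper are assumptions for illustration, not the authors' harness.

```python
# Hedged sketch of the three prompting strategies; wording of the suffixes is assumed.
from openai import OpenAI

client = OpenAI()

STRATEGY_SUFFIX = {
    "direct": "Answer with your choice only, without any explanation.",
    "free": "",  # free output: no constraint on the response format
    "cot": "Think step by step, then state your final choice.",
}

def query_model(model: str, task_prompt: str, strategy: str) -> str:
    """Send one trial's prompt to a chat-completion API under a given strategy."""
    prompt = f"{task_prompt}\n{STRATEGY_SUFFIX[strategy]}".strip()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example (illustrative prompt): query_model("gpt-4o", "Predict the next outcome.", "cot")
```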
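The Software Dependencies row mentions OpenAI's text-embedding-3-large for automated scoring. Below is a minimal sketch of embedding-based similarity scoring under that assumption; the cosine-similarity rule and the reference-answer setup are illustrative choices, since the paper's exact scoring procedure is not reproduced here.

```python
# Sketch of automated scoring with text-embedding-3-large (assumed scoring rule).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings with OpenAI's text-embedding-3-large."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def similarity_score(model_answer: str, reference_answer: str) -> float:
    """Cosine similarity between a model answer and a reference answer."""
    a, b = embed([model_answer, reference_answer])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```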