Reflection-Bench: Evaluating Epistemic Agency in Large Language Models
Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a comprehensive evaluation of 16 models using three prompting strategies, we identify a clear three-tier performance hierarchy and significant limitations of current LLMs, particularly in meta-reflection capabilities. We conducted comprehensive evaluations using entry-level configurations across leading large reasoning models, mainstream LLMs, and the Qwen-2.5 family with varying sizes. Three prompting strategies were employed, including direct generation, free output, and Chain of Thought (CoT) (Wei et al., 2024). |
| Researcher Affiliation | Academia | 1Shanghai Artificial Intelligence Laboratory, China 2Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, China. Correspondence to: Yan Teng <EMAIL>, Chunbo Li <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose, detailing the cognitive tests and their adaptations (Sections 3.2 and 3.3), but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/AI45Lab/Reflection-Bench. |
| Open Datasets | Yes | Our code and data are available at https://github.com/AI45Lab/Reflection-Bench. |
| Dataset Splits | No | The paper describes the parameters and number of trials for each cognitive test, such as 'x trials' for WCST with rules changing every 'x/6 trial' and '20 blocks of n trials each' for MBT. However, it does not specify traditional training/test/validation dataset splits for static datasets, as the evaluation involves interactive cognitive tasks where models learn in-context. |
| Hardware Specification | No | All evaluations are conducted through respective model APIs. |
| Software Dependencies | No | The paper states 'All evaluations are conducted through respective model APIs' and mentions 'OpenAI’s text-embedding-3-large' for automated scoring, but it does not provide specific version numbers for software libraries or dependencies used to run the experiments. |
| Experiment Setup | Yes | We first evaluate 16 LLMs on the Reflection-Bench: ... Three prompting strategies were employed, including direct generation, free output, and Chain of Thought (CoT) (Wei et al., 2024). The results reveal a clear three-tier hierarchy... Table 1 (Easy) details the number of trials and parameter settings for each task. Table 1 (Experiment settings, Easy / Hard): WPT p=0.9 / p=0.8; WCST x=72 / x=90; Oddball NA / NA; N-back n=2 / n=4; DC-IGT Ploss = {.5, .1, .5, .1} / {.5, .2, .5, .2}; PRLT p=0.8 / p=0.7; MBT n=2 / n=4. |
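The WCST setting above (x=72 trials, with the sorting rule changing every x/6 trials, as noted under Dataset Splits) can be sketched as a trial schedule. This is a minimal illustrative sketch, not the authors' released implementation: the rule names and the random-choice logic are assumptions for demonstration.

```python
import random

def wcst_rule_schedule(x=72, rules=("color", "shape", "number")):
    """Return the active sorting rule for each of x trials.

    The rule switches every x // 6 trials (Easy setting: x=72,
    i.e. 6 blocks of 12 trials). Rule names are hypothetical.
    """
    block = x // 6  # rule changes every x/6 trials
    schedule = []
    rule = None
    for trial in range(x):
        if trial % block == 0:
            # at each block boundary, switch to a different rule
            rule = random.choice([r for r in rules if r != rule])
        schedule.append(rule)
    return schedule

schedule = wcst_rule_schedule()  # 72 entries, constant within each 12-trial block
```

Under this sketch, an evaluated model would be prompted once per trial and must infer the latent rule (and its switches) purely in context, which is why no train/test split exists.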