Reflection-Bench: Evaluating Epistemic Agency in Large Language Models

Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a comprehensive evaluation of 16 models using three prompting strategies, we identify a clear three-tier performance hierarchy and significant limitations of current LLMs, particularly in meta-reflection capabilities. We conducted comprehensive evaluations using entry-level configurations across leading large reasoning models, mainstream LLMs, and the Qwen-2.5 family with varying sizes. Three prompting strategies were employed, including direct generation, free output, and Chain of Thought (CoT) (Wei et al., 2024).
Researcher Affiliation | Academia | 1 Shanghai Artificial Intelligence Laboratory, China; 2 Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, China. Correspondence to: Yan Teng <EMAIL>, Chunbo Li <EMAIL>.
Pseudocode | No | The paper describes the methodology in prose, detailing the cognitive tests and their adaptations (Sections 3.2 and 3.3), but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/AI45Lab/Reflection-Bench.
Open Datasets | Yes | Our code and data are available at https://github.com/AI45Lab/Reflection-Bench.
Dataset Splits | No | The paper describes the parameters and number of trials for each cognitive test, such as 'x trials' for WCST with rules changing every 'x/6 trials' and '20 blocks of n trials each' for MBT. However, it does not specify traditional training/test/validation dataset splits for static datasets, as the evaluation involves interactive cognitive tasks where models learn in-context.
Hardware Specification | No | All evaluations are conducted through respective model APIs.
Software Dependencies | No | The paper states 'All evaluations are conducted through respective model APIs' and mentions 'OpenAI's text-embedding-3-large' for automated scoring, but it does not provide specific version numbers for software libraries or dependencies used to run the experiments.
Experiment Setup | Yes | We first evaluate 16 LLMs on the Reflection-Bench: ... Three prompting strategies were employed, including direct generation, free output, and Chain of Thought (CoT) (Wei et al., 2024). The results reveal a clear three-tier hierarchy... Table 1 (Easy) details the number of trials and parameter settings for each task.

Table 1. Experiment settings
Task | Parameters (Easy) | Parameters (Hard)
WPT | p=0.9 | p=0.8
WCST | x=72 | x=90
Oddball | NA | NA
N-back | n=2 | n=4
DC-IGT | Ploss = {.5, .1, .5, .1} | Ploss = {.5, .2, .5, .2}
PRLT | p=0.8 | p=0.7
MBT | n=2 | n=4
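To make Table 1's settings easier to reuse, here is a minimal sketch that re-expresses the Easy/Hard parameters as a Python configuration dict. The dict layout and key names (e.g. EXPERIMENT_SETTINGS, P_loss) are assumptions chosen for illustration; only the parameter values and the trial-structure comments come from the paper.

```python
# Table 1's Easy/Hard settings as a configuration dict (sketch, not the authors' code).
EXPERIMENT_SETTINGS = {
    "Easy": {
        "WPT": {"p": 0.9},
        "WCST": {"x": 72},        # rules change every x/6 trials
        "Oddball": {},            # no tunable parameter reported
        "N-back": {"n": 2},
        "DC-IGT": {"P_loss": [0.5, 0.1, 0.5, 0.1]},
        "PRLT": {"p": 0.8},
        "MBT": {"n": 2},          # 20 blocks of n trials each
    },
    "Hard": {
        "WPT": {"p": 0.8},
        "WCST": {"x": 90},
        "Oddball": {},
        "N-back": {"n": 4},
        "DC-IGT": {"P_loss": [0.5, 0.2, 0.5, 0.2]},
        "PRLT": {"p": 0.7},
        "MBT": {"n": 4},
    },
}
```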
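The Experiment Setup row names three prompting strategies (direct generation, free output, and CoT), and the Hardware Specification row notes that all evaluations ran through model APIs. The sketch below shows how such a harness could look with the OpenAI chat-completions client; the prompt suffixes, model name, and the query_model helper are assumptions for illustration, not the authors' harness.

```python
# Hedged sketch of the three prompting strategies; wording of the suffixes is assumed.
from openai import OpenAI

client = OpenAI()

STRATEGY_SUFFIX = {
    "direct": "Answer with your choice only, without any explanation.",
    "free": "",  # free output: no constraint on the response format
    "cot": "Think step by step, then state your final choice.",
}

def query_model(model: str, task_prompt: str, strategy: str) -> str:
    """Send one trial's prompt to a chat-completion API under a given strategy."""
    prompt = f"{task_prompt}\n{STRATEGY_SUFFIX[strategy]}".strip()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example (illustrative prompt): query_model("gpt-4o", "Predict the next outcome.", "cot")
```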
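The Software Dependencies row mentions OpenAI's text-embedding-3-large for automated scoring. Below is a minimal sketch of embedding-based similarity scoring under that assumption; the cosine-similarity rule and the reference-answer setup are illustrative choices, since the paper's exact scoring procedure is not reproduced here.

```python
# Sketch of automated scoring with text-embedding-3-large (assumed scoring rule).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings with OpenAI's text-embedding-3-large."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def similarity_score(model_answer: str, reference_answer: str) -> float:
    """Cosine similarity between a model answer and a reference answer."""
    a, b = embed([model_answer, reference_answer])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```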