Emergent Response Planning in LLMs
Authors: Zhichen Dong, Zhanhui Zhou, Zhixuan Liu, Chao Yang, Chaochao Lu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experimental results across six tasks, showing that LLM hidden prompt representations encode rich information about upcoming responses and can be used to probe and predict global response attributes. Insight 1: Models present emergent planning on structure, content, and behavior attributes, which can be probed with high accuracy (Fig. 2). Our in-dataset probing experiments (where probes are trained and tested on different splits of the same prompt dataset) reveal that both base and fine-tuned models encode structure, content, and behavior attributes, with fine-tuned models showing superior performance. |
| Researcher Affiliation | Academia | 1Shanghai Artificial Intelligence Laboratory 2Work done while at Shanghai Artificial Intelligence Laboratory. Correspondence to: Chao Yang <EMAIL>, Zhichen Dong <EMAIL>, Zhanhui Zhou <EMAIL>. |
| Pseudocode | No | The paper describes methods but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is provided or made publicly available, nor does it include any repository links. |
| Open Datasets | Yes | Datasets and links: Ultrachat (Ding et al., 2023) https://huggingface.co/datasets/stingning/ultrachat ; Alpaca Eval (Taori et al., 2023) https://huggingface.co/datasets/tatsu-lab/alpaca ; GSM8K (Cobbe et al., 2021) https://huggingface.co/datasets/openai/gsm8k ; MATH (Saxton et al., 2019) https://huggingface.co/datasets/deepmind/math_dataset ; TinyStories (Eldan & Li, 2023) https://huggingface.co/datasets/roneneldan/TinyStories ; ROCStories (Mostafazadeh et al., 2016) https://huggingface.co/datasets/Ximing/ROCStories ; CommonsenseQA (Talmor et al., 2019) https://huggingface.co/datasets/tau/commonsense_qa ; Social IQA (Sap et al., 2019) https://huggingface.co/datasets/allenai/social_i_qa ; MedMCQA (Pal et al., 2022) https://huggingface.co/datasets/openlifescienceai/medmcqa ; ARC-Challenge (Clark et al., 2018) https://huggingface.co/datasets/allenai/ai2_arc ; CREAK (Onoe et al., 2021) https://huggingface.co/datasets/amydeng2000/CREAK ; FEVER (Thorne et al., 2018) https://huggingface.co/datasets/fever/fever |
| Dataset Splits | Yes | Datasets are split 60 : 20 : 20 for train-validation-test. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that are required to reproduce the experiments. |
| Experiment Setup | Yes | We train one-hidden-layer MLPs with ReLU activation, with hidden sizes chosen among W = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024}. The output size is 1 for regression and the number of classes with a softmax layer for classification. Each probe is trained for 400 epochs using MSE loss for regression and cross-entropy loss for classification. Datasets are split 60 : 20 : 20 for train-validation-test. We perform a grid search over MLP hidden sizes W and representation layers H (as inputs to the probes), reporting the test scores for the best hyperparameters. |
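The probe design quoted above (a one-hidden-layer ReLU MLP with a softmax head, trained for 400 epochs with cross-entropy loss on hidden representations) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the synthetic data, learning rate, and variable names are assumptions, and real probes would consume actual LLM hidden states and sweep the hidden size W and representation layer H as the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: D = hidden-state dim, W = probe hidden size
# (one value from the paper's grid), C = number of classes, N = samples.
D, W, C, N = 64, 32, 3, 300
X = rng.normal(size=(N, D))          # stand-in for prompt hidden states
true_w = rng.normal(size=(D, C))
y = (X @ true_w).argmax(axis=1)      # synthetic attribute labels

# One-hidden-layer MLP probe: Linear -> ReLU -> Linear -> softmax.
W1 = rng.normal(scale=0.1, size=(D, W)); b1 = np.zeros(W)
W2 = rng.normal(scale=0.1, size=(W, C)); b2 = np.zeros(C)

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)                  # ReLU hidden layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)        # softmax probs

lr = 0.5                              # assumed; the paper does not state it
losses = []
for _ in range(400):                  # 400 epochs, as in the paper
    h, p = forward(X)
    losses.append(-np.log(p[np.arange(N), y] + 1e-12).mean())
    g = p.copy(); g[np.arange(N), y] -= 1.0; g /= N   # dCE/dlogits
    dW2 = h.T @ g;  db2 = g.sum(axis=0)
    dh = g @ W2.T;  dh[h <= 0] = 0.0                  # ReLU backprop
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

acc = (forward(X)[1].argmax(axis=1) == y).mean()
```

For regression-style attributes, the same probe would instead use an output size of 1 and MSE loss; the paper's grid search over W and representation layers H would wrap this training loop, selecting hyperparameters on the validation split of the 60:20:20 partition.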