Understanding Chain-of-Thought in LLMs through Information Theory
Authors: Jean-Francois Ton, Muhammad Faaiz Taufiq, Yang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800K datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual subtasks. In this section, we demonstrate our framework's utility, dubbed Information-Gain (IG), and compare it against two baselines for detecting errors in a model's CoT reasoning. |
| Researcher Affiliation | Collaboration | Jean-François Ton*¹, Muhammad Faaiz Taufiq*¹, Yang Liu² (* denotes equal contribution, where ordering was determined through a coin flip). ¹ByteDance Seed, ²UC Santa Cruz. Correspondence to: Jean-François Ton <EMAIL>, Muhammad Faaiz Taufiq <EMAIL>. |
| Pseudocode | No | The paper mentions proposing a practical algorithm but does not include structured pseudocode or an algorithm block. For example, it states: 'Based on this framework, we propose a practical algorithm to assess the task-wise performance of models.' without presenting the algorithm itself in a formatted block. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to code repositories. |
| Open Datasets | Yes | We validate our methods on extensive toy data, the GSM8K (Cobbe et al., 2021) as well as the PRM800K (Lightman et al., 2023) dataset. To further demonstrate the practical applicability of our method, we have conducted an additional experiment on OpenAI's PRM800K dataset (Lightman et al., 2023), which is obtained by labeling the intermediate steps of the MATH dataset (Hendrycks et al., 2021). |
| Dataset Splits | Yes | Having trained the supervisor model on the data generated above, we evaluate the information-gain on a held-out dataset split. Additionally, we also used the sample-wise information-gain (IG) as well as the ORM baseline to classify if a step is correct (as outlined in Section 3.3). To avoid ambiguity, we filtered out the neutral substeps (with labels 0) for this experiment and considered a balanced held-out dataset with equal number of correct and incorrect steps. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware used for running its experiments (e.g., GPU models, CPU types, or cloud computing instances with specifications). |
| Software Dependencies | No | The paper mentions using specific models like "GPT-4", "GPT-2", and "Llama-3-8B", and fine-tuning with "Low Rank Adaptation (LoRA)", but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | No | The paper describes aspects of the experimental setup, such as how toy data was generated and the construction of training data for the supervisor model. For example, in C.2.1, it details: "The generated outputs are used to construct training examples, where each intermediate step is concatenated with the final correct answer using the separator token #|> ". However, it lacks specific hyperparameter values (e.g., learning rate, batch size, number of epochs for fine-tuning the supervisor models) or other detailed training configurations. |
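The training-data construction quoted above (each intermediate step concatenated with the final correct answer via the separator token `#|>`) can be sketched as follows. This is an illustrative reconstruction only: the function and variable names are hypothetical, and the paper does not specify the exact prefixing or formatting details.

```python
# Hypothetical sketch of the supervisor-model training-example construction
# described in Appendix C.2.1 of the reviewed paper: each CoT prefix is
# concatenated with the final correct answer using the separator "#|>".
# All names here are illustrative; only the separator token comes from the paper.

SEP = "#|>"

def build_training_examples(steps, final_answer):
    """Return one (input, target) pair per intermediate reasoning step.

    The input is the CoT prefix up to and including that step, followed by
    the separator; the target is the final correct answer. A supervisor
    model trained on such pairs can predict the answer from partial
    reasoning, which is the quantity needed to estimate per-step
    information gain.
    """
    examples = []
    for i in range(len(steps)):
        prefix = " ".join(steps[: i + 1])
        examples.append((f"{prefix} {SEP}", final_answer))
    return examples

demo = build_training_examples(
    ["Tom has 3 apples.", "He buys 2 more.", "3 + 2 = 5."], "5"
)
```

Each prefix/answer pair would then be tokenized and used to fine-tune the supervisor model (the paper mentions LoRA fine-tuning but, as noted above, gives no hyperparameters).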