Understanding Chain-of-Thought in LLMs through Information Theory
Authors: Jean-Francois Ton, Muhammad Faaiz Taufiq, Yang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800K datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual subtasks. In this section, we demonstrate our framework's utility, dubbed Information-Gain (IG), and compare it against two baselines for detecting errors in a model's CoT reasoning. |
| Researcher Affiliation | Collaboration | Jean-François Ton*¹, Muhammad Faaiz Taufiq*¹, Yang Liu² (* denotes equal contribution, where ordering was determined through a coin flip). ¹ByteDance Seed, ²UC Santa Cruz. Correspondence to: Jean-François Ton <EMAIL>, Muhammad Faaiz Taufiq <EMAIL>. |
| Pseudocode | No | The paper mentions proposing a practical algorithm but does not include structured pseudocode or an algorithm block. For example, it states: 'Based on this framework, we propose a practical algorithm to assess the task-wise performance of models.' without presenting the algorithm itself in a formatted block. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to code repositories. |
| Open Datasets | Yes | We validate our methods on extensive toy data, the GSM8K (Cobbe et al., 2021) as well as the PRM800K (Lightman et al., 2023) dataset. To further demonstrate the practical applicability of our method, we have conducted an additional experiment on OpenAI's PRM800K dataset (Lightman et al., 2023), which is obtained by labeling the intermediate steps of the MATH dataset (Hendrycks et al., 2021). |
| Dataset Splits | Yes | Having trained the supervisor model on the data generated above, we evaluate the information-gain on a held-out dataset split. Additionally, we also used the sample-wise information-gain (IG) as well as the ORM baseline to classify if a step is correct (as outlined in Section 3.3). To avoid ambiguity, we filtered out the neutral substeps (with labels 0) for this experiment and considered a balanced held-out dataset with equal number of correct and incorrect steps. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware used for running its experiments (e.g., GPU models, CPU types, or cloud computing instances with specifications). |
| Software Dependencies | No | The paper mentions using specific models like "GPT-4", "GPT-2", and "Llama-3-8B", and fine-tuning with "Low Rank Adaptation (LoRA)", but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | No | The paper describes aspects of the experimental setup, such as how toy data was generated and the construction of training data for the supervisor model. For example, in C.2.1, it details: "The generated outputs are used to construct training examples, where each intermediate step is concatenated with the final correct answer using the separator token #|> ". However, it lacks specific hyperparameter values (e.g., learning rate, batch size, number of epochs for fine-tuning the supervisor models) or other detailed training configurations. |
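The training-data construction quoted above (each intermediate step concatenated with the final correct answer via the separator token `#|>`) can be sketched as follows. This is an illustrative reconstruction only: the function and variable names are hypothetical, and the paper does not specify the exact prefixing or formatting details.

```python
# Hypothetical sketch of the supervisor-model training-example construction
# described in Appendix C.2.1 of the reviewed paper: each CoT prefix is
# concatenated with the final correct answer using the separator "#|>".
# All names here are illustrative; only the separator token comes from the paper.

SEP = "#|>"

def build_training_examples(steps, final_answer):
    """Return one (input, target) pair per intermediate reasoning step.

    The input is the CoT prefix up to and including that step, followed by
    the separator; the target is the final correct answer. A supervisor
    model trained on such pairs can predict the answer from partial
    reasoning, which is the quantity needed to estimate per-step
    information gain.
    """
    examples = []
    for i in range(len(steps)):
        prefix = " ".join(steps[: i + 1])
        examples.append((f"{prefix} {SEP}", final_answer))
    return examples

demo = build_training_examples(
    ["Tom has 3 apples.", "He buys 2 more.", "3 + 2 = 5."], "5"
)
```

Each prefix/answer pair would then be tokenized and used to fine-tune the supervisor model (the paper mentions LoRA fine-tuning but, as noted above, gives no hyperparameters).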