Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation
Authors: Juno Kim, Denny Wu, Jason D. Lee, Taiji Suzuki
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model. All proofs are deferred to the appendix. A discussion of additional related works is provided in Appendix A. Metastable dynamics and hitting times are studied in Appendices B-C, optimization dynamics are analyzed in Appendix D, and learning-theoretic lower bounds are given in Appendix E. |
| Researcher Affiliation | Academia | University of Tokyo and RIKEN AIP; New York University and Flatiron Institute; Princeton University. |
| Pseudocode | Yes | Algorithm 1 (Two-stage Pretraining), Algorithm 2 (Sparse Edge Search), Algorithm 3 (PPO-Clip), Algorithm 4 (Meta-chain Distillation) |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | No | The paper introduces a perturbed Markov chain model for CoT reasoning that differentiates between easy and hard reasoning steps through a dense-sparse structure. It does not describe any experiments conducted on specific open datasets. |
| Dataset Splits | No | The paper does not conduct experiments on datasets; therefore, it does not specify any dataset splits. |
| Hardware Specification | No | The paper focuses on theoretical analysis and mathematical proofs; therefore, no specific hardware used for experiments is mentioned. |
| Software Dependencies | No | The paper is a theoretical work and does not describe any experimental implementation or software dependencies with version numbers. |
| Experiment Setup | No | The paper focuses on theoretical modeling and proofs, not on experimental results. Therefore, it does not provide details on experimental setup, hyperparameters, or training configurations. |