Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation

Authors: Juno Kim, Denny Wu, Jason D. Lee, Taiji Suzuki

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model. All proofs are deferred to the appendix. A discussion of additional related works is provided in Appendix A. Metastable dynamics and hitting times are studied in Appendices B-C, optimization dynamics are analyzed in Appendix D, and learning-theoretic lower bounds are given in Appendix E.
Researcher Affiliation | Academia | 1. University of Tokyo and RIKEN AIP; 2. New York University and Flatiron Institute; 3. Princeton University.
Pseudocode | Yes | Algorithm 1: Two-stage Pretraining; Algorithm 2: Sparse Edge Search; Algorithm 3: PPO-Clip; Algorithm 4: Meta-chain Distillation.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository.
Open Datasets | No | The paper introduces a perturbed Markov chain model for CoT reasoning that differentiates between easy and hard reasoning steps through a dense-sparse structure. It does not describe any experiments conducted on specific open datasets.
Dataset Splits | No | The paper does not conduct experiments on datasets and therefore does not specify any dataset splits.
Hardware Specification | No | The paper focuses on theoretical analysis and mathematical proofs; therefore, no specific hardware used for experiments is mentioned.
Software Dependencies | No | The paper is a theoretical work and does not describe any experimental implementation or software dependencies with version numbers.
Experiment Setup | No | The paper focuses on theoretical modeling and proofs, not on experimental results. Therefore, it does not provide details on experimental setup, hyperparameters, or training configurations.
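The paper's central claim — that rewarding sparse inter-cluster edges shortens the expected number of reasoning steps needed to reach a different cluster — can be illustrated with a toy simulation. The sketch below is an assumption-laden stand-in for the paper's construction, not its exact model: two clusters of states connected by a single "sparse" gateway edge of probability eps, with dense uniform moves inside each cluster. Boosting eps (mimicking a search or finetuned policy that favors the sparse edge) sharply reduces the mean hitting time of the other cluster.

```python
import random

def hitting_time(eps, n_per_cluster=4, max_steps=10**6, rng=None):
    """Steps until a walk started in cluster 0 first enters cluster 1.

    States 0..n-1 form cluster 0, states n..2n-1 form cluster 1.
    State n-1 is a "gateway": with probability eps it traverses the
    sparse inter-cluster edge; otherwise the walk makes a dense move,
    jumping uniformly within its current cluster.
    """
    rng = rng or random.Random(0)
    n = n_per_cluster
    state = 0
    for t in range(1, max_steps + 1):
        if state == n - 1 and rng.random() < eps:
            state = n                        # sparse inter-cluster edge
        else:
            base = 0 if state < n else n
            state = base + rng.randrange(n)  # dense intra-cluster move
        if state >= n:
            return t
    return max_steps

rng = random.Random(42)
trials = 2000
base = sum(hitting_time(eps=0.01, rng=rng) for _ in range(trials)) / trials
boosted = sum(hitting_time(eps=0.5, rng=rng) for _ in range(trials)) / trials
print(f"mean hitting time, base chain (eps=0.01):      {base:.1f}")
print(f"mean hitting time, sparse-boosted (eps=0.5):   {boosted:.1f}")
```

A first-step analysis of this toy chain gives expected hitting times of roughly 401 steps for eps=0.01 versus 9 for eps=0.5, which the simulation reproduces: the metastable bottleneck is the rarely-taken sparse edge, and any mechanism that upweights it (search, policy-gradient finetuning) collapses the waiting time.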