Demystifying Long Chain-of-Thought Reasoning
Authors: Shiming Yang, Yuxuan Tong, Xinyao Niu, Graham Neubig, Xiang Yue
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we systematically investigate the underlying mechanics of long CoT reasoning, examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings |
| Researcher Affiliation | Collaboration | Shiming Yang* (1, 2), Yuxuan Tong* (3), Xinyao Niu (1), Graham Neubig (4), Xiang Yue* (4). (1) IN.AI; (2) English name: Edward Yeo; (3) Tsinghua University, work started when interning at CMU; (4) Carnegie Mellon University. Correspondence to: Xiang Yue <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: N-gram Repetition Penalty; Algorithm 2: Action Prompting State Machine |
| Open Source Code | Yes | Our code is available at: https://github.com/eddycmu/demystify-long-cot. |
| Open Datasets | Yes | For both SFT and RL, we use the 7,500-sample prompt set of MATH (Hendrycks et al., 2021) training split by default, with which verifiable ground truth answers are provided. We focus on four representative reasoning benchmarks: MATH-500, AIME 2024, TheoremQA (Chen et al., 2023), and MMLU-Pro-1k (Wang et al., 2024a). We also discuss data like WebInstruct (Yue et al., 2024) that is more diverse but without gold supervision signals like ground truth answers in Section 5. |
| Dataset Splits | Yes | For both SFT and RL, we use the 7,500-sample prompt set of MATH (Hendrycks et al., 2021) training split by default, with which verifiable ground truth answers are provided. For efficiency, we adopt MATH-500, a widely-used i.i.d. subset of its test split. For efficiency, we adopt a 1,000-sample i.i.d. subset of its test split, called MMLU-Pro-1k. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. It mentions using 'FlashAttention-2' and the 'DeepSpeed library', which are typically associated with GPU usage, but no specific hardware model is identified. |
| Software Dependencies | No | The paper mentions several software components like 'OpenRLHF framework (Hu et al., 2024)', 'vLLM library', 'SymEval', 'FlashAttention-2 (Dao, 2024)', 'DeepSpeed library (Rasley et al., 2020)', 'SGLang (Zheng et al., 2024)', and 'Qwen2.5-7B-Instruct'. However, it does not provide specific version numbers for these libraries, frameworks, or programming languages used (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | 2.4. Training Setup: We adopt Llama-3.1-8B (Meta, 2024) and Qwen2.5-7B-Math (Qwen Team, 2024a) as the base models. For both SFT and RL, we use the 7,500-sample prompt set of MATH (Hendrycks et al., 2021) training split by default... We train the models with the OpenRLHF framework (Hu et al., 2024). 2.5. Evaluation: we evaluate the models using a temperature of t = 0.7, a top-p value of 0.95, and a maximum output length of 16,384 tokens. Table 7. SFT Hyperparameters: Batch Size 256, Context Length 128K, LR 5e-6, Epochs 2. Appendix E.5 contains detailed hyperparameters for various RL experiments in tables, including 'Base Model', 'Rewards', 'GAE', 'Episodes', 'Samples', 'BS', 'Epochs', 'Context Length', 'LR', 'KL'. |
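
For readers attempting a reproduction, the evaluation and SFT settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code: the class names, field names, and the `sampling_kwargs` helper are hypothetical, and reading "128K" as 128 × 1024 tokens is an assumption. Only the numeric values come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSamplingConfig:
    """Decoding settings quoted from Section 2.5 (names are hypothetical)."""
    temperature: float = 0.7
    top_p: float = 0.95
    max_output_tokens: int = 16_384


@dataclass(frozen=True)
class SFTConfig:
    """SFT hyperparameters quoted from Table 7 (names are hypothetical)."""
    batch_size: int = 256
    # Assumption: "128K" is read as 128 * 1024 tokens; the source does not say.
    context_length: int = 128 * 1024
    learning_rate: float = 5e-6
    epochs: int = 2


def sampling_kwargs(cfg: EvalSamplingConfig) -> dict:
    """Map the config to keyword names commonly used by sampling APIs.

    Assumption: the keys match what the reproducer's inference library expects
    (for example, vLLM's SamplingParams takes temperature, top_p, max_tokens).
    """
    return {
        "temperature": cfg.temperature,
        "top_p": cfg.top_p,
        "max_tokens": cfg.max_output_tokens,
    }


if __name__ == "__main__":
    print(sampling_kwargs(EvalSamplingConfig()))
    print(SFTConfig())
```

Because the paper reports no library versions or hardware, a reproduction would still need to pin those independently; the sketch only fixes the values that are explicitly stated.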