Demystifying Long Chain-of-Thought Reasoning

Authors: Shiming Yang, Yuxuan Tong, Xinyao Niu, Graham Neubig, Xiang Yue

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this study, we systematically investigate the underlying mechanics of long CoT reasoning, examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings.
Researcher Affiliation Collaboration Shiming Yang*1,2, Yuxuan Tong*3, Xinyao Niu1, Graham Neubig4, Xiang Yue*4. 1IN.AI. 2English name: Edward Yeo. 3Tsinghua University; work started while interning at CMU. 4Carnegie Mellon University. Correspondence to: Xiang Yue <EMAIL>.
Pseudocode Yes Algorithm 1 N-gram Repetition Penalty Algorithm 2 Action Prompting State Machine
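The paper presents its N-gram Repetition Penalty only as pseudocode (Algorithm 1). As a rough illustration of the general technique, the sketch below penalizes every n-gram occurrence beyond the first in a token sequence; the function name, the default window size n=4, and the penalty weight are hypothetical choices for illustration, not the paper's actual algorithm or hyperparameters.

```python
from collections import Counter

def ngram_repetition_penalty(tokens, n=4, weight=0.05):
    """Hypothetical sketch: return a non-positive reward term that
    penalizes repeated n-grams in a generated token sequence."""
    if len(tokens) < n:
        return 0.0
    # Slide a window of size n over the sequence and count each n-gram.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Each occurrence beyond the first counts as one repetition.
    repeats = sum(count - 1 for count in ngrams.values())
    return -weight * repeats
```

In an RL setup this term would typically be added to the outcome reward so that degenerate, looping chains of thought score lower than non-repetitive ones.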
Open Source Code Yes Our code is available at: https://github.com/eddycmu/demystify-long-cot.
Open Datasets Yes For both SFT and RL, we use the 7,500-sample prompt set of the MATH (Hendrycks et al., 2021) training split by default, for which verifiable ground-truth answers are provided. We focus on four representative reasoning benchmarks: MATH-500, AIME 2024, TheoremQA (Chen et al., 2023), and MMLU-Pro-1k (Wang et al., 2024a). We also discuss data like WebInstruct (Yue et al., 2024) that is more diverse but lacks gold supervision signals such as ground-truth answers, in Section 5.
Dataset Splits Yes For both SFT and RL, we use the 7,500-sample prompt set of the MATH (Hendrycks et al., 2021) training split by default, for which verifiable ground-truth answers are provided. For efficiency, we adopt MATH-500, a widely-used i.i.d. subset of its test split. Likewise, for MMLU-Pro we adopt a 1,000-sample i.i.d. subset of its test split, called MMLU-Pro-1k.
Hardware Specification No The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. It mentions using 'Flash Attention 2' and the 'DeepSpeed library', which are typically associated with GPU usage, but no specific hardware model is identified.
Software Dependencies No The paper mentions several software components, including the 'OpenRLHF framework (Hu et al., 2024)', the 'vLLM library', 'SymEval', 'Flash Attention 2 (Dao, 2024)', the 'DeepSpeed library (Rasley et al., 2020)', 'SGLang (Zheng et al., 2024)', and 'Qwen2.5-7B-Instruct'. However, it does not provide specific version numbers for these libraries, frameworks, or programming languages (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup Yes 2.4. Training Setup: We adopt Llama-3.1-8B (Meta, 2024) and Qwen2.5-Math-7B (Qwen Team, 2024a) as the base models. For both SFT and RL, we use the 7,500-sample prompt set of the MATH (Hendrycks et al., 2021) training split by default... We train the models with the OpenRLHF framework (Hu et al., 2024). 2.5. Evaluation: we evaluate the models using a temperature of t = 0.7, a top-p value of 0.95, and a maximum output length of 16,384 tokens. Table 7. SFT Hyperparameters: Batch Size 256, Context Length 128K, LR 5e-6, Epochs 2. Appendix E.5 contains detailed hyperparameters for the various RL experiments in tables, including 'Base Model', 'Rewards', 'GAE', 'Episodes', 'Samples', 'BS', 'Epochs', 'Context Length', 'LR', and 'KL'.
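The evaluation row above specifies nucleus (top-p) sampling with temperature 0.7 and top-p 0.95. As a minimal self-contained sketch of what that decoding rule does at each step, the function below applies temperature scaling, keeps the smallest set of tokens whose cumulative probability reaches top-p, and samples from that set. This is a generic illustration of the sampling scheme, not the paper's inference code (which uses vLLM/SGLang); the function name and the seeded RNG are assumptions for reproducible illustration.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.95, rng=None):
    """Generic nucleus-sampling step: returns the index of the sampled token."""
    rng = rng or random.Random(0)
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens covering top_p mass.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized nucleus.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a sharply peaked distribution the nucleus collapses to a single token, so decoding becomes effectively greedy; a flatter distribution keeps more candidates and produces more diverse reasoning traces.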