reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Demystifying Long Chain-of-Thought Reasoning

Authors: Shiming Yang, Yuxuan Tong, Xinyao Niu, Graham Neubig, Xiang Yue

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this study, we systematically investigate the underlying mechanics of long CoT reasoning, examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings
Researcher Affiliation	Collaboration	Shiming Yang* (1, 2), Yuxuan Tong* (3), Xinyao Niu (1), Graham Neubig (4), Xiang Yue* (4). Affiliations: (1) IN.AI; (2) English name: Edward Yeo; (3) Tsinghua University, work started when interning at CMU; (4) Carnegie Mellon University. Correspondence to: Xiang Yue <EMAIL>.
Pseudocode	Yes	Algorithm 1: N-gram Repetition Penalty; Algorithm 2: Action Prompting State Machine
Open Source Code	Yes	Our code is available at: https://github.com/eddycmu/demystify-long-cot.
Open Datasets	Yes	For both SFT and RL, we use the 7,500-sample prompt set of MATH (Hendrycks et al., 2021) training split by default, with which verifiable ground truth answers are provided. We focus on four representative reasoning benchmarks: MATH-500, AIME 2024, TheoremQA (Chen et al., 2023), and MMLU-Pro-1k (Wang et al., 2024a). We also discuss data like WebInstruct (Yue et al., 2024) that is more diverse but without gold supervision signals like ground truth answers in Section 5.
Dataset Splits	Yes	For both SFT and RL, we use the 7,500-sample prompt set of MATH (Hendrycks et al., 2021) training split by default, with which verifiable ground truth answers are provided. For efficiency, we adopt MATH-500, a widely-used i.i.d. subset of its test split. For efficiency, we adopt a 1,000-sample i.i.d. subset of its test split, called MMLU-Pro-1k.
Hardware Specification	No	The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. It mentions using 'FlashAttention 2' and the 'DeepSpeed library', which are typically associated with GPU usage, but no specific hardware model is identified.
Software Dependencies	No	The paper mentions several software components like 'OpenRLHF framework (Hu et al., 2024)', 'vLLM library', 'SymEval', 'FlashAttention 2 (Dao, 2024)', 'DeepSpeed library (Rasley et al., 2020)', 'SGLang (Zheng et al., 2024)', and 'Qwen2.5-7B-Instruct'. However, it does not provide specific version numbers for these libraries, frameworks, or programming languages used (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup	Yes	2.4. Training Setup: We adopt Llama-3.1-8B (Meta, 2024) and Qwen2.5-7B-Math (Qwen Team, 2024a) as the base models. For both SFT and RL, we use the 7,500-sample prompt set of MATH (Hendrycks et al., 2021) training split by default... We train the models with the OpenRLHF framework (Hu et al., 2024). 2.5. Evaluation: we evaluate the models using a temperature of t = 0.7, a top-p value of 0.95, and a maximum output length of 16,384 tokens. Table 7. SFT Hyperparameters: Batch Size 256, Context Length 128K, LR 5e-6, Epochs 2. Appendix E.5 contains detailed hyperparameters for various RL experiments in tables, including 'Base Model', 'Rewards', 'GAE', 'Episodes', 'Samples', 'BS', 'Epochs', 'Context Length', 'LR', 'KL'.
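
The evaluation settings quoted in the Experiment Setup row (temperature 0.7, top-p 0.95, maximum output length 16,384 tokens) can be illustrated with a small sketch of how temperature scaling and nucleus (top-p) filtering reshape a next-token distribution. This is not the paper's evaluation code, which is not reproduced on this page. The function name `top_p_filter`, the `EVAL_CONFIG` dictionary, and the toy logits are hypothetical and chosen only for illustration; only the three numeric settings come from the row above.

```python
import math

# Decoding settings quoted in the Experiment Setup row (Section 2.5 of the paper).
EVAL_CONFIG = {"temperature": 0.7, "top_p": 0.95, "max_tokens": 16384}


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and nucleus filtering.

    Hypothetical helper for illustration; not taken from the paper's code base.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -3.0]  # hypothetical values, for illustration only
nucleus = top_p_filter(toy_logits, EVAL_CONFIG["temperature"], EVAL_CONFIG["top_p"])
```

A sampler would then draw the next token from `nucleus` and stop once `max_tokens` tokens have been generated. A temperature below 1 sharpens the distribution before the cutoff is applied, so fewer low-probability tokens survive the filter than at temperature 1.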