Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

Authors: Alireza Amiri Bavandpour, Xinting Huang, Mark Rofin, Michael Hahn

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments: We trained transformers on PARITY on lengths between 1 and 500 (Figure 2 and Appendix C.1.2; results averaged across three runs). We considered the full CoT from Figure 1 (dots along the diagonal line). We also considered CoTs where only every k-th step of the CoT was provided, shortening the CoT to 1/k of its length (dots below the diagonal line). Across input lengths, the model is successful when k ≤ 3. We also verified that two LLMs (DeepSeek-R1, Guo et al. (2025), and o1-mini, Jaech et al. (2024)) produce CoTs of at least linear length (Figure 6).
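The PARITY-with-sparsified-CoT setup quoted above can be sketched as follows. This is an illustrative data-generation snippet, not the authors' code: the function name, the exact CoT representation (running prefix parities), and the rule for always emitting the final step are assumptions.

```python
import random

def parity_cot_example(n, k=1):
    """Generate a random PARITY input of length n together with a CoT
    that keeps only every k-th intermediate step (sketch; the paper's
    exact tokenization is assumed, not reproduced)."""
    bits = [random.randint(0, 1) for _ in range(n)]
    steps, parity = [], 0
    for i, b in enumerate(bits, start=1):
        parity ^= b  # running parity of the prefix bits[:i]
        if i % k == 0 or i == n:  # keep every k-th step, plus the final answer
            steps.append(parity)
    return bits, steps, parity

bits, cot, answer = parity_cot_example(8, k=2)
assert cot[-1] == answer == sum(bits) % 2
```

With k = 1 this yields the full diagonal-length CoT of Figure 1; larger k shortens the supervision to roughly n/k steps, matching the dots below the diagonal.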
Researcher Affiliation | Academia | ¹Sharif University of Technology (work done in part while interning at Saarland University); ²Saarland University. Correspondence to: Michael Hahn <EMAIL>.
Pseudocode | No | The paper describes algorithms such as the Schönhage-Strassen algorithm and breadth-first search but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | Code link: https://github.com/lacoco-lab/scratchpad_bounds
Open Datasets | No | The paper generates custom datasets for its experiments, such as random DAGs, random bitstrings, and random pairs of numbers for arithmetic problems, but does not provide specific access information (link, DOI, citation) for these generated datasets or for any other public datasets. For example: "Data generation. Each random DAG is generated by sampling a random lower triangular adjacency matrix and instantiating a DAG from it."
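The quoted DAG-generation procedure can be sketched in a few lines. The edge probability `p` and the function name are assumptions; the quote only specifies the lower-triangular sampling step, which makes the graph acyclic by construction (every edge goes from a higher-index node to a lower-index one).

```python
import random

def random_dag(n, p=0.5, seed=None):
    """Sample a random DAG on n nodes by drawing a random strictly
    lower-triangular adjacency matrix (sketch of the quoted procedure;
    the edge probability p is an assumption)."""
    rng = random.Random(seed)
    # adj[i][j] == 1 means an edge i -> j; only entries with j < i may be set,
    # so no cycles can arise.
    return [[1 if j < i and rng.random() < p else 0 for j in range(n)]
            for i in range(n)]

adj = random_dag(5, seed=0)
assert all(adj[i][j] == 0 for i in range(5) for j in range(5) if j >= i)
```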
Dataset Splits | Yes | For the CoT multiplication experiments, we used the following training configuration: number of training epochs: 20; batch size (per device): 8 for both training and evaluation; gradient accumulation steps: 4; learning rate: 0.0001.
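The quoted hyperparameters can be collected into a small config object. The class and field names here are illustrative, not the authors'; only the numeric values come from the row above.

```python
from dataclasses import dataclass

@dataclass
class CotMultiplicationConfig:
    # Values quoted in the row above; field names are illustrative.
    num_train_epochs: int = 20
    per_device_batch_size: int = 8      # same for training and evaluation
    gradient_accumulation_steps: int = 4
    learning_rate: float = 1e-4

    @property
    def effective_batch_size(self) -> int:
        # per-device batch size x accumulation steps (single device assumed)
        return self.per_device_batch_size * self.gradient_accumulation_steps

cfg = CotMultiplicationConfig()
assert cfg.effective_batch_size == 32
```

Note that with gradient accumulation, the effective optimization batch size (32 here, assuming a single device) differs from the per-device batch size of 8.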
Hardware Specification | Yes | All experiments were conducted on NVIDIA A100 GPUs (40GB memory each).
Software Dependencies | No | The paper mentions using architectures such as the GPT-2 architecture and the BART architecture but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We trained a 2-layer 2-head transformer using the GPT-2 architecture (Radford et al., 2019b), with d = 256. Each model was trained at a fixed input and CoT length. Each training and testing input was generated randomly on the fly. Training was for 30K steps, at a batch size of 64, with AdamW (learning rate 1e-4).
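For scale, the quoted setup can be summarized and its size estimated. The dictionary keys and the parameter-count formula (roughly 4d² for attention plus 8d² for the MLP per layer, ignoring embeddings and biases) are our own back-of-envelope sketch, not from the paper.

```python
# Hyperparameters quoted above; key names are illustrative.
setup = {
    "architecture": "GPT-2",   # Radford et al., 2019b
    "n_layer": 2,
    "n_head": 2,
    "d_model": 256,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 64,
    "train_steps": 30_000,
}

# Rough non-embedding parameter count: ~4*d^2 (attention) + ~8*d^2 (MLP)
# per layer, summed over layers.
approx_params = 12 * setup["d_model"] ** 2 * setup["n_layer"]
print(f"~{approx_params / 1e6:.1f}M non-embedding parameters")
```

This puts the model at well under 2M non-embedding parameters, i.e., a deliberately small transformer by current standards.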