Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

Authors: Alireza Amiri Bavandpour, Xinting Huang, Mark Rofin, Michael Hahn

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments: We trained transformers on PARITY on lengths between 1 and 500 (Figure 2 and Appendix C.1.2; results averaged across three runs). We considered the full CoT from Figure 1 (dots along the diagonal line). We also considered CoTs where only every k-th step of the CoT was provided, shortening the CoT to 1/k of its length (dots below the diagonal line). Across input lengths, the model is successful when k ≤ 3. We also verified that two LLMs (DeepSeek-R1, Guo et al. (2025), and o1-mini, Jaech et al. (2024)) produce CoTs of at least linear length (Figure 6).
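The PARITY-with-sparsified-CoT setup quoted above can be sketched as follows. This is an illustrative data-generation snippet, not the authors' code: the function name, the exact CoT representation (running prefix parities), and the rule for always emitting the final step are assumptions.

```python
import random

def parity_cot_example(n, k=1):
    """Generate a random PARITY input of length n together with a CoT
    that keeps only every k-th intermediate step (sketch; the paper's
    exact tokenization is assumed, not reproduced)."""
    bits = [random.randint(0, 1) for _ in range(n)]
    steps, parity = [], 0
    for i, b in enumerate(bits, start=1):
        parity ^= b  # running parity of the prefix bits[:i]
        if i % k == 0 or i == n:  # keep every k-th step, plus the final answer
            steps.append(parity)
    return bits, steps, parity

bits, cot, answer = parity_cot_example(8, k=2)
assert cot[-1] == answer == sum(bits) % 2
```

With k = 1 this yields the full diagonal-length CoT of Figure 1; larger k shortens the supervision to roughly n/k steps, matching the dots below the diagonal.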
Researcher Affiliation | Academia | ¹Sharif University of Technology (work done in part while interning at Saarland University); ²Saarland University. Correspondence to: Michael Hahn <EMAIL>.
Pseudocode | No | The paper describes algorithms such as the Schönhage-Strassen algorithm and breadth-first search but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | Code link: https://github.com/lacoco-lab/scratchpad_bounds
Open Datasets | No | The paper generates custom datasets for its experiments, such as random DAGs, random bitstrings, and random pairs of numbers for arithmetic problems, but does not provide specific access information (link, DOI, citation) for these generated datasets or for any other public datasets. For example: "Data generation. Each random DAG is generated by sampling a random lower triangular adjacency matrix and instantiating a DAG from it."
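The quoted DAG-generation procedure can be sketched in a few lines. The edge probability `p` and the function name are assumptions; the quote only specifies the lower-triangular sampling step, which makes the graph acyclic by construction (every edge goes from a higher-index node to a lower-index one).

```python
import random

def random_dag(n, p=0.5, seed=None):
    """Sample a random DAG on n nodes by drawing a random strictly
    lower-triangular adjacency matrix (sketch of the quoted procedure;
    the edge probability p is an assumption)."""
    rng = random.Random(seed)
    # adj[i][j] == 1 means an edge i -> j; only entries with j < i may be set,
    # so no cycles can arise.
    return [[1 if j < i and rng.random() < p else 0 for j in range(n)]
            for i in range(n)]

adj = random_dag(5, seed=0)
assert all(adj[i][j] == 0 for i in range(5) for j in range(5) if j >= i)
```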
Dataset Splits | Yes | For the CoT multiplication experiments, we used the following training configuration: number of training epochs: 20; batch size (per device): 8 for both training and evaluation; gradient accumulation steps: 4; learning rate: 0.0001.
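The quoted hyperparameters can be collected into a small config object. The class and field names here are illustrative, not the authors'; only the numeric values come from the row above.

```python
from dataclasses import dataclass

@dataclass
class CotMultiplicationConfig:
    # Values quoted in the row above; field names are illustrative.
    num_train_epochs: int = 20
    per_device_batch_size: int = 8      # same for training and evaluation
    gradient_accumulation_steps: int = 4
    learning_rate: float = 1e-4

    @property
    def effective_batch_size(self) -> int:
        # per-device batch size x accumulation steps (single device assumed)
        return self.per_device_batch_size * self.gradient_accumulation_steps

cfg = CotMultiplicationConfig()
assert cfg.effective_batch_size == 32
```

Note that with gradient accumulation, the effective optimization batch size (32 here, assuming a single device) differs from the per-device batch size of 8.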
Hardware Specification | Yes | All experiments were conducted on NVIDIA A100 GPUs (40GB memory each).
Software Dependencies | No | The paper mentions using architectures such as the GPT-2 architecture and the BART architecture but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We trained a 2-layer 2-head transformer using the GPT-2 architecture (Radford et al., 2019b), with d = 256. Each model was trained at a fixed input and CoT length. Each training and testing input was generated randomly on the fly. Training was for 30K steps, at a batch size of 64, with AdamW (learning rate 1e-4).
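For scale, the quoted setup can be summarized and its size estimated. The dictionary keys and the parameter-count formula (roughly 4d² for attention plus 8d² for the MLP per layer, ignoring embeddings and biases) are our own back-of-envelope sketch, not from the paper.

```python
# Hyperparameters quoted above; key names are illustrative.
setup = {
    "architecture": "GPT-2",   # Radford et al., 2019b
    "n_layer": 2,
    "n_head": 2,
    "d_model": 256,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 64,
    "train_steps": 30_000,
}

# Rough non-embedding parameter count: ~4*d^2 (attention) + ~8*d^2 (MLP)
# per layer, summed over layers.
approx_params = 12 * setup["d_model"] ** 2 * setup["n_layer"]
print(f"~{approx_params / 1e6:.1f}M non-embedding parameters")
```

This puts the model at well under 2M non-embedding parameters, i.e., a deliberately small transformer by current standards.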