Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers
Authors: Alireza Amiri Bavandpour, Xinting Huang, Mark Rofin, Michael Hahn
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments: We trained transformers on PARITY on lengths between 1 and 500 (Figure 2 and Appendix C.1.2; results averaged across three runs). We considered the full CoT from Figure 1 (dots along the diagonal line). We also considered CoTs where only every k-th step of the CoT was provided, shortening the CoT to 1/k of its length (dots below the diagonal line). Across input lengths, the model is successful when k ≤ 3. We also verified that two LLMs (DeepSeek-R1, Guo et al. (2025), and o1-mini, Jaech et al. (2024)) produce CoTs of at least linear length (Figure 6). |
| Researcher Affiliation | Academia | ¹Sharif University of Technology; work done in part while interning at Saarland University. ²Saarland University. Correspondence to: Michael Hahn <EMAIL>. |
| Pseudocode | No | The paper describes algorithms like the Schönhage-Strassen algorithm and breadth-first search but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code link: https://github.com/lacoco-lab/scratchpad_bounds |
| Open Datasets | No | The paper generates custom datasets for its experiments, such as 'random DAGs' and 'random bitstrings' or 'random pairs of numbers' for arithmetic problems, but does not provide specific access information (link, DOI, citation) to these generated datasets or any other public datasets used. For example: 'Data generation. Each random DAG is generated by sampling a random lower triangular adjacency matrix and instantiating a DAG from it.' |
| Dataset Splits | Yes | For the CoT multiplication experiments, we used the following training configuration: number of training epochs: 20; batch size (per device): 8 for both training and evaluation; gradient accumulation steps: 4; learning rate: 0.0001. |
| Hardware Specification | Yes | All experiments were conducted on NVIDIA A100 GPUs (40GB memory each). |
| Software Dependencies | No | The paper mentions using architectures like 'GPT-2 architecture' and 'BART architecture' but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We trained a 2-layer 2-head transformer using the GPT-2 architecture (Radford et al., 2019b), with d = 256. Each model was trained at a fixed input and CoT length. Each training and testing input was generated randomly on the fly. Training was for 30K steps, at a batch size of 64, with AdamW (learning rate 1e-4). |
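The on-the-fly PARITY data generation with a shortened chain-of-thought (only every k-th step kept), as described in the Experiments row, can be sketched as follows. This is a hypothetical illustration, not the authors' released code; the function name, tuple format, and seeding are assumptions.

```python
import random

def parity_example(n, k=1, seed=None):
    """Generate a random length-n bitstring, a chain-of-thought keeping
    every k-th running prefix parity (k=1 gives the full CoT), and the
    final answer (parity of the whole string)."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n)]
    prefix = 0
    cot = []
    for i, b in enumerate(bits, start=1):
        prefix ^= b          # running parity of the first i bits
        if i % k == 0:       # keep only every k-th CoT step
            cot.append(prefix)
    return bits, cot, prefix

bits, cot, answer = parity_example(16, k=2, seed=0)
# With k=2 the CoT has half as many steps as the full k=1 CoT.
```

Larger k shortens the CoT further; the paper reports that training succeeds only up to a small constant k, consistent with its lower bounds on CoT length.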