Fundamental Limitations on Subquadratic Alternatives to Transformers
Authors: Josh Alman, Hantao Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time (whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason) cannot perform this task. |
| Researcher Affiliation | Academia | Josh Alman, Department of Computer Science, Columbia University, New York, NY 10027, USA, EMAIL; Hantao Yu, Department of Computer Science, Columbia University, New York, NY 10027, USA, EMAIL |
| Pseudocode | No | The paper describes theoretical proofs and mathematical constructions but does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described using mathematical notation and natural language. |
| Open Source Code | No | The paper does not contain any statements about open-sourcing code, providing a repository link, or including code in supplementary materials. |
| Open Datasets | No | This is a theoretical paper focusing on computational complexity and proofs, not empirical evaluation using specific datasets. While it discusses 'document similarity tasks' and 'bag-of-words embedding' as concepts, it does not use or provide access to any specific datasets for experiments. |
| Dataset Splits | No | As this is a theoretical paper and does not conduct experiments with empirical data, there is no mention of dataset splits for training, validation, or testing. |
| Hardware Specification | No | This paper is theoretical and focuses on proofs and complexity analysis. It does not describe any experimental setup or the specific hardware used for computations. |
| Software Dependencies | No | This is a theoretical paper, and as such, it does not detail any software dependencies or versions used for conducting experiments or implementing models. |
| Experiment Setup | No | This is a theoretical paper presenting proofs and complexity results, not empirical experiments. Therefore, it does not provide details about an experimental setup, including hyperparameters or training configurations. |
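The document similarity task at the heart of the paper (find the approximately most similar pair among many documents) can be illustrated with a brute-force baseline. This is a hedged sketch, not the paper's construction: the function names, the use of bag-of-words count vectors with inner-product similarity, and the whitespace tokenization are our own illustrative choices; the paper only discusses the task and its bag-of-words embedding abstractly. The point of the sketch is the quadratic pairwise loop, which is exactly the cost that the paper proves (under a fine-grained complexity conjecture) no truly subquadratic model can avoid.

```python
from collections import Counter
from itertools import combinations

def bag_of_words(doc, vocab):
    """Map a document to its bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def most_similar_pair(docs):
    """Brute-force most-similar pair: compares all n*(n-1)/2 pairs of
    documents, hence runs in quadratic time in the number of documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    vecs = [bag_of_words(d, vocab) for d in docs]
    best_pair, best_score = None, float("-inf")
    for i, j in combinations(range(len(docs)), 2):
        # Inner product of count vectors as a simple similarity score.
        score = sum(a * b for a, b in zip(vecs[i], vecs[j]))
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score
```

For example, `most_similar_pair(["a b c", "a b d", "x y z"])` returns the pair `(0, 1)`, since the first two documents share two words while the third shares none.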