Positional Attention: Expressivity and Learnability of Algorithmic Computation
Authors: Artur Back De Luca, George Giapitzakis, Shenghao Yang, Petar Veličković, Kimon Fountoulakis
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We prove that Transformers with positional attention (positional Transformers) maintain the same expressivity of parallel computational models... Finally, we empirically explore the out-of-distribution performance of positional Transformers and find that they perform well in tasks where their underlying algorithmic solution relies on positional information." and "7. Experiments. In this section, we evaluate the performance of positional and standard Transformers across various algorithmic tasks and regimes." |
| Researcher Affiliation | Collaboration | "1David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada. 2Google DeepMind, London, UK." |
| Pseudocode | Yes | "Input: Data = (Data1, . . . , Datan) distributed across the memories of n machines, labeled in [n]. An oracle RCVn,Data : [R] × [n] → [n] × (P([s]) \ {∅}). 1 For each round r = 1, . . . , R do 2 Each machine i simultaneously queries RCVn,Data with (r, i) as input and receives data from the machines and memory positions returned by the oracle. 3 The machines simultaneously perform local computations on the received data and write the results in their local memories." |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release, a direct link to a code repository for the methodology described, or a clear statement that code is provided in supplementary material. |
| Open Datasets | No | "The training data is sampled in the range [−2, 2]. To ensure diversity in the data, for each input sample, we first select lower and upper bounds l and u uniformly in [−2, 2], and then for each of the n elements of the input sample, we select its value uniformly in [l, u]." and similar descriptions for other tasks. No concrete access information for datasets is provided. |
| Dataset Splits | Yes | "We train models for each fixed length n ∈ {2, 4, 8, 16, 32} using 30,000 samples across all settings... We report the performance in-distribution and out-of-distribution using 1,000 test samples." (from Section D.1.2) and "We train models with 500,000 samples and ensure that all input lengths are equally represented. We then evaluate the in-distribution (ID) and out-of-distribution (OOD) loss across different scale factors c ∈ {1, 2, . . . , 10}. The losses reported are calculated using 3,000 samples for each scale." (from Section D.1.4). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Adam' for optimization but does not provide specific software dependencies like programming languages, libraries, or frameworks with version numbers. |
| Experiment Setup | Yes | "Both models are trained end-to-end using the squared loss between the predicted and target vectors of size n, with no intermediate supervision. We train models with Adam, starting with a learning rate of 5 × 10⁻⁴ and a learning rate scheduler for a total of 2000 epochs." (from Section D.1.1). Additional details on layers, attention heads, MLP configurations, embedding dimensions, batch size, etc., are also provided across experimental setting sections. |
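The data-generation procedure quoted under "Open Datasets" (draw lower and upper bounds uniformly in [−2, 2], then draw each of the n input values uniformly between those bounds) can be sketched as follows. This is a minimal, hedged reconstruction: the function name, the use of NumPy, and the choice to sort the two bounds so that l ≤ u are all assumptions, not the paper's released code.

```python
import numpy as np

def sample_inputs(n, num_samples, low=-2.0, high=2.0, rng=None):
    """Sketch of the sampling scheme described in the paper's Appendix D:
    per sample, draw bounds l, u uniformly in [low, high], then draw the
    n element values uniformly in [l, u]. Sorting the bounds so l <= u is
    an assumption made here for well-definedness."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty((num_samples, n))
    for i in range(num_samples):
        l, u = np.sort(rng.uniform(low, high, size=2))  # per-sample bounds
        samples[i] = rng.uniform(l, u, size=n)          # n values in [l, u]
    return samples
```

Because each sample gets its own bounds, the resulting dataset mixes narrow-range and wide-range inputs, which is the "diversity" the quoted passage refers to.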