Linear Transformer Topological Masking with Graph Random Features
Authors: Isaac Reid, Kumar Dubey, Deepali Jain, William Whitney, Amr Ahmed, Joshua Ainslie, Alex Bewley, Mithun George Jacob, Aranyak Mehta, David Rendleman, Connor Schenck, Richard E Turner, René Wagner, Adrian Weller, Krzysztof Choromanski
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate strong accuracy gains on image data, as well as for modelling the dynamics of massive point clouds (> 30k particles) in robotics applications where efficiency is essential. ... In this section, we test our algorithms for topological masking with GRFs. We consider data modalities with different graph topologies: images and point clouds. ... Table 1 shows the final test accuracies for ImageNet (Deng et al., 2009), iNaturalist2021 (Horn et al., 2018) and Places365 (Zhou et al., 2018). |
| Researcher Affiliation | Collaboration | 1University of Cambridge, 2Google Research, 3Google DeepMind, 4Alan Turing Institute, 5Columbia University |
| Pseudocode | Yes | Alg. 1 presents our method. ... Algorithm 1 O(N) topologically-masked attention for general graphs |
| Open Source Code | No | Reproducibility statement: We have made every effort to ensure the work's reproducibility. The core algorithm is presented clearly in Alg. 1. |
| Open Datasets | Yes | Table 1 shows the final test accuracies for ImageNet (Deng et al., 2009), iNaturalist2021 (Horn et al., 2018) and Places365 (Zhou et al., 2018). ... We train and evaluate on the Kinetics 400 benchmark (Kay et al., 2017). |
| Dataset Splits | No | The paper mentions standard datasets like ImageNet, iNaturalist2021, Places365, and Kinetics 400 but does not explicitly state the dataset splits (e.g., percentages or sample counts) used for training, validation, or testing in the main text or supplementary tables. |
| Hardware Specification | No | For a hardware-agnostic comparison, we first compute the total number of FLOPs for evaluating (i) unmasked softmax, (ii) unmasked linear and (iii) GRF-masked linear attention for graphs of different sizes N. ... The paper does not explicitly mention specific hardware (e.g., GPU/CPU models, TPUs) used for running the experiments. It refers to a previous work (Whitney et al., 2024) for implementation details in one section, but does not specify hardware within this document. |
| Software Dependencies | No | The paper mentions using the AdamW optimiser (Loshchilov, 2017) and deep learning frameworks implicitly (e.g., PyTorch for ViT) but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Table 2: Architecture, hyperparameters and training details for ViT experiments. Num. layers 12; Num. heads 12; Num. patches 16; Hidden size 768; MLP dim. 3072; Optimiser Adam; Epochs 90; Base learning rate 3×10⁻³; Final learning rate 1×10⁻⁵; Learning rate schedule: linear warmup (10⁴ steps), constant, cosine decay; Batch size 4096 ... all models are trained with a batch size of 16; we use the AdamW optimiser (Loshchilov, 2017) with weight decay 10⁻³, clipping the gradient norm to 0.01; models are trained with 6 step rollouts, with losses computed on 128 sampled rays; |
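The core mechanism the Pseudocode row refers to, O(N) topologically-masked linear attention, relies on the fact that a low-rank mask can be folded into the attention features. The sketch below is a hypothetical NumPy illustration, not the paper's Alg. 1: it assumes the softmax kernel has already been replaced by nonnegative features `q_feat`/`k_feat` (Performer-style), and that `psi` holds graph random features whose Gram matrix `psi @ psi.T` approximates the desired topological mask.

```python
import numpy as np

def grf_masked_linear_attention(q_feat, k_feat, V, psi):
    """Linear attention with a low-rank topological mask (illustrative sketch).

    q_feat, k_feat : (N, F) nonnegative kernel features approximating softmax attention
    psi            : (N, m) graph random features; psi @ psi.T approximates the mask
    V              : (N, d) value vectors
    """
    N = len(q_feat)
    # Per-token outer product folds the mask into the attention features:
    # (phi_i ⊗ psi_i)·(phi_j ⊗ psi_j) = (phi_i·phi_j)(psi_i·psi_j)
    qm = np.einsum('nf,nm->nfm', q_feat, psi).reshape(N, -1)
    km = np.einsum('nf,nm->nfm', k_feat, psi).reshape(N, -1)
    num = qm @ (km.T @ V)        # right-to-left: never materialise the N×N matrix
    den = qm @ km.sum(axis=0)    # row-wise normaliser
    return num / den[:, None]
```

The key point is the bracketing: `km.T @ V` is (F·m, d), so cost scales linearly in N rather than quadratically, which is what makes masked attention over >30k-particle point clouds feasible. The result matches dense masked attention `((q_feat @ k_feat.T) * (psi @ psi.T)) @ V`, row-normalised, up to floating-point error.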