Low-Rank Thinning

Authors: Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester Mackey

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To gauge the practical effectiveness of Alg. 1, we recreate the benchmark Tokens-To-Token Vision Transformer (T2T-ViT) and BigGAN image generation experiments of Zandieh et al. (2023). In the T2T-ViT experiment, attention approximations are scored on their ImageNet classification accuracy and computational expense when used as drop-in replacements for the two most expensive attention layers in a pretrained T2T-ViT neural network (Yuan et al., 2021). In the BigGAN experiment, approximations are scored on their computational expense and two popular measures of image generation quality, the Fréchet Inception Distance (FID, Heusel et al., 2017) and Inception Score (IS, Salimans et al., 2016). Using the exact implementations and settings provided by Zandieh et al. (2023), we benchmark our PyTorch implementation of Thinformer against exact attention and four leading attention approximations: Performer (Choromanski et al., 2021), Reformer (Kitaev et al., 2020), ScatterBrain (Chen et al., 2021), and KDEformer. In Tab. 3, we find that Thinformer (g = 2) provides the highest Top-1 accuracy on the ImageNet 2012 validation set (Russakovsky et al., 2015), while running faster than all of the alternatives. In Tab. 4, Thinformer (g = 2) yields better FID and IS than all of the alternatives while running significantly faster than exact attention, KDEformer, Reformer, and ScatterBrain.
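The quoted protocol scores each attention approximation on accuracy and wall-clock cost when swapped in for exact attention. A minimal, stdlib-only sketch of that kind of drop-in evaluation is below; `subsampled_attention` is a hypothetical uniform-subsampling baseline standing in for an approximation method, not the Thinformer algorithm itself, and all names here are illustrative.

```python
import math
import random
import time

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def exact_attention(Q, K, V):
    """Exact softmax attention: softmax(Q K^T / sqrt(d)) V, on lists of rows."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def subsampled_attention(Q, K, V, m, seed=0):
    """Hypothetical baseline: attend over a uniform subsample of m key/value
    pairs. A stand-in for a real approximation, NOT the paper's method."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(K)), m)
    return exact_attention(Q, [K[i] for i in idx], [V[i] for i in idx])

if __name__ == "__main__":
    rng = random.Random(1)
    n, d = 256, 16
    Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    # Score the approximation the way the benchmark does: runtime and error.
    t0 = time.perf_counter(); full = exact_attention(Q, K, V)
    t_full = time.perf_counter() - t0
    t0 = time.perf_counter(); approx = subsampled_attention(Q, K, V, m=64)
    t_approx = time.perf_counter() - t0
    err = max(abs(a - b) for ra, rb in zip(full, approx) for a, b in zip(ra, rb))
    print(f"exact: {t_full:.3f}s  approx: {t_approx:.3f}s  max abs error: {err:.3f}")
```

In the actual benchmark, accuracy is measured downstream (Top-1, FID, IS) rather than by entrywise error, but the time-versus-fidelity trade-off is the same quantity being reported in Tabs. 3 and 4.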
Researcher Affiliation | Collaboration | 1University of Cambridge, 2Cornell Tech, 3MIT, 4Microsoft Research New England. Correspondence to: Annabelle Carrell <EMAIL>, Albert Gong <EMAIL>, Abhishek Shetty <EMAIL>, Raaz Dwivedi <EMAIL>, Lester Mackey <EMAIL>.
Pseudocode | Yes | Algorithm 1: Thinformer
Open Source Code | Yes | We provide PyTorch code replicating this experiment at https://github.com/microsoft/thinformer and supplementary experiment details in App. L.1. See https://github.com/microsoft/khsgd for PyTorch code replicating this experiment and App. L.2 for supplementary experiment details. See https://github.com/microsoft/deepctt for PyTorch code replicating this experiment and App. L.3 for supplementary experiment details.
Open Datasets | Yes | "In Tab. 3, we find that Thinformer (g = 2) provides the highest Top-1 accuracy on the ImageNet 2012 validation set (Russakovsky et al., 2015), while running faster than all of the alternatives." "when we recreate the Home Mortgage Disclosure Act logistic regression experiment of Cooper et al. (2023) with a single worker (Fig. 1)" "To evaluate the practical utility of deep kernel CTT, we follow the Higgs mixture experiment of Domingo-Enrich et al. (2023, Sec. 5)"
Dataset Splits | Yes | In Tab. 3, we find that Thinformer (g = 2) provides the highest Top-1 accuracy on the ImageNet 2012 validation set (Russakovsky et al., 2015), while running faster than all of the alternatives.
Hardware Specification | Yes | The experiment of Tab. 3 was carried out using Python 3.12.9, PyTorch 2.8.0.dev20250407+cu128 (Paszke et al., 2019), and an Ubuntu 22.04.5 LTS server with an AMD EPYC 7V13 64-Core Processor, 220 GB RAM, and a single NVIDIA A100 GPU (80 GB memory, CUDA 12.8, driver version 570.124.04). The experiment of Tab. 4 was carried out using Python 3.12.9, PyTorch 2.6.0, and an Ubuntu 22.04.5 LTS server with an Intel(R) Xeon(R) Gold 5218 CPU Processor, 100 GB RAM, and a single NVIDIA A6000 GPU (48 GB memory, CUDA 12.1, driver version 530.30.02).
Software Dependencies | Yes | The experiment of Tab. 3 was carried out using Python 3.12.9, PyTorch 2.8.0.dev20250407+cu128 (Paszke et al., 2019), and an Ubuntu 22.04.5 LTS server with an AMD EPYC 7V13 64-Core Processor, 220 GB RAM, and a single NVIDIA A100 GPU (80 GB memory, CUDA 12.8, driver version 570.124.04). The experiment of Tab. 4 was carried out using Python 3.12.9, PyTorch 2.6.0, and an Ubuntu 22.04.5 LTS server with an Intel(R) Xeon(R) Gold 5218 CPU Processor, 100 GB RAM, and a single NVIDIA A6000 GPU (48 GB memory, CUDA 12.1, driver version 530.30.02).
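The hardware and software rows above record exactly the details (Python and PyTorch versions, OS, CPU, GPU, CUDA) a reader needs to reproduce the timings. A minimal sketch of collecting such an environment report follows; the GPU fields assume `torch` is installed and are skipped otherwise, and the function name is illustrative.

```python
import platform

def environment_report():
    """Gather the environment details a hardware/software specification
    like the one above records (stdlib only; torch fields are optional)."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "processor": platform.processor() or platform.machine(),
    }
    try:  # torch is optional; omit GPU details if it is not installed
        import torch
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda
        if torch.cuda.is_available():
            info["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return info

if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Logging such a report alongside each table makes runtime comparisons like Tabs. 3 and 4 interpretable across machines.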
Experiment Setup | Yes | Table L.1: Configurations for the attention approximation methods of Tab. 3. Table L.2: Configurations for the attention approximation methods of Tab. 4. Optimization was carried out with a learning rate of α = 0.01, datapoints were loaded in batches of size 16, and stochastic gradients were reordered for each datapoint individually. Each test is run with replication count B = 100, nominal level α = 0.05, and failure probability δ = 0.5. The neural network ϕ was trained exactly as in Liu et al. (2020) (with learning rate 5 × 10^-5 and batch size equal to the full training sample size), and runtime measurements exclude the time required to train ϕ.
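The testing setup above runs B = 100 replications of a hypothesis test at nominal level α = 0.05 and reports the rejection rate. A generic sketch of that replication protocol is below, using a simple mean-difference permutation test as a stand-in for the paper's kernel tests; both function names and the data-generating choices are illustrative assumptions, not the paper's configuration.

```python
import random
import statistics

def permutation_test(x, y, n_perm=200, seed=0):
    """Two-sample permutation test on |mean(x) - mean(y)|. A generic
    placeholder test, NOT the paper's deep kernel test."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(x) - statistics.mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled sample
        d = abs(statistics.mean(pooled[:len(x)]) - statistics.mean(pooled[len(x):]))
        if d >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # p-value with add-one correction

def rejection_rate(B=100, alpha=0.05, n=50, shift=0.0, seed=0):
    """Fraction of B replications rejecting at level alpha, mirroring the
    'replication count B = 100, nominal level alpha = 0.05' setup above."""
    rng = random.Random(seed)
    rejections = 0
    for b in range(B):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(shift, 1.0) for _ in range(n)]  # shift = 0 is the null
        if permutation_test(x, y, seed=b) < alpha:
            rejections += 1
    return rejections / B
```

Under the null (shift = 0) the rejection rate should hover near the nominal level α, while under an alternative (shift > 0) it estimates the test's power, which is the quantity such replication studies report.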