ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans
Authors: Ashkan Shahbazi, Elaheh Akbari, Darian Salehi, Xinran Liu, Navid Naderializadeh, Soheil Kolouri
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Vanderbilt University, Nashville, TN, USA. 2Department of Computer Science, Duke University, Durham, NC, USA. 3Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA. 4Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN, USA. |
| Pseudocode | Yes | The overall pipeline of ESPFormer can be found in Algorithm 1. For completeness, we also report results from explicitly learning the slices and observe no significant benefits compared to axis-aligned slices. Algorithm 1 (ESPFormer's Doubly-Stochastic Attention). Input: query matrix Q ∈ ℝ^{m×N}, key matrix K ∈ ℝ^{m×N}, value matrix V ∈ ℝ^{d×N}, SoftSort hyperparameter t, and inverse-temperature hyperparameter τ. Output: attention-weighted output matrix. 1: Calculate the pairwise distance matrix [C]_{ij} = ‖Q_{:i} − K_{:j}‖². 2: for l = 1 to m do. 3: SoftSort the projected samples using (7): A_l = SoftSort_t(Q_{l:}), B_l = SoftSort_t(K_{l:}). 4: Calculate the transport plan U_l = (1/N) A_l^⊤ B_l. 5: Calculate D_l = Σ_{ij} [C]_{ij} [U_l]_{ij}. 6: end for. 7: Calculate σ_τ = softmax(D; τ). 8: Aggregate the plans from all slices: G = Σ_{l=1}^{m} σ_{τ,l} U_l. 9: Return V G. |
| Open Source Code | Yes | Our implementation code can be found at https://github.com/dariansal/ESPFormer. |
| Open Datasets | Yes | The ModelNet40 dataset (Wu et al., 2015) comprises 40 widely recognized 3D object categories... We next evaluate ESPFormer on the IMDB dataset (Maas et al., 2011) for sentiment analysis. ... We additionally evaluate ESPFormer on the TweetEval sentiment dataset (Barbieri et al., 2020)... trained on the IWSLT'14 German-to-English dataset (Cettolo et al., 2014)... conducted experiments on the Cats and Dogs dataset (Kaggle, 2013)... trained Transformer, Diff Transformer, Sinkformer, and ESPFormer on the MNIST dataset (LeCun, Bengio, and Haffner, 1998). |
| Dataset Splits | Yes | To evaluate the generalizability of the models under limited data scenarios, we conducted experiments on the Cats and Dogs dataset (Kaggle, 2013) using varying fractions of the training data: 1%, 10%, 25%, and 100%. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used for running its experiments (e.g., GPU models, CPU models, or memory details). |
| Software Dependencies | No | The paper mentions software like 'fairseq' and the 'Adam optimizer' (Kingma & Ba, 2015), and implicitly 'numpy' through code snippets, but it does not provide specific version numbers for these components to ensure reproducibility. For example, it mentions 'fairseq sequence modeling toolkit (Ott et al., 2019)' but not its version. |
| Experiment Setup | Yes | The training procedure employs a batch size of 64 and utilizes the Adam optimizer (Kingma & Ba, 2015). The network is trained for 300 epochs, with an initial learning rate of 10⁻³, which is reduced by a factor of 10 after 200 epochs. Table 8. Hyperparameters used in training for Set Transformers on the ModelNet40 dataset. |
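As a rough illustration of the pseudocode quoted above, the pipeline of Algorithm 1 can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the `soft_sort` helper is a simplified stand-in for the paper's SoftSort operator, and the slice-weighting `softmax(D; τ)` is read literally from the pseudocode (its sign convention is an assumption).

```python
import numpy as np

def soft_sort(s, t=0.1):
    # Row-stochastic relaxation of the sorting permutation:
    # row i of P concentrates on the entry of s with the i-th largest value.
    s_sorted = np.sort(s)[::-1]                           # descending sort
    logits = -np.abs(s_sorted[:, None] - s[None, :]) / t
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def esp_attention(Q, K, V, t=0.1, tau=1.0):
    # Q, K: (m, N) projected queries/keys; V: (d, N) values.
    m, N = Q.shape
    # Pairwise distance matrix [C]_ij = ||Q_:i - K_:j||^2
    C = ((Q.T[:, None, :] - K.T[None, :, :]) ** 2).sum(-1)
    U = np.empty((m, N, N))
    D = np.empty(m)
    for l in range(m):
        A = soft_sort(Q[l], t)
        B = soft_sort(K[l], t)
        U[l] = A.T @ B / N                # transport plan for slice l
        D[l] = (C * U[l]).sum()           # cost of slice l under its plan
    # Slice weights sigma_tau = softmax(D; tau); sign convention is an assumption.
    w = np.exp(tau * D - np.max(tau * D))
    w /= w.sum()
    G = np.tensordot(w, U, axes=1)        # aggregate plans across slices
    return V @ G                          # attention-weighted output, shape (d, N)
```

Each `U[l]` has marginals summing to 1 because the SoftSort matrices are row-stochastic, which is what makes the aggregated attention plan doubly stochastic up to the 1/N normalization.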