Transformer Meets Twicing: Harnessing Unattended Residual Information

Authors: Laziz Abdullaev, Tan Nguyen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the performance gains of our model over baseline transformers on multiple tasks and benchmarks, including image classification and language modeling, on both clean and corrupted data. The code is publicly available at https://github.com/lazizcodes/twicing_attention. In Section 4, we present our experimental results using Twicing Attention...
Researcher Affiliation | Academia | Laziz U. Abdullaev, Department of Mathematics, National University of Singapore, EMAIL; Tan M. Nguyen, Department of Mathematics, National University of Singapore, EMAIL
Pseudocode | No | The paper formulates Twicing Attention as Definition 2 and describes its practical implementation in Remark 3, but it does not present these as a formal, structured pseudocode block or algorithm.
Open Source Code | Yes | The code is publicly available at https://github.com/lazizcodes/twicing_attention.
Open Datasets | Yes | Moreover, we empirically validate the performance improvements of Twicing Attention over standard self-attention in large-scale tasks such as ImageNet-1K classification (Touvron et al., 2021), ADE20K image segmentation (Strudel et al., 2021) and WikiText-103 language modelling (Merity et al., 2016)...
Dataset Splits | Yes | We use the full ImageNet dataset that contains 1.28M training images and 50K validation images. The validation set and test sets consist of 60 articles with 218K and 246K tokens respectively. We follow the setup of (Han et al., 2023; Teo & Nguyen, 2025) and assess models by training them on clean data before attacking only the test data using an attack rate of 4%.
Hardware Specification | Yes | All models are trained using four NVIDIA A100 SXM4 40GB GPUs, including both language and vision models. ImageNet classification under adversarial attacks is evaluated using two NVIDIA A100 SXM4 40GB GPUs, while only one such GPU was used to evaluate against ImageNet-A, R, C and the Word Swap attack for language modelling.
Software Dependencies | No | The paper mentions 'Adam using a starting learning rate of 0.00025 and cosine scheduling under default PyTorch settings', indicating the use of PyTorch, but no specific version numbers for PyTorch or other libraries are provided.
Experiment Setup | Yes | We train using Adam with a starting learning rate of 0.0005 using cosine scheduling under default PyTorch settings, momentum of 0.9, batch size of 256, 5 warmup epochs starting from 0.000001 and 10 cooldown epochs, for an overall train run of 300 epochs. The input size is 224 and we follow the default AutoAugment policy and color jitter 0.4. To this end, the small backbone uses 16 layers, 8 heads of dimension 16, a feedforward layer of size 2048 and an embedding dimension of 128. We use a dropout rate of 0.1. We trained with Adam using a starting learning rate of 0.00025 and cosine scheduling under default PyTorch settings. We used a batch size of 96 and trained for 120 epochs and 2000 warmup steps. The train and evaluation target lengths were set to 256.
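As the Pseudocode row notes, the paper gives no structured algorithm for Twicing Attention (Definition 2). The sketch below is a minimal NumPy reconstruction based on the kernel-twicing idea the title references: replacing the attention smoother A with 2A − A², which adds back the residual (I − A)AV that plain attention discards. The (2A − A²) update is inferred from twicing in nonparametric smoothing and should be checked against the paper's Definition 2 and the released code before being treated as the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def twicing_attention(Q, K, V):
    """Hypothetical sketch of Twicing Attention (cf. Definition 2).

    Standard self-attention returns A @ V with A = softmax(Q K^T / sqrt(d)).
    Kernel twicing replaces the smoother A with 2A - A^2, i.e. it feeds the
    unattended residual back into the output:
        (2A - A^2) V = A V + (I - A) A V.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # row-stochastic attention matrix
    return (2 * A - A @ A) @ V          # twiced smoother applied to values
```

The identity 2A − A² = A + (I − A)A makes the "unattended residual information" reading explicit: the second term re-smooths whatever the first pass left behind.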
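The Experiment Setup row quotes a warmup-then-cosine learning-rate schedule (5 warmup epochs starting from 1e-6, base rate 5e-4, 300 epochs total). The exact PyTorch scheduler classes are not specified, so the following self-contained sketch is an assumption about the schedule shape only; the quoted 10 cooldown epochs are deliberately not modeled.

```python
import math

def lr_at_epoch(epoch, total_epochs=300, warmup_epochs=5,
                base_lr=5e-4, warmup_start_lr=1e-6, min_lr=0.0):
    """Hypothetical warmup + cosine schedule matching the quoted hyperparameters.

    Linearly ramps from warmup_start_lr to base_lr over the warmup epochs,
    then follows a half-cosine decay from base_lr toward min_lr over the
    remaining epochs.
    """
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a training loop this value would be written into each optimizer parameter group once per epoch; with PyTorch one could instead compose the built-in `LinearLR` and `CosineAnnealingLR` schedulers to the same effect.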