Setting the Record Straight on Transformer Oversmoothing
Authors: Gbetondji Jean-Sebastien Dovonon, Michael M. Bronstein, Matt Kusner
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we test these observations empirically and theoretically and uncover a number of surprising findings. We find that there are cases where feature similarity increases but, contrary to prior results, this is not inevitable, even for existing pre-trained models. Theoretically, we show that smoothing behavior depends on the eigenspectrum of the value and projection weights. We verify this empirically and observe that the sign of layer normalization weights can influence this effect. Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior. |
| Researcher Affiliation | Academia | Gbètondji J-S Dovonon (University College London); Michael Bronstein (University of Oxford); Matt J. Kusner (Polytechnique Montréal, Mila Quebec AI Institute) |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. It primarily uses mathematical equations and descriptive text. |
| Open Source Code | No | The paper mentions using existing codebases: "We use the original DeiT code and training recipe described above." and "We base our NLP experiments on Geiping & Goldstein (2023), using their codebase." However, there is no explicit statement about releasing the authors' own implementation of the methodology described in this paper, nor is a direct link to a repository provided. |
| Open Datasets | Yes | We train sharpening and smoothing models on CIFAR100 (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), and The Pile (Gao et al., 2020). We evaluate models on SuperGLUE (Wang et al., 2020) after fine-tuning for each task. |
| Dataset Splits | No | The paper mentions training on CIFAR100 for 300 epochs using the DeiT training recipe and evaluating on SuperGLUE after fine-tuning. However, it does not explicitly state the training/validation/test splits used for these datasets, nor does it cite specific predefined splits. It relies on implicit standard splits without explicit mention. |
| Hardware Specification | Yes | The models were trained on two Nvidia RTX 2080 Ti GPUs. On ImageNet, we use the original DeiT code and training recipe described above. Changes from CIFAR100 are that we use a batch size of 512 and train on a single Nvidia RTX 4090 GPU. In order to ensure a fair comparison, all models are trained on a reference system with an RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and mixed precision training with bfloat16, but does not specify version numbers for any software libraries or dependencies used (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | On CIFAR100 for 300 epochs using the cross-entropy loss and the AdamW optimizer (Loshchilov & Hutter, 2019). Our setup is the one used in Park & Kim (2022), which itself follows the DeiT training recipe (Touvron et al., 2021a). We use a cosine annealing schedule with an initial learning rate of 1.25×10⁻⁴ and weight decay of 5×10⁻². We use a batch size of 96. We use data augmentation including RandAugment (Cubuk et al., 2019), CutMix (Yun et al., 2019), Mixup (Zhang et al., 2018), and label smoothing (Touvron et al., 2021a). On ImageNet, ... we use a batch size of 512. The batch size is 8192 and the sequence length is 128. We use mixed precision training with bfloat16. |
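The hyperparameters quoted in the Experiment Setup row can be collected into a small sketch. The helper below is an illustrative reconstruction, not code from the paper: the function name `cosine_lr` and the assumption that the schedule anneals from the stated initial rate down to zero over the 300 CIFAR100 epochs are ours.

```python
import math

# Hyperparameters quoted from the paper's CIFAR100 setup.
INIT_LR = 1.25e-4      # initial learning rate for AdamW
WEIGHT_DECAY = 5e-2    # AdamW weight decay
EPOCHS = 300           # total training epochs
BATCH_SIZE = 96        # CIFAR100 batch size

def cosine_lr(epoch: int, total_epochs: int = EPOCHS,
              lr_max: float = INIT_LR, lr_min: float = 0.0) -> float:
    """Cosine annealing: lr_max at epoch 0, lr_min at the final epoch.

    Assumes a single annealing cycle with no warmup or restarts.
    """
    t = min(epoch, total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Learning rate at the start, midpoint, and end of training.
print(cosine_lr(0))    # 1.25e-4
print(cosine_lr(150))  # 6.25e-5 (half the initial rate)
print(cosine_lr(300))  # 0.0
```

In a PyTorch setup these values would typically be passed to `torch.optim.AdamW` together with `torch.optim.lr_scheduler.CosineAnnealingLR`, which implements the same formula.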