Transformers Learn Low Sensitivity Functions: Investigations and Implications

Authors: Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Vatsal Sharan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we first conduct experiments on vision datasets. We empirically compare (Vision-)Transformers with MLPs, CNNs, and ConvMixers, and observe that transformers have lower sensitivity compared to other candidate architectures (see Section 4). Similarly, we conduct experiments on language tasks and observe that transformers learn predictors with lower sensitivity than LSTM models. Furthermore, transformers tend to have uniform sensitivity to all tokens while LSTMs are more sensitive to more recent tokens (see Section 5).
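The sensitivity notion the paper compares architectures on is, for Boolean inputs, the standard average sensitivity: the number of single-coordinate flips that change a function's output, averaged over all inputs. A minimal illustrative sketch (pure Python; not the authors' implementation, and the function names are my own):

```python
import itertools

def average_sensitivity(f, n):
    """Average, over all 2^n Boolean inputs, of the number of
    single-bit flips that change f's output."""
    total = 0
    for bits in itertools.product([0, 1], repeat=n):
        x = list(bits)
        for i in range(n):
            y = x.copy()
            y[i] ^= 1  # flip coordinate i
            if f(x) != f(y):
                total += 1
    return total / 2 ** n

# Parity is maximally sensitive: every single-bit flip changes the output.
parity = lambda x: sum(x) % 2
# A "dictator" function depends on one coordinate only, so it has sensitivity 1.
dictator = lambda x: x[0]
```

On n = 3 inputs, `average_sensitivity(parity, 3)` returns 3.0 and `average_sensitivity(dictator, 3)` returns 1.0, matching the extremes the low-sensitivity claim is measured against.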
Researcher Affiliation | Academia | Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Vatsal Sharan. University of Southern California. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and processes through definitions, textual descriptions, and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/estija/sensitivity.
Open Datasets | Yes | We consider three datasets in this section (see Appendix D for details), namely CIFAR-10 (Krizhevsky, 2009), ImageNet-1k (Russakovsky et al., 2015) and Fashion-MNIST (Xiao et al., 2017). ... We consider two binary classification datasets, MRPC (Dolan & Brockett, 2005) and QQP (Iyer et al., 2017) (see Appendix D for details)... Fig. 16 shows the training accuracy and sensitivity of a ResNet-18 and a ViT-small model trained on the SVHN dataset (Netzer et al., 2011). ... we now consider a slightly more complicated dataset, namely MNIST (LeCun & Cortes, 2005).
Dataset Splits | Yes | Fashion-MNIST (Xiao et al., 2017) consists of 28x28 grayscale images of Zalando's articles. This is a 10-class classification task with 60k training and 10k test images. ... The CIFAR-10 dataset (Krizhevsky, 2009) ... There are 50k training and 10k test images. ... SVHN. Street View House Numbers (SVHN) (Netzer et al., 2011) ... There are 60k images in the train set and 10k images in the test set. ... MRPC. Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005) ... It has 4076 training examples and 1725 validation examples. ... QQP. Quora Question Pairs (QQP) (Iyer et al., 2017) ... It has 364k training examples and 40k validation examples.
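For reference, the reported split sizes can be collected into one table. This is a sketch for checking a reproduction against the paper's numbers; the dict layout and key names are my own, the values are taken from the quoted text above:

```python
# Train/eval split sizes as reported in the paper (Appendix D).
SPLITS = {
    "Fashion-MNIST": {"train": 60_000,  "test": 10_000},
    "CIFAR-10":      {"train": 50_000,  "test": 10_000},
    "SVHN":          {"train": 60_000,  "test": 10_000},
    "MRPC":          {"train": 4_076,   "validation": 1_725},
    "QQP":           {"train": 364_000, "validation": 40_000},
}

def check_split(name, train_size, eval_size):
    """Return True if observed sizes match the reported splits."""
    reported = SPLITS[name]
    eval_key = "test" if "test" in reported else "validation"
    return reported["train"] == train_size and reported[eval_key] == eval_size
```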
Hardware Specification | Yes | Experiments on vision and language tasks were run on internal clusters using NVIDIA RTX A6000 GPUs with 48GB of VRAM. For the experiments on vision data, we use two GPUs and the runtime for each setting is about 17 hours. Experiments on language tasks use one GPU and the runtime for each experiment is about 24 hours.
Software Dependencies | No | We use PyTorch (Paszke et al., 2019) as our code framework and as our implementation of LSTMs. PyTorch is licensed under the Modified BSD license. The paper mentions PyTorch as a framework but does not specify its version number, nor does it list versions for any other key software dependencies.
Experiment Setup | Yes | We use standard SGD training with batch size 100. ... All the models are trained with SGD using batch size 50 for MNIST and 100 for the other datasets. We use patch size 7 for MNIST and 4 for the other datasets. ... We train both models with a learning rate of 0.01. ... We use learning rates of 0.1 for the MLP with LeakyReLU, 0.5 for the MLP with sigmoid, 0.005 for the CNN and 0.1 for the ViT. ... The learning rate is set as 0.1 for ViT-small, 0.2 for ViT-simple, 0.06 for ConvMixer, 0.001 for ResNet-18 and 0.005 for DenseNet-121. ... We use the AdamW optimizer with a learning rate of 0.0001 and weight decay of 0.0001 for all the tasks. We also use a dropout rate of 0.1. We use a batch size of 32 for all the experiments.
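The hyperparameters quoted above are scattered across several sentences; gathering them into one config makes a reproduction attempt easier to audit. The grouping and key names below are mine (illustrative only), the values are the ones reported:

```python
# Reported hyperparameters, collected into one place.
VISION_LR = {
    "ViT-small": 0.1,
    "ViT-simple": 0.2,
    "ConvMixer": 0.06,
    "ResNet-18": 0.001,
    "DenseNet-121": 0.005,
}

# SGD batch size and ViT patch size depend on the dataset:
# MNIST uses 50 / 7, all other vision datasets use 100 / 4.
def vision_config(dataset):
    is_mnist = dataset == "MNIST"
    return {
        "optimizer": "SGD",
        "batch_size": 50 if is_mnist else 100,
        "patch_size": 7 if is_mnist else 4,
    }

# Language tasks share one AdamW setting across experiments.
LANGUAGE_CONFIG = {
    "optimizer": "AdamW",
    "lr": 1e-4,
    "weight_decay": 1e-4,
    "dropout": 0.1,
    "batch_size": 32,
}
```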