Implicit Bias and Fast Convergence Rates for Self-attention
Authors: Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Throughout, we validate our findings on both synthetic and real datasets (see Section 5). To complement our theory, we present experiments on synthetic/real-world data demonstrating that (S)NGD-based training leads to faster convergence for various metrics compared to vanilla (S)GD. |
| Researcher Affiliation | Academia | Bhavya Vasudeva EMAIL University of Southern California Puneesh Deora EMAIL University of British Columbia Christos Thrampoulidis EMAIL University of British Columbia |
| Pseudocode | No | The paper describes methods and proofs using mathematical notation and prose but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using third-party libraries and implementations (PyTorch, ViT-small model from GitHub) but does not provide specific access information or a statement about releasing the authors' own source code for the described methodology. |
| Open Datasets | Yes | Fig. 1: ...finetuning a pre-trained BERT model on the MNLI dataset... Fig. 4: We use the Civil Comments dataset (Borkan et al., 2019)... Fig. 6: We use the MNIST dataset (Le Cun & Cortes, 2005)... Fig. 7: We consider the CIFAR-10 dataset (Krizhevsky, 2009)... |
| Dataset Splits | No | The paper mentions using datasets like MNLI, Civil Comments, MNIST, and CIFAR-10, but it does not explicitly state the training, validation, and test splits (e.g., percentages or sample counts) used for these datasets. |
| Hardware Specification | Yes | The experiments on vision and language datasets were run on an internal cluster with two NVIDIA V100 GPUs with 32 GB memory each. |
| Software Dependencies | No | The paper mentions using PyTorch and Hugging Face pytorch-transformers but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We use batch-size 32 to train all models. Learning rates are Adam: 2e-5, SGD: 1e-3, SNGD: 0.01. The ηmax for SPS and SNGD is set to 0.1. We use patch-size 4, and set depth as 2, number of heads as 8 and MLP width as 128. All models are trained with a batch-size of 100. Learning rates are set as follows. SGD: 0.1, SNGD: 0.001, Adam: 0.001. ηmax for SPS and SNGD is set to 0.01. |
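The setup row above compares (S)GD against (S)NGD, i.e. (stochastic) normalized gradient descent, which rescales each update by the gradient norm before applying the learning rate. A minimal sketch of that update rule on a toy quadratic objective (the function name `ngd_step` and the toy objective are illustrative, not from the paper; the learning rate 0.01 matches the SNGD value quoted in the table):

```python
import numpy as np

def ngd_step(w, grad, lr):
    """One normalized gradient descent step: w <- w - lr * grad / ||grad||.

    Unlike plain GD, the step length is always exactly lr (when grad != 0),
    which is the mechanism behind the faster convergence the paper studies.
    """
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return w  # at a stationary point; no update
    return w - lr * grad / norm

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient at w is w itself.
w = np.array([3.0, 4.0])          # gradient norm here is 5.0
w_next = ngd_step(w, w.copy(), lr=0.01)
```

Because the gradient is normalized to unit length, each iterate moves a fixed distance `lr`, regardless of how flat or steep the loss is; this is what makes the method's convergence rate insensitive to the vanishing gradient norms that slow vanilla GD near the max-margin direction.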