Implicit Bias and Fast Convergence Rates for Self-attention

Authors: Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Throughout, we validate our findings on both synthetic and real datasets (see Section 5). To complement our theory, we present experiments on synthetic/real-world data demonstrating that (S)NGD-based training leads to faster convergence for various metrics compared to vanilla (S)GD."
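The comparison above is between vanilla (stochastic) gradient descent and its normalized variant, (S)NGD. A minimal sketch of the two update rules on a toy quadratic loss, assuming generic differentiable objectives (the function names and step sizes are illustrative, not the paper's code):

```python
import numpy as np

def gd_step(w, grad, lr):
    # Vanilla GD: the step length scales with the gradient norm.
    return w - lr * grad

def ngd_step(w, grad, lr, eps=1e-12):
    # Normalized GD: move a fixed distance lr along the unit gradient
    # direction, regardless of the gradient's magnitude.
    return w - lr * grad / (np.linalg.norm(grad) + eps)

# Toy loss L(w) = 0.5 * ||w||^2, so grad L(w) = w.
w = np.array([3.0, 4.0])
g = w.copy()
w_gd = gd_step(w, g, lr=0.1)    # step length 0.1 * ||g|| = 0.5
w_ngd = ngd_step(w, g, lr=0.1)  # step length exactly 0.1
```

Because the NGD step does not shrink as the gradient vanishes, it can keep making fixed-size progress late in training, which is the mechanism behind the faster-convergence claim the report quotes.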
Researcher Affiliation | Academia | Bhavya Vasudeva, University of Southern California; Puneesh Deora, University of British Columbia; Christos Thrampoulidis, University of British Columbia
Pseudocode | No | The paper describes its methods and proofs using mathematical notation and prose but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using third-party libraries and implementations (PyTorch, a ViT-small model from GitHub) but provides no access information or statement about releasing the authors' own source code for the described methodology.
Open Datasets | Yes | Fig. 1: "...finetuning a pre-trained BERT model on the MNLI dataset..." Fig. 4: "We use the Civil Comments dataset (Borkan et al., 2019)..." Fig. 6: "We use the MNIST dataset (Le Cun & Cortes, 2005)..." Fig. 7: "We consider the CIFAR-10 dataset (Krizhevsky, 2009)..."
Dataset Splits | No | The paper uses MNLI, Civil Comments, MNIST, and CIFAR-10, but it does not explicitly state the training/validation/test splits (e.g., percentages or sample counts) used for these datasets.
Hardware Specification | Yes | "The experiments on vision and language datasets were run on an internal cluster with two NVIDIA V100 GPUs with 32 GB memory each."
Software Dependencies | No | The paper mentions using PyTorch and Hugging Face pytorch-transformers but does not provide version numbers for these software dependencies.
Experiment Setup | Yes | "We use batch-size 32 to train all models. Learning rates are Adam: 2e-5, SGD: 1e-3, SNGD: 0.01. The ηmax for SPS and SNGD is set to 0.1." For the ViT experiments: "We use patch-size 4, and set depth as 2, number of heads as 8 and MLP width as 128. All models are trained with a batch-size of 100. Learning rates are set as follows. SGD: 0.1, SNGD: 0.001, Adam: 0.001. ηmax for SPS and SNGD is set to 0.01."
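For reference, the reported hyperparameters can be collected into a plain configuration mapping. This is only a sketch: the key names are ours, and since the authors' training scripts are not released, the structure is illustrative rather than a reproduction of their code.

```python
# Hyperparameters as quoted in the paper's experiment setup.
# Key names are illustrative; the authors' code is not public.
language_setup = {
    "batch_size": 32,
    "lr": {"Adam": 2e-5, "SGD": 1e-3, "SNGD": 0.01},
    "eta_max": 0.1,  # for SPS and SNGD
}

vit_setup = {
    "batch_size": 100,
    "patch_size": 4,
    "depth": 2,
    "num_heads": 8,
    "mlp_width": 128,
    "lr": {"SGD": 0.1, "SNGD": 0.001, "Adam": 0.001},
    "eta_max": 0.01,  # for SPS and SNGD
}
```

Recording the setup this way makes the asymmetry easy to see: the reported SGD and SNGD learning rates differ by two orders of magnitude, and in opposite directions between the two setups.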