Implicit Bias and Fast Convergence Rates for Self-attention

Authors: Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Throughout, we validate our findings on both synthetic and real datasets (see Section 5). To complement our theory, we present experiments on synthetic/real-world data demonstrating that (S)NGD-based training leads to faster convergence for various metrics compared to vanilla (S)GD."
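The comparison above is between vanilla (stochastic) gradient descent and its normalized variant, (S)NGD. A minimal sketch of the two update rules on a toy quadratic loss, assuming generic differentiable objectives (the function names and step sizes are illustrative, not the paper's code):

```python
import numpy as np

def gd_step(w, grad, lr):
    # Vanilla GD: the step length scales with the gradient norm.
    return w - lr * grad

def ngd_step(w, grad, lr, eps=1e-12):
    # Normalized GD: move a fixed distance lr along the unit gradient
    # direction, regardless of the gradient's magnitude.
    return w - lr * grad / (np.linalg.norm(grad) + eps)

# Toy loss L(w) = 0.5 * ||w||^2, so grad L(w) = w.
w = np.array([3.0, 4.0])
g = w.copy()
w_gd = gd_step(w, g, lr=0.1)    # step length 0.1 * ||g|| = 0.5
w_ngd = ngd_step(w, g, lr=0.1)  # step length exactly 0.1
```

Because the NGD step does not shrink as the gradient vanishes, it can keep making fixed-size progress late in training, which is the mechanism behind the faster-convergence claim the report quotes.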
Researcher Affiliation | Academia | Bhavya Vasudeva, University of Southern California; Puneesh Deora, University of British Columbia; Christos Thrampoulidis, University of British Columbia
Pseudocode | No | The paper describes its methods and proofs using mathematical notation and prose but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using third-party libraries and implementations (PyTorch, a ViT-small model from GitHub) but provides no access information or statement about releasing the authors' own source code for the described methodology.
Open Datasets | Yes | Fig. 1: "...finetuning a pre-trained BERT model on the MNLI dataset..." Fig. 4: "We use the Civil Comments dataset (Borkan et al., 2019)..." Fig. 6: "We use the MNIST dataset (Le Cun & Cortes, 2005)..." Fig. 7: "We consider the CIFAR-10 dataset (Krizhevsky, 2009)..."
Dataset Splits | No | The paper uses MNLI, Civil Comments, MNIST, and CIFAR-10, but it does not explicitly state the training/validation/test splits (e.g., percentages or sample counts) used for these datasets.
Hardware Specification | Yes | "The experiments on vision and language datasets were run on an internal cluster with two NVIDIA V100 GPUs with 32 GB memory each."
Software Dependencies | No | The paper mentions using PyTorch and Hugging Face pytorch-transformers but does not provide version numbers for these software dependencies.
Experiment Setup | Yes | "We use batch-size 32 to train all models. Learning rates are Adam: 2e-5, SGD: 1e-3, SNGD: 0.01. The ηmax for SPS and SNGD is set to 0.1." For the ViT experiments: "We use patch-size 4, and set depth as 2, number of heads as 8 and MLP width as 128. All models are trained with a batch-size of 100. Learning rates are set as follows. SGD: 0.1, SNGD: 0.001, Adam: 0.001. ηmax for SPS and SNGD is set to 0.01."
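For reference, the reported hyperparameters can be collected into a plain configuration mapping. This is only a sketch: the key names are ours, and since the authors' training scripts are not released, the structure is illustrative rather than a reproduction of their code.

```python
# Hyperparameters as quoted in the paper's experiment setup.
# Key names are illustrative; the authors' code is not public.
language_setup = {
    "batch_size": 32,
    "lr": {"Adam": 2e-5, "SGD": 1e-3, "SNGD": 0.01},
    "eta_max": 0.1,  # for SPS and SNGD
}

vit_setup = {
    "batch_size": 100,
    "patch_size": 4,
    "depth": 2,
    "num_heads": 8,
    "mlp_width": 128,
    "lr": {"SGD": 0.1, "SNGD": 0.001, "Adam": 0.001},
    "eta_max": 0.01,  # for SPS and SNGD
}
```

Recording the setup this way makes the asymmetry easy to see: the reported SGD and SNGD learning rates differ by two orders of magnitude, and in opposite directions between the two setups.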