Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Max-Margin Token Selection in Attention Mechanism

Authors: Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we verify our theoretical findings via numerical experiments and provide insights. (Section 4: Experiments)
Researcher Affiliation | Academia | Davoud Ataee Tarzanagh (University of Pennsylvania), Yingcong Li and Xuechen Zhang (University of California, Riverside), and Samet Oymak (University of Michigan / UC Riverside)
Pseudocode | No | The paper describes algorithms and mathematical formulations but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | The code for experiments can be found at https://github.com/ucr-optml/max_margin_attention.
Open Datasets | Yes | To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR10 dataset [24] for 400 epochs with fixed learning rate 3×10⁻³. [24] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55(5), 2014.
Dataset Splits | No | The paper mentions using the CIFAR-10 dataset but does not explicitly describe the training, validation, and test splits with specific percentages or sample counts.
Hardware Specification | No | The paper describes the experiments but does not specify the hardware used (e.g., GPU models, CPU types, or cloud compute instances).
Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify software dependencies with version numbers.
Experiment Setup | Yes | During training, we use SGD optimizer with learning rate 0.1 and train the model for 1000 iterations. To study softmax sparsity and the evolution of attention weights throughout training, we train a vision transformer (ViT-base) model [23] from scratch, utilizing the CIFAR10 dataset [24] for 400 epochs with fixed learning rate 3×10⁻³.
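
The Experiment Setup row quotes two configurations: a small model trained with SGD (learning rate 0.1, 1000 iterations) and a ViT-base model trained from scratch on CIFAR-10 for 400 epochs at a fixed learning rate of 3×10⁻³. The sketch below is a minimal PyTorch loop illustrating only the second setup; it is not the authors' code (that is at the repository linked above). The model choice (torchvision's vit_b_16 as a stand-in for ViT-base), batch size, normalization constants, loss function, and optimizer for this run are assumptions not stated in the excerpt.

```python
# Minimal sketch of the ViT-base / CIFAR-10 run quoted above; NOT the authors' code.
# Assumed details (not in the paper excerpt): torchvision's vit_b_16 as a stand-in
# for ViT-base, batch size 128, standard CIFAR-10 normalization, cross-entropy loss.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.Resize(224),  # vit_b_16 expects 224x224 inputs; CIFAR-10 images are 32x32
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)  # batch size assumed

device = "cuda" if torch.cuda.is_available() else "cpu"
# Trained from scratch (weights=None), 10 CIFAR-10 classes
model = torchvision.models.vit_b_16(weights=None, num_classes=10).to(device)

# Fixed learning rate 3e-3 for 400 epochs, as quoted; the optimizer type for this
# run is not specified in the excerpt, so plain SGD is assumed here.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(400):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The sketch keeps only the details present in the quote (training from scratch, 400 epochs, fixed learning rate); data augmentation, warmup, weight decay, and evaluation are omitted because the excerpt does not mention them.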