On the Optimization and Generalization of Multi-head Attention

Authors: Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis

TMLR 2024

Reproducibility assessment. For each variable: the result, followed by the supporting LLM response.
Research Type: Experimental
  "In this section we provide some experiments discussing the role of number of heads H in the training dynamics on synthetic data models. Data Model DM1 ... We use n = 100 training samples in each experiment and evaluate on a test set of size 300 (total 5 trials). All models are initialized as θ(0) = 0. Figure 2 shows the effect of increasing the number of heads when running GD with constant step-size η = 1.0 and data generated from data model DM1. ... Planted data model ... The train set contains n = 1000 samples in each experiment and we evaluate on a test set of size 3000. Each result is averaged over 5 trials. ... SST2 dataset: We conduct an additional experiment on a simple real-world dataset. The SST2 dataset (Socher et al., 2013) consists of sentences, each with an associated binary label to classify the sentiment. We fine-tune RoBERTa-based models with varying numbers of heads using the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 5e-6."
Researcher Affiliation: Academia
  Puneesh Deora (University of British Columbia), Rouzbeh Ghaderi (University of British Columbia), Hossein Taheri (University of California, Santa Barbara), Christos Thrampoulidis (University of British Columbia)
Pseudocode: No
  The paper does not contain any clearly labeled pseudocode or algorithm blocks; the methods are described in mathematical formulations and prose.
Open Source Code: No
  The paper does not provide an explicit statement about releasing code or a link to a code repository for the methodology described. It mentions using the "Hugging Face pytorch-transformers implementation of the roberta-base model", which is a third-party tool.
Open Datasets: Yes
  "SST2 dataset: We conduct an additional experiment on a simple real-world dataset. The SST2 dataset (Socher et al., 2013) consists of sentences, each with an associated binary label to classify the sentiment."
Dataset Splits: Yes
  "Data Model DM1: We set the number of tokens T = 10 and sparsity level ζ = 0.1. ... We use n = 100 training samples in each experiment and evaluate on a test set of size 300 (total 5 trials). ... Planted data model ... The train set contains n = 1000 samples in each experiment and we evaluate on a test set of size 3000."
Hardware Specification: No
  The authors acknowledge use of the Sockeye cluster by UBC Advanced Research Computing, but specific hardware details such as GPU/CPU models or memory configurations are not provided.
Software Dependencies: No
  "We use the Hugging Face pytorch-transformers implementation of the roberta-base model, with pretrained weights (Liu et al., 2019)." No specific version numbers for PyTorch, Hugging Face Transformers, or Python are mentioned.
Experiment Setup: Yes
  "Figure 2: ... trained with GD for constant step-size η = 1.0. ... Figure 3: ... trained with GD when scaling step-size as η = O(H); (right) trained with Adam with constant step-size η = 0.06. ... All models are initialized as θ(0) = 0. ... We fine-tune RoBERTa-based models with varying numbers of heads using the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 5e-6. We train all the models for 5 epochs, with the batch-size set to 32."
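The synthetic-data protocol quoted above (a one-layer multi-head attention classifier trained with full-batch gradient descent at constant step size η = 1.0 from zero initialization θ(0) = 0) can be sketched as follows. This is a loose illustration only, not the paper's implementation: the random data stands in for the DM1/planted data models, and the embedding dimension, mean-token query, and logistic loss are assumptions made for a self-contained example. T = 10, n = 100, and H vary as in the report.

```python
import torch

# Hedged sketch of the synthetic-data training setup: one-layer multi-head
# attention, full-batch GD, constant step size eta = 1.0, zero initialization.
# Data, dimensions, query choice, and loss are illustrative assumptions.
torch.manual_seed(0)
T, d, H, n = 10, 8, 4, 100           # T tokens and n samples per the report
X = torch.randn(n, T, d)             # placeholder data (not the paper's DM1)
y = torch.sign(torch.randn(n))       # placeholder +/-1 labels

# Per-head parameters, all initialized at zero: theta(0) = 0
W = torch.zeros(H, d, d, requires_grad=True)   # attention weight matrices
U = torch.zeros(H, d, requires_grad=True)      # per-head output vectors

def forward(X):
    q = X.mean(dim=1)                # mean token as query (assumption)
    logits = torch.zeros(X.shape[0])
    for h in range(H):
        # attention scores over tokens for head h: x_t^T W_h q
        scores = torch.einsum('ntd,de,ne->nt', X, W[h], q)
        attn = torch.softmax(scores, dim=1)
        ctx = torch.einsum('nt,ntd->nd', attn, X)   # attention-weighted context
        logits = logits + ctx @ U[h]
    return logits / H                # average over heads

eta = 1.0                            # constant step size, as in the report
for step in range(50):
    loss = torch.log1p(torch.exp(-y * forward(X))).mean()   # logistic loss
    g_W, g_U = torch.autograd.grad(loss, (W, U))
    with torch.no_grad():            # full-batch GD update
        W -= eta * g_W
        U -= eta * g_U

print(f"final train loss: {loss.item():.4f}")
```

At zero initialization the softmax attention is uniform over tokens, so the first update moves only the output vectors; the attention weights begin to move once the output vectors are nonzero. The paper's actual data models and figures should be consulted for the exact setup.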