On the Optimization and Generalization of Multi-head Attention
Authors: Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we provide some experiments discussing the role of the number of heads H in the training dynamics on synthetic data models. Data Model DM1 ... We use n = 100 training samples in each experiment and evaluate on a test set of size 300 (total 5 trials). All models are initialized as θ(0) = 0. Figure 2 shows the effect of increasing the number of heads when running GD with constant step-size η = 1.0 and data generated from data model DM1. ... Planted data model ... The train set contains n = 1000 samples in each experiment and we evaluate on a test set of size 3000. Each result is averaged over 5 trials. ... SST2 dataset We conduct an additional experiment on a simple real-world dataset. The SST2 dataset (Socher et al., 2013) consists of sentences, each with an associated binary label for sentiment classification. We fine-tune RoBERTa-based models with varying numbers of heads using the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 5e-6. |
| Researcher Affiliation | Academia | Puneesh Deora (University of British Columbia); Rouzbeh Ghaderi (University of British Columbia); Hossein Taheri (University of California, Santa Barbara); Christos Thrampoulidis (University of British Columbia) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described in mathematical formulations and prose. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the methodology described. It mentions using 'Hugging Face pytorch-transformers implementation of the roberta-base model', which is a third-party tool. |
| Open Datasets | Yes | SST2 dataset We conduct an additional experiment on a simple real-world dataset. The SST2 dataset (Socher et al., 2013) consists of sentences, each with an associated binary label for sentiment classification. |
| Dataset Splits | Yes | Data Model DM1 We set the number of tokens T = 10 and sparsity level ζ = 0.1. ... We use n = 100 training samples in each experiment and evaluate on a test set of size 300 (total 5 trials). ... Planted data model ... The train set contains n = 1000 samples in each experiment and we evaluate on a test set of size 3000. |
| Hardware Specification | No | The authors also acknowledge use of the Sockeye cluster by UBC Advanced Research Computing. However, specific hardware details such as GPU/CPU models or memory configurations are not provided. |
| Software Dependencies | No | We use the Hugging Face pytorch-transformers implementation of the roberta-base model, with pretrained weights (Liu et al., 2019). No specific version numbers for PyTorch, Hugging Face Transformers, or Python are mentioned. |
| Experiment Setup | Yes | Figure 2: ... trained with GD for constant step-size η = 1.0. ... Figure 3: ... trained with GD when scaling step-size as η = O(H); (right) trained with Adam with constant step-size η = 0.06. ... All models are initialized as θ(0) = 0. ... We fine-tune RoBERTa-based models with varying numbers of heads using the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 5e-6. We train all the models for 5 epochs, with the batch-size set to 32. |
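The experiments above vary the number of heads H in a multi-head attention model. As a minimal sketch of the architecture under study — not the authors' exact parameterization (their data models DM1, initialization θ(0) = 0, and loss are specified only in the paper) — a NumPy forward pass of single-layer multi-head self-attention with H heads splitting the embedding dimension might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, H):
    """Single-layer multi-head self-attention forward pass.

    X: (T, d) matrix of T tokens; Wq, Wk, Wv: (d, d) projection
    matrices whose columns are split evenly across H heads.
    Returns the (T, d) concatenation of the H head outputs.
    """
    T, d = X.shape
    dh = d // H  # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(H):
        s = slice(h * dh, (h + 1) * dh)
        # (T, T) attention scores for head h, scaled by sqrt(dh)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))
        heads.append(A @ V[:, s])
    return np.concatenate(heads, axis=1)

# Toy sizes matching the quoted DM1 setting of T = 10 tokens;
# d and H are illustrative choices, not from the paper.
rng = np.random.default_rng(0)
T, d, H = 10, 8, 4
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = multi_head_attention(X, Wq, Wk, Wv, H)
print(out.shape)  # (10, 8): output dimension is independent of H
```

Increasing H here refines the attention into more, lower-dimensional heads while keeping the total parameter count of the projections fixed — the regime in which the paper contrasts constant step-size GD with the η = O(H) scaling.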