On the Optimization and Generalization of Multi-head Attention

Authors: Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis

TMLR 2024

Reproducibility assessment. For each variable: the result, followed by the supporting LLM response.
Research Type: Experimental
  "In this section we provide some experiments discussing the role of number of heads H in the training dynamics on synthetic data models. Data Model DM1 ... We use n = 100 training samples in each experiment and evaluate on a test set of size 300 (total 5 trials). All models are initialized as θ(0) = 0. Figure 2 shows the effect of increasing the number of heads when running GD with constant step-size η = 1.0 and data generated from data model DM1. ... Planted data model ... The train set contains n = 1000 samples in each experiment and we evaluate on a test set of size 3000. Each result is averaged over 5 trials. ... SST2 dataset: We conduct an additional experiment on a simple real-world dataset. The SST2 dataset (Socher et al., 2013) consists of sentences, each with an associated binary label to classify the sentiment. We fine-tune RoBERTa-based models with varying numbers of heads using the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 5e-6."
Researcher Affiliation: Academia
  Puneesh Deora (University of British Columbia), Rouzbeh Ghaderi (University of British Columbia), Hossein Taheri (University of California, Santa Barbara), Christos Thrampoulidis (University of British Columbia)
Pseudocode: No
  The paper does not contain any clearly labeled pseudocode or algorithm blocks; the methods are described in mathematical formulations and prose.
Open Source Code: No
  The paper does not provide an explicit statement about releasing code or a link to a code repository for the methodology described. It mentions using the "Hugging Face pytorch-transformers implementation of the roberta-base model", which is a third-party tool.
Open Datasets: Yes
  "SST2 dataset: We conduct an additional experiment on a simple real-world dataset. The SST2 dataset (Socher et al., 2013) consists of sentences, each with an associated binary label to classify the sentiment."
Dataset Splits: Yes
  "Data Model DM1: We set the number of tokens T = 10 and sparsity level ζ = 0.1. ... We use n = 100 training samples in each experiment and evaluate on a test set of size 300 (total 5 trials). ... Planted data model ... The train set contains n = 1000 samples in each experiment and we evaluate on a test set of size 3000."
Hardware Specification: No
  The authors acknowledge use of the Sockeye cluster by UBC Advanced Research Computing, but specific hardware details such as GPU/CPU models or memory configurations are not provided.
Software Dependencies: No
  "We use the Hugging Face pytorch-transformers implementation of the roberta-base model, with pretrained weights (Liu et al., 2019)." No specific version numbers for PyTorch, Hugging Face Transformers, or Python are mentioned.
Experiment Setup: Yes
  "Figure 2: ... trained with GD for constant step-size η = 1.0. ... Figure 3: ... trained with GD when scaling step-size as η = O(H); (right) trained with Adam with constant step-size η = 0.06. ... All models are initialized as θ(0) = 0. ... We fine-tune RoBERTa-based models with varying numbers of heads using the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 5e-6. We train all the models for 5 epochs, with the batch-size set to 32."
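The synthetic-data protocol quoted above (a one-layer multi-head attention classifier trained with full-batch gradient descent at constant step size η = 1.0 from zero initialization θ(0) = 0) can be sketched as follows. This is a loose illustration only, not the paper's implementation: the random data stands in for the DM1/planted data models, and the embedding dimension, mean-token query, and logistic loss are assumptions made for a self-contained example. T = 10, n = 100, and H vary as in the report.

```python
import torch

# Hedged sketch of the synthetic-data training setup: one-layer multi-head
# attention, full-batch GD, constant step size eta = 1.0, zero initialization.
# Data, dimensions, query choice, and loss are illustrative assumptions.
torch.manual_seed(0)
T, d, H, n = 10, 8, 4, 100           # T tokens and n samples per the report
X = torch.randn(n, T, d)             # placeholder data (not the paper's DM1)
y = torch.sign(torch.randn(n))       # placeholder +/-1 labels

# Per-head parameters, all initialized at zero: theta(0) = 0
W = torch.zeros(H, d, d, requires_grad=True)   # attention weight matrices
U = torch.zeros(H, d, requires_grad=True)      # per-head output vectors

def forward(X):
    q = X.mean(dim=1)                # mean token as query (assumption)
    logits = torch.zeros(X.shape[0])
    for h in range(H):
        # attention scores over tokens for head h: x_t^T W_h q
        scores = torch.einsum('ntd,de,ne->nt', X, W[h], q)
        attn = torch.softmax(scores, dim=1)
        ctx = torch.einsum('nt,ntd->nd', attn, X)   # attention-weighted context
        logits = logits + ctx @ U[h]
    return logits / H                # average over heads

eta = 1.0                            # constant step size, as in the report
for step in range(50):
    loss = torch.log1p(torch.exp(-y * forward(X))).mean()   # logistic loss
    g_W, g_U = torch.autograd.grad(loss, (W, U))
    with torch.no_grad():            # full-batch GD update
        W -= eta * g_W
        U -= eta * g_U

print(f"final train loss: {loss.item():.4f}")
```

At zero initialization the softmax attention is uniform over tokens, so the first update moves only the output vectors; the attention weights begin to move once the output vectors are nonzero. The paper's actual data models and figures should be consulted for the exact setup.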