Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Merging Text Transformer Models from Different Initializations
Authors: Neha Verma, Maha Elbayad
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging, across models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models. |
| Researcher Affiliation | Collaboration | Neha Verma EMAIL Center for Language and Speech Processing, Johns Hopkins University Maha Elbayad EMAIL Meta |
| Pseudocode | Yes | Algorithm 1 Multi-Headed Attention Permutation Input: Correlation Matrix C, number of heads, h for i = 1 to h do for j = i to h do πij, costs(i, j) = Linear Sum Assignment(Cij) end for end for πouter = Linear Sum Assignment(costs) πfinal = concat(πi,πouter(i)) Output: πfinal |
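The two-level assignment in Algorithm 1 can be sketched in Python with `scipy.optimize.linear_sum_assignment`. This is a hedged illustration, not the authors' released implementation: the function name `mha_permutation`, the representation of the correlation matrix as an `h x h` grid of per-head blocks, and the choice to fill the full cost matrix (rather than only `j >= i`) are assumptions made for a self-contained example.

```python
# Hypothetical sketch of Algorithm 1 (Multi-Headed Attention Permutation).
# Assumes C is an h x h grid of correlation blocks, each of shape (d, d),
# where d is the per-head dimension; this layout is an assumption.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mha_permutation(C, h):
    """Return a flat permutation over all h*d attention units."""
    d = C[0][0].shape[0]
    inner_perms = [[None] * h for _ in range(h)]
    costs = np.zeros((h, h))
    for i in range(h):
        for j in range(h):
            # Inner assignment: best unit matching within head pair (i, j),
            # maximizing total correlation.
            row, col = linear_sum_assignment(C[i][j], maximize=True)
            inner_perms[i][j] = col
            costs[i, j] = C[i][j][row, col].sum()
    # Outer assignment: match whole heads of model A to heads of model B.
    _, outer = linear_sum_assignment(costs, maximize=True)
    # Concatenate each chosen inner permutation, offset into its head's block.
    return np.concatenate(
        [inner_perms[i][outer[i]] + outer[i] * d for i in range(h)]
    )
```

Because the outer assignment is itself a permutation of head indices, the concatenated result is a valid permutation of all `h * d` units, which keeps the permuted model within the same functional equivalence class.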
| Open Source Code | Yes | We release our code at https://github.com/nverma1/merging-text-transformers |
| Open Datasets | Yes | Specifically, we consider 5 different BERT models, seeds 1 through 5, from the MultiBERTs reproductions (Devlin et al., 2019; Sellam et al., 2021). ... We use the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) for our classification tasks, and exclude WNLI as in Devlin et al. (2019). ... For our experiments on masked language modeling, we use the validation set of the Wikitext-103 benchmark as our evaluation data (Merity et al., 2016). For computing model activations, we extract a random sample of just over 1 million sentences of the Books corpus (Zhu et al., 2015). |
| Dataset Splits | Yes | For our experiments on masked language modeling, we use the validation set of the Wikitext-103 benchmark as our evaluation data (Merity et al., 2016)... For GLUE experiments, we use the full training data for each of the tasks to compute features, and the full validation sets to compute losses. The amount of data available for each task varies, and statistics are also reported in Table 4 in Appendix A. ... Table 4: Results on GLUE for both the original BERT model (Devlin et al., 2019), and our reproduction across MultiBERTs models 1-5. ... Training instances ... Validation instances |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'BERT models' and 'SGD variant' but does not specify any software libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | We consider 5 different BERT models, seeds 1 through 5... We report vanilla averaging as our main baseline for comparison, computed as θavg = 1/2(θA + θB). ... Specifically, we use 21 samples evenly spaced between λ = 0 and λ = 1, inclusive. ... To compute MLM loss/pseudo-perplexity, we use a masking probability of p = 0.15 across block sizes of 128 tokens. ... For classification tasks, we fine-tune each of the MultiBERTs models with a randomly initialized classification head, including pooling layer and classification layer weights. We keep the head initializations the same across models. ... We train MNLI-mismatched, QQP, QNLI, SST-2, CoLA, STS-B, and RTE tasks for 3 epochs, and we train MRPC for 5 epochs. We follow all other hyperparameters of the reproduction implemented in Ren et al. (2023). |
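The interpolation protocol above (vanilla averaging as the λ = 0.5 point of the linear path θ(λ) = (1 − λ)θA + λθB, sampled at 21 evenly spaced λ values) can be sketched as follows. This is a minimal illustration, not the paper's code: `eval_loss` is a hypothetical stand-in for the MLM or GLUE loss computation, and the barrier definition used here (maximum loss along the path minus the mean endpoint loss) is one common convention, assumed rather than taken from the paper.

```python
# Hedged sketch of linear interpolation between two flat parameter vectors
# and a simple loss-barrier estimate over 21 evenly spaced lambda values.
import numpy as np

def interpolate(theta_a, theta_b, lam):
    """theta(lam) = (1 - lam) * theta_a + lam * theta_b."""
    return (1.0 - lam) * theta_a + lam * theta_b

def loss_barrier(theta_a, theta_b, eval_loss, n_points=21):
    """Max loss along the interpolation path minus the mean endpoint loss.

    eval_loss is a hypothetical callable mapping parameters to a scalar loss.
    """
    lambdas = np.linspace(0.0, 1.0, n_points)  # 21 samples, inclusive
    losses = [eval_loss(interpolate(theta_a, theta_b, lam)) for lam in lambdas]
    endpoint_mean = 0.5 * (losses[0] + losses[-1])
    return max(losses) - endpoint_mean
```

Under this convention, a lower barrier between permuted models than between vanilla-averaged ones is the paper's central quantitative finding.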