Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Merging Text Transformer Models from Different Initializations
Authors: Neha Verma, Maha Elbayad
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging, across models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models. |
| Researcher Affiliation | Collaboration | Neha Verma EMAIL Center for Language and Speech Processing, Johns Hopkins University Maha Elbayad EMAIL Meta |
| Pseudocode | Yes | Algorithm 1 Multi-Headed Attention Permutation Input: Correlation Matrix C, number of heads, h for i = 1 to h do for j = i to h do πij, costs(i, j) = Linear Sum Assignment(Cij) end for end for πouter = Linear Sum Assignment(costs) πfinal = concat(πi,πouter(i)) Output: πfinal |
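The two-level assignment in Algorithm 1 can be sketched in Python with `scipy.optimize.linear_sum_assignment`. This is a hedged illustration, not the authors' released implementation: the function name `mha_permutation`, the representation of the correlation matrix as an `h x h` grid of per-head blocks, and the choice to fill the full cost matrix (rather than only `j >= i`) are assumptions made for a self-contained example.

```python
# Hypothetical sketch of Algorithm 1 (Multi-Headed Attention Permutation).
# Assumes C is an h x h grid of correlation blocks, each of shape (d, d),
# where d is the per-head dimension; this layout is an assumption.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mha_permutation(C, h):
    """Return a flat permutation over all h*d attention units."""
    d = C[0][0].shape[0]
    inner_perms = [[None] * h for _ in range(h)]
    costs = np.zeros((h, h))
    for i in range(h):
        for j in range(h):
            # Inner assignment: best unit matching within head pair (i, j),
            # maximizing total correlation.
            row, col = linear_sum_assignment(C[i][j], maximize=True)
            inner_perms[i][j] = col
            costs[i, j] = C[i][j][row, col].sum()
    # Outer assignment: match whole heads of model A to heads of model B.
    _, outer = linear_sum_assignment(costs, maximize=True)
    # Concatenate each chosen inner permutation, offset into its head's block.
    return np.concatenate(
        [inner_perms[i][outer[i]] + outer[i] * d for i in range(h)]
    )
```

Because the outer assignment is itself a permutation of head indices, the concatenated result is a valid permutation of all `h * d` units, which keeps the permuted model within the same functional equivalence class.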
| Open Source Code | Yes | We release our code at https://github.com/nverma1/merging-text-transformers |
| Open Datasets | Yes | Specifically, we consider 5 different BERT models, seeds 1 through 5, from the MultiBERTs reproductions (Devlin et al., 2019; Sellam et al., 2021). ... We use the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) for our classification tasks, and exclude WNLI as in Devlin et al. (2019). ... For our experiments on masked language modeling, we use the validation set of the Wikitext-103 benchmark as our evaluation data (Merity et al., 2016). For computing model activations, we extract a random sample of just over 1 million sentences of the Books corpus (Zhu et al., 2015). |
| Dataset Splits | Yes | For our experiments on masked language modeling, we use the validation set of the Wikitext-103 benchmark as our evaluation data (Merity et al., 2016)... For GLUE experiments, we use the full training data for each of the tasks to compute features, and the full validation sets to compute losses. The amount of data available for each task varies, and statistics are also reported in Table 4 in Appendix A. ... Table 4: Results on GLUE for both the original BERT model (Devlin et al., 2019), and our reproduction across MultiBERTs models 1-5. ... Training instances ... Validation instances |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'BERT models' and 'SGD variant' but does not specify any software libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | We consider 5 different BERT models, seeds 1 through 5... We report vanilla averaging as our main baseline for comparison, computed as θavg = 1/2(θA + θB). ... Specifically, we use 21 samples evenly spaced between λ = 0 and λ = 1, inclusive. ... To compute MLM loss/pseudo-perplexity, we use a masking probability of p = 0.15 across block sizes of 128 tokens. ... For classification tasks, we fine-tune each of the MultiBERTs models with a randomly initialized classification head, including pooling layer and classification layer weights. We keep the head initializations the same across models. ... We train MNLI-mismatched, QQP, QNLI, SST-2, CoLA, STS-B, and RTE tasks for 3 epochs, and we train MRPC for 5 epochs. We follow all other hyperparameters of the reproduction implemented in Ren et al. (2023). |
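The interpolation protocol above (vanilla averaging as the λ = 0.5 point of the linear path θ(λ) = (1 − λ)θA + λθB, sampled at 21 evenly spaced λ values) can be sketched as follows. This is a minimal illustration, not the paper's code: `eval_loss` is a hypothetical stand-in for the MLM or GLUE loss computation, and the barrier definition used here (maximum loss along the path minus the mean endpoint loss) is one common convention, assumed rather than taken from the paper.

```python
# Hedged sketch of linear interpolation between two flat parameter vectors
# and a simple loss-barrier estimate over 21 evenly spaced lambda values.
import numpy as np

def interpolate(theta_a, theta_b, lam):
    """theta(lam) = (1 - lam) * theta_a + lam * theta_b."""
    return (1.0 - lam) * theta_a + lam * theta_b

def loss_barrier(theta_a, theta_b, eval_loss, n_points=21):
    """Max loss along the interpolation path minus the mean endpoint loss.

    eval_loss is a hypothetical callable mapping parameters to a scalar loss.
    """
    lambdas = np.linspace(0.0, 1.0, n_points)  # 21 samples, inclusive
    losses = [eval_loss(interpolate(theta_a, theta_b, lam)) for lam in lambdas]
    endpoint_mean = 0.5 * (losses[0] + losses[-1])
    return max(losses) - endpoint_mean
```

Under this convention, a lower barrier between permuted models than between vanilla-averaged ones is the paper's central quantitative finding.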