Soft Merging of Experts with Adaptive Routing
Authors: Mohammed Muqeeth, Haokun Liu, Colin Raffel
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper... We empirically validate that models using SMEAR outperform models that route based on metadata or learn routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization. All of the code used in our experiments is publicly available... We perform experiments in two real-world settings that differ in model architecture and modality. |
| Researcher Affiliation | Academia | Mohammed Muqeeth EMAIL University of North Carolina at Chapel Hill; Haokun Liu EMAIL University of Toronto, Vector Institute; Colin Raffel EMAIL University of Toronto, Vector Institute |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations (e.g., in Section 3, "More explicitly, we define SMEAR as computing the output of an expert routing block using a merged expert computed as f(u, Σᵢ R(v)ᵢ θᵢ)") but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All of the code used in our experiments is publicly available.1 1https://github.com/r-three/smear |
| Open Datasets | Yes | Specifically, we experiment with fine-tuning T5.1.1 Base (Raffel et al., 2020) on datasets from GLUE (Wang et al., 2018) (referred to as T5-GLUE) and fine-tuning a ResNet18 (He et al., 2016) on DomainNet (Peng et al., 2019) (ResNet-DomainNet). GLUE consists of nine datasets (SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), MNLI (Williams et al., 2017), RTE (Bentivogli et al., 2009), QQP (Shankar et al., 2017), MRPC (Dolan & Brockett, 2005), STS-B (Cer et al., 2017), QNLI (Rajpurkar et al., 2016), and WNLI (Levesque et al., 2012))... DomainNet is a collection of object recognition datasets... (Peng et al., 2019). |
| Dataset Splits | Yes | We follow the approach of Mahabadi et al. (2021) for splitting each GLUE dataset into train, eval, and test splits. |
| Hardware Specification | Yes | All models were trained on 48GB A6000s, except for the Ensemble method, which was trained on 80GB A100s. |
| Software Dependencies | No | The paper mentions specific models and optimizers (e.g., T5.1.1 Base, ResNet18, AdamW optimizer, Adam optimizer, ST-Gumbel estimator, REINFORCE estimator, DSelect-k method) but does not provide specific version numbers for underlying software libraries or programming languages (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | T5 models were trained for 600k steps using a learning rate of 3e-4, with 2k warmup steps, and batch size of 128. The AdamW optimizer was used with its default settings. We ran the ST-Gumbel estimator with a τ value of 10 and an anneal rate of 1e-6... For the REINFORCE estimator, we used the same values as in Clark et al. (2022), α = 1e-2, β = 5e-4, and γ = 1e-2... ResNet models were trained for 100k steps with batch size of 128 and a learning rate of 1e-3, with no warmup, using the Adam optimizer. We used a τ value of 10 and anneal rate of 1e-4 for the ST-Gumbel estimator... The hyperparameter that weighs entropy regularization in DSelect-k is chosen as 0.1. |
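The core operation the paper describes, f(u, Σᵢ R(v)ᵢ θᵢ), merges the experts' parameters under the routing distribution and applies the block function once. A minimal NumPy sketch of that computation is below; the function and variable names (`smear_block`, `router_weights`, the linear block function `f`) are illustrative assumptions, not identifiers from the authors' released code.

```python
import numpy as np

def smear_block(u, v, expert_params, router_weights, f):
    """Sketch of a SMEAR routing block: compute a routing distribution
    R(v) over experts, merge the expert parameters theta_i by that
    distribution, then apply the block function once, i.e.
    f(u, sum_i R(v)_i * theta_i)."""
    # Router: softmax over logits from the routing input v
    # (router_weights is a hypothetical linear router parameterization).
    logits = v @ router_weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # R(v), a distribution over experts
    # Soft-merge the expert parameters instead of picking one expert.
    merged_theta = sum(p * theta for p, theta in zip(probs, expert_params))
    # Single forward pass through the merged expert.
    return f(u, merged_theta)
```

Because the merge happens in parameter space, the block costs roughly one expert forward pass regardless of the number of experts, while remaining fully differentiable with respect to the router (no gradient estimator needed).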