DEPT: Decoupled Embeddings for Pre-training Language Models

Authors: Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, William Shen, Xinchi Qiu, Dongqi Cai, Yan Gao, Nic Lane

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate DEPT's robustness to multilingual and multi-domain data heterogeneity. As shown in Fig. 2, DEPT resists activation divergence and model-norm increases, which can halt perplexity improvements or cause divergence (Zhang et al., 2022; Chowdhery et al., 2023; Wortsman et al., 2024). When using the same local hyperparameters as the baselines, models trained with all DEPT variants maintain lower activation norms due to the regularization effects of Outer Opt (Algorithm 1). Learning rates for baselines are reduced for later comparisons to ensure convergence.
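For concreteness, the "activation norms" tracked here are L2 norms of the hidden activations at each training step; a rising trend signals the divergence the response describes. A minimal sketch (the helper name is ours, not the authors'):

```python
import math

def activation_l2_norm(activations):
    """L2 norm of a flat activation vector.

    Logged per step during pre-training, a steadily rising value flags
    the activation divergence that DEPT's Outer Opt is said to resist.
    """
    return math.sqrt(sum(x * x for x in activations))
```

For example, `activation_l2_norm([3.0, 4.0])` evaluates to `5.0`; in practice one would log this quantity per layer over training steps.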
Researcher Affiliation | Collaboration | Alex Iacob (1,2,*), Lorenzo Sani (1,2,*), Meghdad Kurmanji (1), William F. Shen (1,**), Xinchi Qiu (1,**), Dongqi Cai (1,3,**), Yan Gao (1,2,**), Nicholas D. Lane (1,2,**) ... 1: University of Cambridge; 2: Flower Labs; 3: Beijing University of Posts and Telecommunications.
Pseudocode | Yes | Algorithm 1: Decoupled Embedding for Pre-Training (DEPT) variants: GLOB, TRIM, SPEC
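Algorithm 1 itself is not reproduced in this report, but the Outer Opt step it relies on can be sketched as a FedAvg-style mean over the per-source transformer-body weights, with each data source keeping its own (unaveraged) embedding matrix. All names below are illustrative, not the authors' implementation:

```python
from copy import deepcopy

def outer_opt_average(body_states):
    """FedAvg-style mean of per-source transformer-body state dicts.

    In DEPT, only the shared transformer body is averaged across data
    sources; the decoupled embedding matrices (GLOB/TRIM/SPEC handle
    these differently) stay local and are excluded from this step.
    """
    averaged = deepcopy(body_states[0])
    for key in averaged:
        for other in body_states[1:]:
            averaged[key] = averaged[key] + other[key]
        averaged[key] = averaged[key] / len(body_states)
    return averaged
```

Averaging two sources with weights `{"w": 1.0}` and `{"w": 3.0}` yields `{"w": 2.0}`, leaving the inputs untouched thanks to the deep copy.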
Open Source Code | No | Our software is based on the MosaicML Composer (Databricks, 2024) library for LLM pre-training and the open-source Flower (Beutel et al., 2022) framework for federated learning. Crucially, we heavily rely on the MosaicML hyperparameters and infrastructure for our Inner Opt, making no changes to it after our embedding-matrix manipulation from Algorithm 1 has been performed.
Open Datasets | Yes | To evaluate DEPT on multi-domain data, we use The Pile (Gao et al., 2021), which includes 22 subsets. ... For multilingual data, we use MC4 (Xue et al., 2021) with a mix of high-, medium-, and low-resource languages: English (EN), Italian (IT), and Chinese (ZH) as high-resource; Serbian (SR) and Malay (MS) as medium-resource; and Swahili (SW), Urdu (UR), and Latin (LA) as low-resource.
Dataset Splits | Yes | We assess in-domain generalization by evaluating the perplexity of a model on the test set of each training data source, while OOD generalization is evaluated with unseen datasets. Furthermore, we evaluate DEPT's efficacy in building foundation models through downstream tasks: Natural Language Inference via MNLI (Williams et al., 2018), Question Answering via RACE (Lai et al., 2017), Sentence Similarity via STSB (Cer et al., 2017), and Sentence Classification via SST-2 (Socher et al., 2013). Since we use decoder-only models below the model-size threshold for in-context learning abilities (Brown et al., 2020), we follow Radford et al. (2018) for fine-tuning. The evaluation metrics are accuracy (MNLI, RACE, SST-2) and Pearson correlation (STSB). The full details are in Appendix E.
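For reference, the STSB metric is plain Pearson correlation between predicted and gold similarity scores; the helper below is a generic sketch, not taken from the paper's evaluation code:

```python
import math

def pearson_correlation(preds, golds):
    """Pearson correlation coefficient, as reported for STSB.

    (MNLI, RACE, and SST-2 use plain accuracy instead.)
    """
    n = len(preds)
    mean_p = sum(preds) / n
    mean_g = sum(golds) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(preds, golds))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_g = sum((g - mean_g) ** 2 for g in golds)
    return cov / math.sqrt(var_p * var_g)
```

Perfectly linearly related scores give 1.0 (e.g. predictions `[1, 2, 3]` against gold `[2, 4, 6]`), while reversed rankings give -1.0.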
Hardware Specification | Yes | In terms of hardware, the low-communication properties of DEPT allowed us to run experiments via a mixture of loaned resources from separate cloud providers. Over the course of our experimentation, we used various machines equipped with either 1 H100 or 1 A100 GPU in the USA, Canada, and Europe, which turned out to be more cost-effective. We rented machines with 4-8 H100 GPUs for the centralized baselines, since we could not use Distributed Data Parallelism techniques over low-bandwidth internet connections.
Software Dependencies | No | Our software is based on the MosaicML Composer (Databricks, 2024) library for LLM pre-training and the open-source Flower (Beutel et al., 2022) framework for federated learning. Crucially, we heavily rely on the MosaicML hyperparameters and infrastructure for our Inner Opt, making no changes to it after our embedding-matrix manipulation from Algorithm 1 has been performed.
Experiment Setup | Yes | Full experimental details on our architecture, training hyperparameters (Tables 2 and 8), dataset, and baseline implementation are in Appendix A. ... Table 8 presents the vocabulary-agnostic hyperparameters of our decoder-only models, while Table 9 details vocabulary sizes, DEPT-specific parameters, memory costs, and communication costs. Standard pre-training pipeline parameters were chosen based on the recommendations of Hoffmann et al. (2022) and MosaicML, except for the billion-scale model, where we aligned with the recent state-of-the-art (SOTA) for English federated pre-training by Sani et al. (2024). We always use a gradient-clipping norm of 1 and ALiBi (Press et al., 2022) positional embeddings. ... Multi-domain 12 768 86.4M 12 4 256 4/16 (0.9, 0.95) (10^-1, 6.0x10^-4, 5x10^3) 1.2x10^3 4.5x10^-4 4.5x10^-4 5.0x10^-4
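ALiBi, mentioned above, replaces learned positional embeddings with a per-head linear bias subtracted from attention scores in proportion to query-key distance. A minimal sketch of the standard slope schedule for a power-of-two head count (illustrative helpers, not the authors' code):

```python
def alibi_slopes(num_heads):
    """ALiBi head slopes for a power-of-two number of heads.

    Head i (1-indexed) gets slope 2**(-8 * i / num_heads), i.e. a
    geometric sequence from 2**(-8/num_heads) down to 2**-8.
    """
    return [2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)]

def alibi_bias(slope, query_pos, key_pos):
    """Linear bias added to an attention score: -slope * distance."""
    return -slope * (query_pos - key_pos)
```

With 8 heads the slopes run 1/2, 1/4, ..., 1/256, so distant keys are penalized more steeply on the first head than on the last.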