No Need to Talk: Asynchronous Mixture of Language Models

Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on language modeling demonstrate that SMALLTALK LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on 75% of the tasks.
Researcher Affiliation Collaboration Anastasiia Filippova EPFL Angelos Katharopoulos Apple David Grangier Apple Ronan Collobert Apple
Pseudocode Yes Algorithm 1 SMALLTALK LM training.
 1: # Train the routers
 2: X ← N new sequences from the dataset
 3: X1:E ← random_assignments(X)
 4: for i = 1 … T do
 5:   for e = 1 … E do
 6:     θr,e ← arg min_{θr,e} L(Xe; θr,e)  # Optimize Equation 9 with SGD for the e-th router
 7:   end for
 8:   X ← N new sequences from the dataset
 9:   X1:E ← balanced_assignments(X, θr)  # Segment the data according to Equation 4
10: end for
11: # Train the experts
12: X ← M new sequences from the dataset comprising the total number of training tokens
13: X1:E ← balanced_assignments(X, θr)
14: for e = 1 … E do
15:   θe ← arg min_{θe} L(Xe; θe)  # Optimize Equation 1 with SGD for the e-th expert
16: end for
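The two-stage structure of Algorithm 1 (EM-style router training followed by independent expert training) can be sketched in Python. This is an illustrative toy, not the paper's implementation: `sample_sequences`, `router_score`, `train`, and `balanced_assignments` are hypothetical stand-ins for the paper's data pipeline, router model, SGD inner loops (Equations 9 and 1), and balanced routing (Equation 4).

```python
import random

def sample_sequences(n):
    """Hypothetical data loader: n toy 'sequences' as random feature vectors."""
    return [[random.random() for _ in range(4)] for _ in range(n)]

def router_score(theta, x):
    """Toy router: dot product between router parameters and sequence features."""
    return sum(t * f for t, f in zip(theta, x))

def balanced_assignments(X, thetas):
    """Greedy balanced routing (stand-in for Equation 4): each of the E experts
    receives len(X)/E sequences, filled in descending router-score order."""
    E, cap = len(thetas), len(X) // len(thetas)
    buckets = [[] for _ in range(E)]
    pairs = sorted(((router_score(thetas[e], x), e, i)
                    for i, x in enumerate(X) for e in range(E)), reverse=True)
    taken = set()
    for _, e, i in pairs:
        if i not in taken and len(buckets[e]) < cap:
            buckets[e].append(X[i])
            taken.add(i)
    return buckets

def train(params, data):
    """Placeholder for an SGD inner loop (Eq. 9 for routers, Eq. 1 for experts)."""
    return [p + 0.01 * sum(x[j] for x in data) / max(len(data), 1)
            for j, p in enumerate(params)]

E, T = 2, 3
routers = [[random.random() for _ in range(4)] for _ in range(E)]

# Stage 1: EM-style router training (Algorithm 1, lines 1-10).
X = sample_sequences(8)
buckets = [X[e::E] for e in range(E)]           # random initial assignments
for _ in range(T):
    routers = [train(routers[e], buckets[e]) for e in range(E)]
    X = sample_sequences(8)
    buckets = balanced_assignments(X, routers)  # re-segment with updated routers

# Stage 2: experts train independently on their segments (lines 11-16),
# which is what removes the need for synchronous expert communication.
X = sample_sequences(16)
buckets = balanced_assignments(X, routers)
experts = [train([0.0] * 4, buckets[e]) for e in range(E)]
```

The key property the sketch preserves is that after routing, each expert's training loop touches only its own data segment, so the experts never need to exchange gradients or activations during training.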
Open Source Code No The paper does not provide an explicit statement about releasing their code or a link to a code repository for the SMALLTALK LM methodology. It mentions using PyTorch (Paszke et al., 2019) and Jax (Bradbury et al., 2018), which are third-party frameworks.
Open Datasets Yes We use the RedPajama-V2 dataset (Computer, 2023), which is a large-scale collection of text data designed for training language models. The dataset is built from the ground up based on publicly available web data, consisting of 84 crawls provided by Common Crawl (2024).
Dataset Splits No The paper mentions using a "held-out test set" and that the training data is "segmented" and "balanced assignments" are made to experts. However, it does not specify explicit percentages or sample counts for training, validation, and test splits for the overall dataset, nor does it refer to standard predefined splits with specific details for reproducibility.
Hardware Specification No The paper mentions "# GPUs" in Table 2, indicating the number of GPUs used for training, but does not specify the type or model of these GPUs (e.g., NVIDIA A100, Tesla V100), nor does it mention any CPU specifications or other hardware details.
Software Dependencies No We implemented our EM scheme for training the routers in PyTorch (Paszke et al., 2019). After segmenting the training set, the experts were trained independently using Jax (Bradbury et al., 2018). We use a SentencePiece (Kudo & Richardson, 2018) tokenizer with a vocabulary of 32,000 tokens. For the evaluation, we utilized the lm-eval-harness software (Gao et al., 2024). While the paper names software and cites the respective publications, it does not provide specific version numbers for any of these components.
Experiment Setup Yes We train the models using the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.99, and a weight decay of 0.1. Gradient clipping is applied with a maximum norm of 0.1. For the experts, we employ a learning rate schedule featuring a linear warm-up to 5×10⁻⁴ over the first 3,000 steps, followed by a cosine decay for the remainder of the training period. The routers are trained for 128,000 steps using a constant learning rate of 1×10⁻⁴, following a linear warm-up over the first 1,000 steps. Routers are trained with a batch size of 32. All experiments use sequences of 1,024 tokens. Table 2 provides detailed training parameters including steps, tokens, batch size, and number of GPUs for various model configurations.
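The two learning-rate schedules described above can be written out explicitly. This is a minimal sketch reimplementing the stated hyperparameters (peak 5×10⁻⁴ with 3,000 warm-up steps and cosine decay for the experts; constant 1×10⁻⁴ after 1,000 warm-up steps for the routers); the function names and the total-steps argument are illustrative, not from the paper.

```python
import math

def expert_lr(step, total_steps, warmup=3000, peak=5e-4):
    """Linear warm-up to `peak` over `warmup` steps, then cosine decay to 0."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

def router_lr(step, warmup=1000, lr=1e-4):
    """Linear warm-up over `warmup` steps, then constant."""
    return lr * min(step / warmup, 1.0)
```

Plugging either function into an AdamW optimizer (β1 = 0.9, β2 = 0.99, weight decay 0.1, gradient-norm clipping at 0.1, per the setup above) as a per-step multiplier reproduces the described schedules.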