Neutral residues: revisiting adapters for model extension

Authors: Franck Signe Talla, Edouard Grave, Hervé Jégou

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | In this section, we report experimental results to validate the performance of neutral residues to adapt a neural network to a new domain. We consider the case of adding or improving the multilingual ability of a large language model. More precisely, we start from an English-only model (or a model that has only seen a small amount of non-English data) and finetune it on another target language. Section F covers the case of multiple languages for a multilingual model.
Researcher Affiliation | Industry | 1Kyutai, Paris, France. Correspondence to: Franck SIGNE TALLA <EMAIL>.
Pseudocode | No | The paper describes the methods in narrative text and uses architectural diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | For the multi-lingual finetuning datasets, we use data extracted from Common Crawl... For the English domain, we restricted ourselves to text from Wikipedia... For English, we use text from a domain that does not correspond to the finetuning data: the PubMed subset from The Pile (Gao et al., 2020). Second, we consider standard academic benchmarks used to evaluate large language models, such as question answering or Cloze-style problems. We use the following datasets: ARC challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), CommonSenseQA (Talmor et al., 2018) and Belebele (Bandarkar et al., 2023). For target languages, we use the translated datasets from Dac Lai et al. (2023) and Sakai et al. (2024), when they exist.
Dataset Splits | No | The paper mentions 'held-out sets' for perplexity evaluation and discusses the 'ratio p of data' for mixed training (e.g., 'Training data consists of 90% balanced learned languages and 10% balanced retained languages (p = 0.1)'). However, it does not provide specific train/validation/test split percentages or sample counts for the datasets used in the experiments.
Hardware Specification | No | The paper mentions '16,000 H100 GPUs' in the introduction when discussing the estimated cost of training the Llama 3 model, but this refers to a third-party model and not the specific hardware used for the experiments conducted in this paper.
Software Dependencies | No | The paper mentions using the 'resiliparse package' and 'fastText' for data processing. However, it does not specify version numbers for these or any other software libraries or frameworks used in the experiments.
Experiment Setup | Yes | Hyperparameters. Except mentioned otherwise, LoRA, adapters, and neutral residues use 20% of extra learnable weights. For each method, we selected the learning rate that leads to the best trade-off between learning and forgetting: 5 × 10−5 for finetuning and LoRA, 2 × 10−4 for adapters and neutral residues. The hyperparameter α governing the strength of the sparsity loss is set by default to 0.01 for neutral residues. When training on the new data, we train during 100,000 steps with a batch size of 64 sequences of length 4,096 for both EN-LM-1B and Gemma-2B. We provide other training hyperparameters in Section B.
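To make the reported setup easier to reuse, the hyperparameters quoted above can be collected into a single configuration object. This is a minimal sketch: the key names (`lr`, `sparsity_alpha`, etc.) are our own choices, not the authors', and only the numeric values come from the quoted text.

```python
# Hyperparameters quoted in the "Experiment Setup" row, gathered into one
# dict. Key names are illustrative; values are taken from the paper's text.
CONFIG = {
    "extra_weight_ratio": 0.20,   # LoRA / adapters / neutral residues budget
    "lr": {
        "finetuning": 5e-5,
        "lora": 5e-5,
        "adapters": 2e-4,
        "neutral_residues": 2e-4,
    },
    "sparsity_alpha": 0.01,       # strength of the sparsity loss (neutral residues)
    "steps": 100_000,
    "batch_size": 64,             # sequences per batch
    "seq_len": 4_096,             # tokens per sequence
}

# Implied finetuning token budget under the stated batch and sequence sizes:
tokens_seen = CONFIG["steps"] * CONFIG["batch_size"] * CONFIG["seq_len"]
print(f"{tokens_seen / 1e9:.1f}B tokens")  # prints "26.2B tokens"
```

Note that the ~26B-token budget is derived arithmetic (steps × batch size × sequence length), not a figure stated in the paper.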