Neutral residues: revisiting adapters for model extension

Authors: Franck Signe Talla, Edouard Grave, Hervé Jégou

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | In this section, we report experimental results to validate the performance of neutral residues to adapt a neural network to a new domain. We consider the case of adding or improving the multilingual ability of a large language model. More precisely, we start from an English-only model (or a model that has only seen a small amount of non-English data) and finetune it on another target language. Section F covers the case of multiple languages for a multilingual model.
Researcher Affiliation | Industry | 1Kyutai, Paris, France. Correspondence to: Franck SIGNE TALLA <EMAIL>.
Pseudocode | No | The paper describes the methods in narrative text and uses architectural diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | For the multi-lingual finetuning datasets, we use data extracted from Common Crawl... For the English domain, we restricted ourselves to text from Wikipedia... For English, we use text from a domain that does not correspond to the finetuning data: the PubMed subset from The Pile (Gao et al., 2020). Second, we consider standard academic benchmarks used to evaluate large language models, such as question answering or Cloze-style problems. We use the following datasets: ARC challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), CommonSenseQA (Talmor et al., 2018) and Belebele (Bandarkar et al., 2023). For target languages, we use the translated datasets from Dac Lai et al. (2023) and Sakai et al. (2024), when they exist.
Dataset Splits | No | The paper mentions 'held-out sets' for perplexity evaluation and discusses the 'ratio p of data' for mixed training (e.g., 'Training data consists of 90% balanced learned languages and 10% balanced retained languages (p = 0.1)'). However, it does not provide specific train/validation/test split percentages or sample counts for the datasets used in the experiments.
Hardware Specification | No | The paper mentions '16,000 H100 GPUs' in the introduction when discussing the estimated cost of training the Llama 3 model, but this refers to a third-party model and not the specific hardware used for the experiments conducted in this paper.
Software Dependencies | No | The paper mentions using the 'resiliparse package' and 'fastText' for data processing. However, it does not specify version numbers for these or any other software libraries or frameworks used in the experiments.
Experiment Setup | Yes | Hyperparameters. Except mentioned otherwise, LoRA, adapters, and neutral residues use 20% of extra learnable weights. For each method, we selected the learning rate that leads to the best trade-off between learning and forgetting: 5 × 10−5 for finetuning and LoRA, 2 × 10−4 for adapters and neutral residues. The hyperparameter α governing the strength of the sparsity loss is set by default to 0.01 for neutral residues. When training on the new data, we train during 100,000 steps with a batch size of 64 sequences of length 4,096 for both EN-LM-1B and Gemma-2B. We provide other training hyperparameters in Section B.
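To make the reported setup easier to reuse, the hyperparameters quoted above can be collected into a single configuration object. This is a minimal sketch: the key names (`lr`, `sparsity_alpha`, etc.) are our own choices, not the authors', and only the numeric values come from the quoted text.

```python
# Hyperparameters quoted in the "Experiment Setup" row, gathered into one
# dict. Key names are illustrative; values are taken from the paper's text.
CONFIG = {
    "extra_weight_ratio": 0.20,   # LoRA / adapters / neutral residues budget
    "lr": {
        "finetuning": 5e-5,
        "lora": 5e-5,
        "adapters": 2e-4,
        "neutral_residues": 2e-4,
    },
    "sparsity_alpha": 0.01,       # strength of the sparsity loss (neutral residues)
    "steps": 100_000,
    "batch_size": 64,             # sequences per batch
    "seq_len": 4_096,             # tokens per sequence
}

# Implied finetuning token budget under the stated batch and sequence sizes:
tokens_seen = CONFIG["steps"] * CONFIG["batch_size"] * CONFIG["seq_len"]
print(f"{tokens_seen / 1e9:.1f}B tokens")  # prints "26.2B tokens"
```

Note that the ~26B-token budget is derived arithmetic (steps × batch size × sequence length), not a figure stated in the paper.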