Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

Authors: Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce." (Section 5, Empirical Results)
Researcher Affiliation | Collaboration | Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu; affiliations: Meta GenAI and the University of California, Los Angeles
Pseudocode | Yes | Algorithm 1: Layer Swapping
Open Source Code | No | The paper's Reproducibility section states: "We provide pseudocode of the layer swapping algorithm defined in Section 4." This refers to pseudocode only; no open-source implementation or repository link is provided.
Open Datasets | Yes | Table 3 lists the datasets used for supervised fine-tuning (SFT): the Orca Math word problems dataset from Microsoft (Mitra et al., 2024), https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k, and the Aya Dataset from Cohere For AI (Singh et al., 2024a), https://huggingface.co/datasets/CohereForAI/aya_dataset
Dataset Splits | No | "We perform SFT runs using next token prediction with 30-40k labeled samples with varying hyperparameters." This gives the total sample count but not how the samples are split into training, validation, and test sets for the SFT process.
Hardware Specification | No | The paper discusses model training and evaluation but does not specify the hardware used (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper provides fine-tuning hyperparameters in Appendix A.2 but does not list any specific software dependencies (e.g., libraries or frameworks) with version numbers.
Experiment Setup | Yes | Table 4 ("Hyperparameters for the training runs that led to each of our experts") lists specific values for learning rate, sequence length, weight decay, gradient clipping max norm, scheduler warmup steps, and decay rate. The paper also notes: "in all runs, we do checkpointing every 5000 samples and use a different random seed for data sampling."
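Since the paper provides Algorithm 1 as pseudocode only, a minimal Python sketch of the layer-swapping merge may help illustrate the idea. This is an approximation, not the authors' implementation: it assumes both experts were fine-tuned from the same base model (so their state dicts share identical keys) and uses a hypothetical "layers.<i>." parameter-naming convention; real checkpoints use similar but not identical names, and any additional handling in the paper's Algorithm 1 is not reproduced here.

```python
def _layer_index(param_name):
    """Extract the transformer-layer index from a name like
    'layers.5.attn.weight'; return None for parameters outside
    the layer stack (embeddings, output head). The 'layers.<i>.'
    convention is an assumption for this sketch."""
    parts = param_name.split(".")
    if len(parts) > 1 and parts[0] == "layers" and parts[1].isdigit():
        return int(parts[1])
    return None


def layer_swap(math_expert, lang_expert, num_layers, n_bottom=2, n_top=2):
    """Build a merged state dict: the math expert's weights everywhere,
    except the bottom `n_bottom` and top `n_top` transformer layers,
    which are taken from the language expert. `n_bottom`/`n_top` are
    illustrative defaults, not values from the paper."""
    swapped = set(range(n_bottom)) | set(range(num_layers - n_top, num_layers))
    merged = {}
    for name, weight in math_expert.items():
        layer = _layer_index(name)
        if layer is not None and layer in swapped:
            merged[name] = lang_expert[name]
        else:
            merged[name] = weight
    return merged
```

In a real setting the two arguments would be PyTorch state dicts from the two fine-tuned experts, and the merged dict would be loaded back into the shared base architecture; here plain dicts stand in for them.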