Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
Authors: Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce." (Section 5, Empirical Results) |
| Researcher Affiliation | Collaboration | The authors (Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu) are affiliated with Meta GenAI and the University of California, Los Angeles. |
| Pseudocode | Yes | Algorithm 1 Layer Swapping |
| Open Source Code | No | The paper states in the Reproducibility section: "We provide pseudocode of the layer swapping algorithm defined in Section 4." This mentions pseudocode, not open-source code for the implementation, and no link is provided. |
| Open Datasets | Yes | Table 3: Datasets used for supervised fine-tuning (SFT) in this project ... Orca Math word problems dataset from Microsoft (Mitra et al., 2024): https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k ... Aya Dataset from Cohere For AI (Singh et al., 2024a): https://huggingface.co/datasets/CohereForAI/aya_dataset |
| Dataset Splits | No | "We perform SFT runs using next token prediction with 30-40k labeled samples with varying hyperparameters." (The paper gives the total sample count but not how the samples are divided into training, validation, or test sets for the SFT process itself.) |
| Hardware Specification | No | The paper discusses model training and evaluation but does not specify the hardware (e.g., GPU models, CPU types, memory) used for these experiments. |
| Software Dependencies | No | The paper provides hyperparameters for fine-tuning in Appendix A.2 but does not list any specific software dependencies (e.g., libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | Table 4: Hyperparameters for the training runs that led to each of our experts (lists specific values for Learn Rate, Seq. Length, weight decay, clip, max norm, sched. warmup steps, and decay rate). It also states: "Note that in all runs, we do checkpointing every 5000 samples and use a different random seed for data sampling." |
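For context, the layer-swapping operation referenced in Algorithm 1 can be sketched as a state-dict merge: the outer (bottom and top) transformer layers of a language expert replace the corresponding layers of a math expert, where both experts were fine-tuned from the same base model and therefore share parameter names and shapes. The sketch below is a minimal illustration over plain dictionaries; the `layers.{i}.` naming pattern, the `n_swap` parameter, and the choice of which layers to swap are assumptions for illustration, not the paper's exact configuration.

```python
def layer_swap(math_expert, lang_expert, num_layers, n_swap):
    """Replace the bottom n_swap and top n_swap transformer layers of the
    math expert with the corresponding layers of the language expert.

    Both experts are state dicts (parameter name -> weights) fine-tuned from
    the same base model, so their keys match one-to-one.
    """
    # Indices of the outer layers to take from the language expert.
    swap_ids = set(range(n_swap)) | set(range(num_layers - n_swap, num_layers))
    merged = dict(math_expert)
    for name, value in lang_expert.items():
        # Parameter names assumed to look like "layers.3.attn.weight".
        parts = name.split(".")
        if parts[0] == "layers" and int(parts[1]) in swap_ids:
            merged[name] = value
    return merged

# Toy example: 4-layer "models" with one scalar parameter per layer.
math = {f"layers.{i}.w": f"math{i}" for i in range(4)}
lang = {f"layers.{i}.w": f"lang{i}" for i in range(4)}
merged = layer_swap(math, lang, num_layers=4, n_swap=1)
# Outer layers come from the language expert, middle layers from the math expert.
```

In a real setting the same loop would run over PyTorch `state_dict()` tensors, with the merged dict loaded back via `load_state_dict`; the dictionary form above only makes the swap logic explicit.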