Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

Authors: Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce." (Section 5, Empirical Results)
Researcher Affiliation | Collaboration | Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu; affiliations: Meta GenAI and the University of California, Los Angeles
Pseudocode | Yes | Algorithm 1: Layer Swapping
Open Source Code | No | The paper's Reproducibility section states: "We provide pseudocode of the layer swapping algorithm defined in Section 4." This refers to pseudocode only; no open-source implementation or repository link is provided.
Open Datasets | Yes | Table 3 lists the datasets used for supervised fine-tuning (SFT): the Orca Math word problems dataset from Microsoft (Mitra et al., 2024), https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k, and the Aya Dataset from Cohere For AI (Singh et al., 2024a), https://huggingface.co/datasets/CohereForAI/aya_dataset
Dataset Splits | No | "We perform SFT runs using next token prediction with 30-40k labeled samples with varying hyperparameters." This gives the total sample count but not how the samples are split into training, validation, and test sets for the SFT process.
Hardware Specification | No | The paper discusses model training and evaluation but does not specify the hardware used (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper provides fine-tuning hyperparameters in Appendix A.2 but does not list any specific software dependencies (e.g., libraries or frameworks) with version numbers.
Experiment Setup | Yes | Table 4 ("Hyperparameters for the training runs that led to each of our experts") lists specific values for learning rate, sequence length, weight decay, gradient clipping max norm, scheduler warmup steps, and decay rate. The paper also notes: "in all runs, we do checkpointing every 5000 samples and use a different random seed for data sampling."
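Since the paper provides Algorithm 1 as pseudocode only, a minimal Python sketch of the layer-swapping merge may help illustrate the idea. This is an approximation, not the authors' implementation: it assumes both experts were fine-tuned from the same base model (so their state dicts share identical keys) and uses a hypothetical "layers.<i>." parameter-naming convention; real checkpoints use similar but not identical names, and any additional handling in the paper's Algorithm 1 is not reproduced here.

```python
def _layer_index(param_name):
    """Extract the transformer-layer index from a name like
    'layers.5.attn.weight'; return None for parameters outside
    the layer stack (embeddings, output head). The 'layers.<i>.'
    convention is an assumption for this sketch."""
    parts = param_name.split(".")
    if len(parts) > 1 and parts[0] == "layers" and parts[1].isdigit():
        return int(parts[1])
    return None


def layer_swap(math_expert, lang_expert, num_layers, n_bottom=2, n_top=2):
    """Build a merged state dict: the math expert's weights everywhere,
    except the bottom `n_bottom` and top `n_top` transformer layers,
    which are taken from the language expert. `n_bottom`/`n_top` are
    illustrative defaults, not values from the paper."""
    swapped = set(range(n_bottom)) | set(range(num_layers - n_top, num_layers))
    merged = {}
    for name, weight in math_expert.items():
        layer = _layer_index(name)
        if layer is not None and layer in swapped:
            merged[name] = lang_expert[name]
        else:
            merged[name] = weight
    return merged
```

In a real setting the two arguments would be PyTorch state dicts from the two fine-tuned experts, and the merged dict would be loaded back into the shared base architecture; here plain dicts stand in for them.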