Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

Authors: Noah Flynn

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-Pro X), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.
Researcher Affiliation | Academia | Noah Flynn EMAIL UC Berkeley
Pseudocode | Yes | Algorithm 1 COMPASS: Distribution-Guided Auxiliary Data Sampling
Open Source Code | No | The paper does not provide an explicit statement or a direct link to a source-code repository for the COMPASS methodology described.
Open Datasets | Yes | Aya Dataset (Singh et al., 2024): Our primary fine-tuning data source serving as the pool for Daux. Aya is a large open multilingual instruction tuning dataset... Global-MMLU (Singh et al., 2025): For primary evaluation of COMPASS, we use Global-MMLU... MMLU-Pro X (Xuan et al., 2025): MMLU-Pro X offers another challenging evaluation dataset.
Dataset Splits | Yes | We set Et as the dev set of Global-MMLU or MMLU-Pro X questions in that language (for distribution analysis) and use the test sets of the evaluation benchmarks (detailed below) for final scoring. Global-MMLU... a dev set of 285 instances per language. MMLU-Pro X... use each language's available dev set and the test set of MMLU-Pro X-Lite as a proxy for live data to tune adapters on COMPASS-derived Aya training data, evaluating them on test samples from the full MMLU-Pro X dataset (minus the test samples contained within MMLU-Pro X-Lite).
Hardware Specification | Yes | Using Jina-Embeddings-v3-570M (A100 GPU, batch size 128), we embed 204K examples from the Aya dataset in 42.4 minutes (averaged over 3 embedding runs).
Software Dependencies | No | The paper mentions software components like the 'AdamW optimizer', 'DoRA', and 'LoRA', and specific embedding and language identification models ('Jina-Embeddings-v3-570M', 'GlotLID-v3'), but it does not provide specific version numbers for general libraries or frameworks like Python, PyTorch, or the mentioned optimizers to ensure reproducibility of the software environment.
Experiment Setup | Yes | We use the AdamW optimizer (β1 = 0.9, β2 = 0.999), weight decay of 0.1, gradient clipping of 1, batch size of 128, and a learning rate of 2e-4 for Phi-4-mini and 1e-4 for LLaMA-8B and Qwen2.5-7B, with a 0.1 warmup ratio and cosine scheduler in each setting. We limit fine-tuning to 3 epochs with early stopping. ... rank r = 16 for Phi-4-mini and Qwen2.5-7B adapters, and r = 8 for LLaMA-8B.
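The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 10% of steps, then cosine decay from the peak rate) can be sketched as a small standalone function. This is a hypothetical helper for illustration, not code from the paper; the step counts and the `lr_at_step` name are assumptions.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Cosine schedule with linear warmup, matching the quoted setup:
    warmup ratio 0.1, cosine decay from the peak learning rate."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Peak LR 2e-4 (the Phi-4-mini setting); 1000 steps is an assumed horizon.
total = 1000
print(lr_at_step(0, total, 2e-4))    # small value early in warmup
print(lr_at_step(100, total, 2e-4))  # at the peak once warmup ends
print(lr_at_step(999, total, 2e-4))  # decayed to near zero at the end
```

The same shape applies to the LLaMA-8B and Qwen2.5-7B runs with a peak of 1e-4; only `peak_lr` changes across settings.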