Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
Authors: Noah Flynn
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments. |
| Researcher Affiliation | Academia | Noah Flynn EMAIL UC Berkeley |
| Pseudocode | Yes | Algorithm 1 COMPASS: Distribution-Guided Auxiliary Data Sampling |
| Open Source Code | No | The paper does not provide an explicit statement or a direct link to a source-code repository for the COMPASS methodology described. |
| Open Datasets | Yes | Aya Dataset (Singh et al., 2024): Our primary fine-tuning data source serving as the pool for Daux. Aya is a large open multilingual instruction tuning dataset... Global-MMLU (Singh et al., 2025): For primary evaluation of COMPASS, we use Global-MMLU... MMLU-ProX (Xuan et al., 2025): MMLU-ProX offers another challenging evaluation dataset. |
| Dataset Splits | Yes | We set E_t as the dev set of Global-MMLU or MMLU-ProX questions in that language (for distribution analysis) and use the test sets of the evaluation benchmarks (detailed below) for final scoring. Global-MMLU... a dev set of 285 instances per language. MMLU-ProX... use each language's available dev set and the test set of MMLU-ProX-Lite as a proxy for live data to tune adapters on COMPASS-derived Aya training data, evaluating them on test samples from the full MMLU-ProX dataset (minus the test samples contained within MMLU-ProX-Lite). |
| Hardware Specification | Yes | Using Jina-Embeddings-v3-570M (A100 GPU, batch size 128), we embed 204K examples from the Aya dataset in 42.4 minutes (averaged over 3 embedding runs). |
| Software Dependencies | No | The paper mentions software components like the 'AdamW optimizer', 'DoRA' and 'LoRA', and specific embedding and language identification models ('Jina-Embeddings-v3-570M', 'GlotLID-v3'), but it does not provide specific version numbers for general libraries or frameworks like Python, PyTorch, or the mentioned optimizers to ensure reproducibility of the software environment. |
| Experiment Setup | Yes | We use the AdamW optimizer (β1 = 0.9, β2 = 0.999), weight decay of 0.1, gradient clipping at 1.0, batch size of 128, and a learning rate of 2e-4 for Phi-4-mini and 1e-4 for LLaMA-8B and Qwen2.5-7B, with a 0.1 warmup ratio and cosine scheduler in each setting. We limit fine-tuning to 3 epochs with early stopping. ... rank r = 16 for Phi-4-mini and Qwen2.5-7B adapters, and r = 8 for LLaMA-8B. |
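The hyperparameters quoted in the Experiment Setup row can be collected into one per-model configuration. The sketch below is illustrative only: the function name `training_config` and the dictionary layout are invented here, while the values (AdamW betas, weight decay, clipping, batch size, warmup, scheduler, learning rates, and adapter ranks) are copied from the quoted setup.

```python
def training_config(model_name: str) -> dict:
    """Collect the reported COMPASS fine-tuning hyperparameters for one base model.

    A hedged sketch, not code from the paper; values mirror the
    Experiment Setup row of the table above.
    """
    # Settings shared across all three models.
    config = {
        "optimizer": "AdamW",
        "adam_betas": (0.9, 0.999),
        "weight_decay": 0.1,
        "max_grad_norm": 1.0,   # gradient clipping at 1.0
        "batch_size": 128,
        "warmup_ratio": 0.1,
        "lr_scheduler": "cosine",
        "max_epochs": 3,        # with early stopping
        "early_stopping": True,
    }
    # Per-model learning rate and adapter rank, as reported.
    per_model = {
        "Phi-4-mini": {"learning_rate": 2e-4, "lora_rank": 16},
        "Qwen2.5-7B": {"learning_rate": 1e-4, "lora_rank": 16},
        "LLaMA-8B":   {"learning_rate": 1e-4, "lora_rank": 8},
    }
    config.update(per_model[model_name])
    return config
```

Keeping the shared and per-model settings separate makes it easy to verify that only the learning rate and adapter rank vary across architectures.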