Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?

Authors: HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, Huda Khayrallah

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across 11 languages with diverse scripts, resource availability, and fragmentation, we demonstrate that VocADT outperforms the original Mistral model (Jiang et al., 2023) and other baselines across various multilingual tasks, including natural language understanding and machine translation. Results show that our approach consistently surpasses the original Mistral model in most cases, both after the adaptation phase and after the full-weight training phase.
Researcher Affiliation | Collaboration | Hyo Jung Han (University of Maryland), Akiko Eriguchi (Microsoft), Haoran Xu (Microsoft), Hieu Hoang (Microsoft), Marine Carpuat (University of Maryland), Huda Khayrallah (Amazon)
Pseudocode | No | The paper describes the methodology using mathematical formulas and conceptual diagrams (Figure 1b), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://github.com/h-j-han/VocADT. Models on the Hugging Face Hub.
Open Datasets | Yes | For MT of English to non-English (en-xx) and non-English to English (xx-en), we use FLORES (Goyal et al., 2022; NLLB Team et al., 2022), as it supports all the languages that we experiment with. We use five-shot MT prompting for the model from the adaptation phase, and zero-shot prompting for the model after the ALMA training phase. We assess translation quality with xCOMET-XL (Guerreiro et al., 2023), which produces a quality score ranging from 0 to 1, with higher values indicating better quality. For NLI and reasoning, we use XNLI (Conneau et al., 2018) and XCOPA (Ponti et al., 2020) with zero-shot prompting. For multiple-choice QA, we use Belebele (Bandarkar et al., 2024) and Multilingual MMLU (Hendrycks et al., 2021; Lai et al., 2023, MMMLU) with five-shot prompting. All the tasks except MT are classification tasks, for which we use the lm-evaluation-harness (Gao et al., 2024) evaluation tool and report accuracy.
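The five-shot MT prompting described above can be sketched as follows. This is an illustrative assumption, not the paper's verbatim template: the paper follows the prompting strategy of Xu et al. (2024), and the exact wording of the template below, as well as the `build_mt_prompt` helper name, are hypothetical.

```python
def build_mt_prompt(shots, src_sentence, src_lang="English", tgt_lang="German"):
    """Assemble a few-shot translation prompt from (source, target) example pairs.

    The final block leaves the target side empty for the model to complete.
    """
    blocks = []
    for src, tgt in shots:
        blocks.append(
            f"Translate this from {src_lang} to {tgt_lang}:\n"
            f"{src_lang}: {src}\n{tgt_lang}: {tgt}"
        )
    blocks.append(
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {src_sentence}\n{tgt_lang}:"
    )
    return "\n\n".join(blocks)

# Example: two in-context shots plus the sentence to translate.
shots = [("Hello.", "Hallo."), ("Thank you.", "Danke.")]
prompt = build_mt_prompt(shots, "Good morning.")
```

In a five-shot setting, `shots` would hold five FLORES dev-set pairs; zero-shot prompting after the ALMA phase would simply pass an empty list.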
Dataset Splits | Yes | 1) In monolingual fine-tuning, we use MADLAD-400. ... 2) In the subsequent parallel training, we sample 15k bitext pairs from the NLLB dataset (Schwenk et al., 2021b; Heffernan et al., 2022; NLLB Team et al., 2022) for each English–non-English training pair, selecting those with the top LASER3 scores (Artetxe & Schwenk, 2019). Parallel training is done for one epoch, and we report test-set numbers with the model that performs best on the validation set. All models are fine-tuned and tested on both directions (en-xx and xx-en) within a single model, meaning there are no separate models for opposite translation directions. We follow the prompting strategy of Xu et al. (2024). We use five-shot MT prompting for the model from the adaptation phase, and zero-shot prompting for the model after the ALMA training phase.
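The bitext selection step above (keep the 15k pairs with the highest LASER3 alignment scores per language pair) reduces to a sort-and-truncate. A minimal sketch, assuming scores have already been computed and stored alongside each sentence pair; the tuple layout and the `select_top_bitext` name are illustrative, not from the paper:

```python
def select_top_bitext(scored_pairs, k=15000):
    """Keep the k sentence pairs with the highest alignment scores.

    scored_pairs: iterable of (laser_score, src_sentence, tgt_sentence).
    Returns the top-k tuples, highest score first.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    return ranked[:k]

# Example with a toy pool of three scored pairs, keeping the best two.
pool = [(0.91, "a", "A"), (0.55, "b", "B"), (0.88, "c", "C")]
top2 = select_top_bitext(pool, k=2)
```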
Hardware Specification | Yes | We use four NVIDIA A100 GPUs for adapter training and 16 AMD MI200 GPUs for full-weight fine-tuning.
Software Dependencies | No | The paper mentions several tools and libraries, such as SentencePiece, lm-evaluation-harness, and LASER3, but does not provide specific version numbers for these or for other key software components like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We train SentencePiece (Kudo & Richardson, 2018) tokenizers on either language-specific corpora or a combined corpus, with a maximum of 2 million tokens per language, and create new vocabularies with a size of 50k for all cases, including mono- and multilingual vocabularies (|V_n| = 50k). ... We train on 0.5B monolingual tokens per language, totaling 2.5B tokens mixed from 5 languages (English + 4 non-English from each corresponding group), and report test numbers from it. We set the weighting factor of the auxiliary loss α to 0.1 for non-Latin groups and 0 for the Latin group unless otherwise specified. ... In adapter training for VocADT, we use a (peak) learning rate of 2e-6 with a cosine scheduler, a maximum sequence length of 512 tokens, a warm-up ratio of 0.01, and a weight decay of 0.01. In the full-weight fine-tuning phase, we mostly follow the training settings from ALMA.
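The learning-rate schedule reported above (linear warm-up over the first 1% of steps to a peak of 2e-6, then cosine decay) can be sketched in a few lines. This is a generic reconstruction, not code from the paper; in particular, decaying all the way to zero at the final step is an assumption, since the paper does not state a floor learning rate.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-6, warmup_ratio=0.01):
    """Learning rate at a given 0-indexed step.

    Linear warm-up to peak_lr over warmup_ratio of training,
    then cosine decay from peak_lr toward zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr at the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, the warm-up covers the first 10 steps, the peak of 2e-6 is reached at step 9, and the rate then falls monotonically toward zero.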