Adapting Chat Language Models Using Only Target Unlabeled Language Data

Authors: Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the efficacy of ElChat by experimenting with two popular chat models across seven typologically diverse languages. Our evaluation covers safety, chat, and instruction-following performance. We also assess target and source language task performance and target language inference speed. Our key contributions are as follows: We propose ElChat, which adapts a chat model directly on target unlabeled data, eliminating the need for (i) a base model and (ii) target chat data. ElChat achieves better chat and instruction-following abilities and source language performance than CV. It is also competitive with CV, and more robust (i.e., consistently outperforming the source chat model), on target language and safety tasks (§5.2, §5.3, §5.1). Despite model modifications, ElChat achieves comparable target inference speedups across models and tasks, matching the performance of the adapted VE and CV models (§6).
Researcher Affiliation | Collaboration | Atsuki Yamaguchi (University of Sheffield); Terufumi Morishita (Hitachi, Ltd.); Aline Villavicencio (University of Exeter; University of Sheffield); Nikolaos Aletras (University of Sheffield)
Pseudocode | No | The paper describes methods through textual descriptions and a flowchart (Figure 2), but does not contain a formally structured pseudocode or algorithm block.
Open Source Code | Yes | Our code is available on GitHub. The adapted models are available on the Hugging Face Hub.
Open Datasets | Yes | For the CPT part of VE, we use MADLAD-400 (Kudugunta et al., 2023), which consists of highly filtered document-level samples sourced from Common Crawl, and randomly sample 250K language-specific documents for each language as the target unlabeled data. Following Cahyawijaya et al. (2024), we conduct safety evaluation on target-language translated data including TruthfulQA (Lin et al., 2022), ToxiGen (Hartvigsen et al., 2022), and Implicit Hate (ElSherief et al., 2021). We also measure chat and instruction-following abilities in the source language (English) using IFEval (Zhou et al., 2023), GSM8K (Cobbe et al., 2021) as multi-turn few-shot, and MT-Bench (Zheng et al., 2023). Furthermore, we measure performance on English AlpacaEval v2.0 (Li et al., 2023; Dubois et al., 2024) for additional analysis. We use multi-turn MGSM (Shi et al., 2023) for target language evaluation as it consists of manually translated, high-quality data. For generative tasks, we use summarization (sum) with XL-SUM (Hasan et al., 2021) and English-to-target machine translation (mt) with FLORES-200 (NLLB Team et al., 2022). For a discriminative task, we employ multiple-choice text classification (mc) using Belebele (Bandarkar et al., 2024) and Global-MMLU (gmmlu) (Singh et al., 2025) as general target language understanding benchmarks. We also use mmlu (Hendrycks et al., 2021) as an English language understanding benchmark and English bbh (Srivastava et al., 2023; Suzgun et al., 2023) as a stress-test benchmark.
Dataset Splits | Yes | For the CPT part of VE, we use MADLAD-400 (Kudugunta et al., 2023), which consists of highly filtered document-level samples sourced from Common Crawl, and randomly sample 250K language-specific documents for each language as the target unlabeled data... Following Ahia et al. (2023), we use 500 random samples for the generative tasks sum and mt; the remaining tasks use the full test sets for evaluation. We report zero-shot and three-shot performance for sum and mt, respectively, averaged across three different runs. For the remaining tasks, we report single-run zero-shot performance for IFEval, MT-Bench, ToxiGen, and Implicit Hate; three-shot performance for mc and TruthfulQA; and five-shot performance for gmmlu, mmlu, GSM8K, and MGSM, as these tasks are deterministically evaluated with temperature set to zero.
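The subsampling-and-averaging protocol described above (500 random test samples for sum/mt, scores averaged over three runs) can be sketched as follows. This is a minimal illustration with hypothetical function names, not the authors' code; the seed and score values are invented for demonstration.

```python
import random


def sample_eval_subset(test_set, k=500, seed=0):
    """Draw a fixed-size random evaluation subset (the 500-sample
    protocol for the generative tasks sum and mt). If the test set
    is smaller than k, it is used whole."""
    items = list(test_set)
    if len(items) <= k:
        return items
    return random.Random(seed).sample(items, k)


def average_over_runs(scores_per_run):
    """Average one metric across repeated runs (e.g., three runs)."""
    return sum(scores_per_run) / len(scores_per_run)


# Usage: a hypothetical 10,000-item test set and three run scores.
subset = sample_eval_subset(range(10_000), k=500, seed=42)
mean_score = average_over_runs([0.31, 0.29, 0.30])
```

Fixing the seed per run keeps the subset comparable across models, while averaging over runs smooths the sampling noise introduced by non-greedy decoding.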
Hardware Specification | Yes | We use either a single NVIDIA A100 (80GB), NVIDIA H100 (80GB), or NVIDIA GH200 (96GB) for CPT. For CPT with Qwen3 14B, we use a single AMD MI300X GPU. For evaluation, we use a single NVIDIA A100 (80GB) for all Llama 3.1 models, a single NVIDIA H100 (80GB) for all Qwen2.5 models, and a single AMD MI300X GPU for all Qwen3 models to ensure accurate measurement of inference efficiency.
Software Dependencies | No | We implement our models using PyTorch (Ansel et al., 2024) and Hugging Face Transformers (Wolf et al., 2020). Tokenizer Training: We train tokenizers using Hugging Face Tokenizers. Preprocessing: We preprocess datasets with Hugging Face Datasets (Lhoest et al., 2021). Our experimental design is based on the findings of Tejaswi et al. (2024).
Experiment Setup | Yes | Table 4 (hyperparameters for continual pre-training): batch size 32; number of training steps 30,517; Adam ϵ 1e-8; Adam β1 0.9; Adam β2 0.999; sequence length 512; learning rate 5e-5; learning rate scheduler cosine; warmup steps first 5% of steps; weight decay 0.01; attention dropout 0.0; training precision BF16. Table 5 (parameters for non-greedy generative tasks, mt and sum): temperature 0.8; repetition penalty 1.1; top-k 40; top-p 0.9; beam width 5; sampling true; early stopping true; maximum number of generated tokens 128.
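The Table 4 values can be collected into a single config, and the "cosine scheduler with warmup over the first 5% of steps" can be made concrete. This is a hedged sketch: the dict and function names are hypothetical, the values come from Table 4 above, and the exact schedule shape (linear warmup, then cosine decay to zero) is an assumption about the standard implementation rather than something stated in the paper.

```python
import math

# Hypothetical config holder; values reproduce Table 4.
CPT_HPARAMS = {
    "batch_size": 32,
    "num_training_steps": 30_517,
    "adam_eps": 1e-8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "sequence_length": 512,
    "learning_rate": 5e-5,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.05,      # "first 5% of steps"
    "weight_decay": 0.01,
    "attention_dropout": 0.0,
    "precision": "bf16",
}


def lr_at_step(step, hp=CPT_HPARAMS):
    """Assumed schedule: linear warmup over the first 5% of steps,
    then cosine decay from the peak learning rate down to zero."""
    total = hp["num_training_steps"]
    warmup = int(hp["warmup_ratio"] * total)   # 1,525 of 30,517 steps
    peak = hp["learning_rate"]
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

Under this assumed shape, the learning rate rises from 0 to 5e-5 over the first 1,525 steps and decays smoothly to zero by step 30,517.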