Enhancing Non-English Capabilities of English-Centric Large Language Models Through Deep Supervision Fine-Tuning

Authors: Wenshuai Huo, Xiaocheng Feng, Yichong Huang, Chengpeng Fu, Baohang Li, Yangfan Ye, Zhirui Zhang, Dandan Tu, Duyu Tang, Yunfei Lu, Hui Wang, Bing Qin

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted extensive experiments on typical English-centric LLMs, LLaMA-2 and Gemma-2. The results on 8 multilingual datasets show that our method significantly outperforms traditional fine-tuning methods.
Researcher Affiliation Collaboration Harbin Institute of Technology; Pengcheng Laboratory; Huawei Technologies Co., Ltd.
Pseudocode No The paper describes methods in text and uses diagrams (Figure 1, Figure 2) to illustrate the overall process, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code No Our code implementation is based on Stanford Alpaca (https://github.com/tatsu-lab/stanford_alpaca).
Open Datasets Yes XQuAD (Artetxe, Ruder, and Yogatama 2019) is a high-quality cross-lingual question answering dataset; MLQA (Lewis et al. 2019) is a multilingual question answering dataset; the MKQA (Longpre, Lu, and Daiber 2021) dataset contains 2,600 common-sense question-answer pairs; TruthfulQA (Lin, Hilton, and Evans 2021) includes questions from various domains; XNLI (Conneau et al. 2018) is a widely used language understanding dataset; XCOPA (Ponti et al. 2020) is a benchmark designed to evaluate the ability of models to apply commonsense reasoning; XStoryCloze (Lin et al. 2022) is a cross-lingual dataset for evaluating models' ability to understand stories; MMLU (Hendrycks et al. 2020) is a large-scale multitask language understanding dataset. The training data consists of Stanford Alpaca instruction data (Taori et al. 2023).
Dataset Splits No The training data consists of Stanford Alpaca instruction data (Taori et al. 2023) and its translations in the target languages, which include Chinese (zh), Vietnamese (vi), and Arabic (ar). For the translated data, we directly used publicly available datasets from (Zhu et al. 2023b). For all evaluation datasets, we conducted tests using a zero-shot setting. The paper does not specify explicit train/validation/test splits for the instruction data used for fine-tuning.
Hardware Specification Yes All experiments were conducted on 8 A100 GPUs with a batch size of 128.
Software Dependencies No Our code implementation is based on Stanford Alpaca. To accelerate training, we utilized the FSDP training strategy (Zhao et al. 2023). The paper mentions basing their code on Stanford Alpaca and using FSDP, but does not provide specific version numbers for any software libraries or frameworks.
Experiment Setup Yes The models were trained for 3 epochs with a learning rate of 2e-5.
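The reported hyperparameters (3 epochs, learning rate 2e-5, global batch size 128 on 8 A100 GPUs with FSDP) can be arranged into a single configuration sketch. This is a minimal illustration, not the authors' code: the key names and the per-device micro-batch size are assumptions.

```python
# Illustrative sketch of the reported fine-tuning setup.
# Values marked "reported" come from the paper; the rest are assumptions.
NUM_GPUS = 8              # A100 GPUs (reported)
GLOBAL_BATCH_SIZE = 128   # effective batch size (reported)
PER_DEVICE_BATCH = 4      # assumed per-device micro-batch

# Gradient accumulation steps needed so that
# NUM_GPUS * PER_DEVICE_BATCH * accum_steps == GLOBAL_BATCH_SIZE
accum_steps = GLOBAL_BATCH_SIZE // (NUM_GPUS * PER_DEVICE_BATCH)

train_config = {
    "num_train_epochs": 3,                      # reported
    "learning_rate": 2e-5,                      # reported
    "per_device_batch_size": PER_DEVICE_BATCH,  # assumed split
    "gradient_accumulation_steps": accum_steps,
    "distributed_strategy": "FSDP",             # reported (Zhao et al. 2023)
}
print(train_config["gradient_accumulation_steps"])  # → 4
```

With a per-device micro-batch of 4, four accumulation steps across 8 GPUs reproduce the reported effective batch size of 128; a different micro-batch choice would simply change the accumulation factor.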