Enhancing Non-English Capabilities of English-Centric Large Language Models Through Deep Supervision Fine-Tuning

Authors: Wenshuai Huo, Xiaocheng Feng, Yichong Huang, Chengpeng Fu, Baohang Li, Yangfan Ye, Zhirui Zhang, Dandan Tu, Duyu Tang, Yunfei Lu, Hui Wang, Bing Qin

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted extensive experiments on typical English-centric LLMs, LLaMA-2 and Gemma-2. The results on 8 multilingual datasets show that our method significantly outperforms traditional fine-tuning methods.
Researcher Affiliation Collaboration Harbin Institute of Technology; Pengcheng Laboratory; Huawei Technologies Co., Ltd.
Pseudocode No The paper describes methods in text and uses diagrams (Figure 1, Figure 2) to illustrate the overall process, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code No Our code implementation is based on Stanford Alpaca (https://github.com/tatsu-lab/stanford_alpaca).
Open Datasets Yes XQuAD (Artetxe, Ruder, and Yogatama 2019) is a high-quality cross-lingual question answering dataset; MLQA (Lewis et al. 2019) is a multilingual question answering dataset; the MKQA (Longpre, Lu, and Daiber 2021) dataset contains 2,600 common-sense question-answer pairs; TruthfulQA (Lin, Hilton, and Evans 2021) includes questions from various domains; XNLI (Conneau et al. 2018) is a widely used language understanding dataset; XCOPA (Ponti et al. 2020) is a benchmark designed to evaluate the ability of models to apply commonsense reasoning; XStoryCloze (Lin et al. 2022) is a cross-lingual dataset for evaluating models' ability to understand stories; MMLU (Hendrycks et al. 2020) is a large-scale multitask language understanding dataset. The training data consists of Stanford Alpaca instruction data (Taori et al. 2023).
Dataset Splits No The training data consists of Stanford Alpaca instruction data (Taori et al. 2023) and its translations in the target languages, which include Chinese (zh), Vietnamese (vi), and Arabic (ar). For the translated data, we directly used publicly available datasets from (Zhu et al. 2023b). For all evaluation datasets, we conducted tests using a zero-shot setting. The paper does not specify explicit train/validation/test splits for the instruction data used for fine-tuning.
Hardware Specification Yes All experiments were conducted on 8 A100 GPUs with a batch size of 128.
Software Dependencies No Our code implementation is based on Stanford Alpaca. To accelerate training, we utilized the FSDP training strategy (Zhao et al. 2023). The paper mentions basing their code on Stanford Alpaca and using FSDP, but does not provide specific version numbers for any software libraries or frameworks.
Experiment Setup Yes The models were trained for 3 epochs with a learning rate of 2e-5.
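The reported hyperparameters (3 epochs, learning rate 2e-5, global batch size 128 on 8 A100 GPUs with FSDP) can be arranged into a single configuration sketch. This is a minimal illustration, not the authors' code: the key names and the per-device micro-batch size are assumptions.

```python
# Illustrative sketch of the reported fine-tuning setup.
# Values marked "reported" come from the paper; the rest are assumptions.
NUM_GPUS = 8              # A100 GPUs (reported)
GLOBAL_BATCH_SIZE = 128   # effective batch size (reported)
PER_DEVICE_BATCH = 4      # assumed per-device micro-batch

# Gradient accumulation steps needed so that
# NUM_GPUS * PER_DEVICE_BATCH * accum_steps == GLOBAL_BATCH_SIZE
accum_steps = GLOBAL_BATCH_SIZE // (NUM_GPUS * PER_DEVICE_BATCH)

train_config = {
    "num_train_epochs": 3,                      # reported
    "learning_rate": 2e-5,                      # reported
    "per_device_batch_size": PER_DEVICE_BATCH,  # assumed split
    "gradient_accumulation_steps": accum_steps,
    "distributed_strategy": "FSDP",             # reported (Zhao et al. 2023)
}
print(train_config["gradient_accumulation_steps"])  # → 4
```

With a per-device micro-batch of 4, four accumulation steps across 8 GPUs reproduce the reported effective batch size of 128; a different micro-batch choice would simply change the accumulation factor.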