Enhancing Non-English Capabilities of English-Centric Large Language Models Through Deep Supervision Fine-Tuning
Authors: Wenshuai Huo, Xiaocheng Feng, Yichong Huang, Chengpeng Fu, Baohang Li, Yangfan Ye, Zhirui Zhang, Dandan Tu, Duyu Tang, Yunfei Lu, Hui Wang, Bing Qin
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on typical English-centric LLMs, LLaMA-2 and Gemma-2. The results on 8 multilingual datasets show that our method significantly outperforms traditional fine-tuning methods. |
| Researcher Affiliation | Collaboration | 1 Harbin Institute of Technology; 2 Pengcheng Laboratory; 3 Huawei Technologies Co., Ltd |
| Pseudocode | No | The paper describes methods in text and uses diagrams (Figure 1, Figure 2) to illustrate the overall process, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | Our code implementation is based on Stanford Alpaca (https://github.com/tatsu-lab/stanford_alpaca). |
| Open Datasets | Yes | XQUAD (Artetxe, Ruder, and Yogatama 2019) is a high-quality cross-lingual question answering dataset; MLQA (Lewis et al. 2019) is a multilingual question answering dataset; the MKQA (Longpre, Lu, and Daiber 2021) dataset contains 2,600 common sense question-answer pairs; TruthfulQA (Lin, Hilton, and Evans 2021) includes questions from various domains; XNLI (Conneau et al. 2018) is a widely used language understanding dataset; XCOPA (Ponti et al. 2020) is a benchmark designed to evaluate the ability of models to apply commonsense reasoning; XStoryCloze (Lin et al. 2022) is a cross-lingual dataset for evaluating models' ability to understand stories; MMLU (Hendrycks et al. 2020) is a large-scale multitask language understanding dataset. The training data consists of Stanford Alpaca instruction data (Taori et al. 2023). |
| Dataset Splits | No | The training data consists of Stanford Alpaca instruction data (Taori et al. 2023) and its translations in the target languages, which include Chinese (zh), Vietnamese (vi), and Arabic (ar). For the translated data, we directly used publicly available datasets from (Zhu et al. 2023b). For all evaluation datasets, we conducted tests using a zero-shot setting. The paper does not specify explicit train/validation/test splits for the instruction data used for fine-tuning. |
| Hardware Specification | Yes | All experiments were conducted on 8 A100 GPUs with a batch size of 128. |
| Software Dependencies | No | Our code implementation is based on Stanford Alpaca. To accelerate training, we utilized the FSDP training strategy (Zhao et al. 2023). The paper mentions basing their code on Stanford Alpaca and using FSDP, but does not provide specific version numbers for any software libraries or frameworks. |
| Experiment Setup | Yes | The models were trained for 3 epochs with a learning rate of 2e-5. |
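Pulling together the hyperparameters quoted in the table (3 epochs, learning rate 2e-5, batch size 128 on 8 A100 GPUs, FSDP), a minimal sketch of the reported fine-tuning configuration could be written as follows. This is an illustrative reconstruction, not the authors' code; all field and class names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters as reported in the paper; names are illustrative."""
    base_models: tuple = ("LLaMA-2", "Gemma-2")   # English-centric base LLMs
    epochs: int = 3                               # reported training epochs
    learning_rate: float = 2e-5                   # reported learning rate
    global_batch_size: int = 128                  # reported batch size
    num_gpus: int = 8                             # 8 x A100, per the paper
    distributed_strategy: str = "FSDP"            # Zhao et al. 2023

    @property
    def per_gpu_batch_size(self) -> int:
        # 128 examples split evenly across 8 GPUs -> 16 per GPU per step
        return self.global_batch_size // self.num_gpus

cfg = FinetuneConfig()
print(cfg.per_gpu_batch_size)  # 16
```

The per-GPU batch size (16) is a derived quantity, assuming the reported batch size of 128 is a global batch split evenly across the 8 GPUs; the paper does not state whether gradient accumulation was used.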