XCOT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning

Authors: Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xinnian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, Zhoujun Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the superior performance of XCOT in reducing the gap among different languages. Extensive experiments evaluate XCOT on the multilingual benchmarks MGSM (11 languages) and MSVAMP (10 languages).
Researcher Affiliation | Academia | (1) The State Key Laboratory of Complex & Critical Software Environment, Beihang University; (2) Cyberspace Institute of Advanced Technology, Guangzhou University; (3) Beijing Information Science and Technology University; (4) School of New Media and Communication, Tianjin University.
Pseudocode | Yes | Algorithm 1: Random Online CoT.
Open Source Code | No | The paper contains no explicit statement about releasing source code for the described method, nor does it provide a link to a code repository.
Open Datasets | Yes | To comprehensively assess the cross-lingual proficiency of XCOT, the method is evaluated on the MGSM benchmark (Shi et al. 2023), which extends the English GSM8K dataset (Cobbe et al. 2021) into ten typologically varied languages through manual translation of the problems. The method is also evaluated on MSVAMP (Chen et al. 2023), which originates from the SVAMP dataset (Patel, Bhattamishra, and Goyal 2021).
Dataset Splits | Yes | The authors create a new multilingual instruction dataset (XCOT-INSTRUCT) for cross-lingual chain-of-thought reasoning, which serves as the training corpus for multilingual benchmarks such as MGSM (Shi et al. 2023) and MSVAMP (Chen et al. 2023). The instruction dataset for each language contains 7.4K samples, and about 22K in-context demonstration samples are constructed per language.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) for its experiments; it mentions only the base models Llama-2-7B and Bloom-7b1.
Software Dependencies | No | The paper mentions building on "Llama-2-7B, Llama-2-13B, and Bloom-7b1" but does not specify versions of ancillary software such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | The models are finetuned for 3 epochs with a cosine scheduler, a learning rate of 2e-5, and a 3% warmup. For cross-lingual distillation, the weight β of the distillation loss is set to 0.3.
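The paper's Algorithm 1 ("Random Online CoT") is not reproduced in this review. As a loose illustration of the general idea the name suggests, and nothing more, the sketch below randomly samples the language in which the chain-of-thought is written for each training example. All function names are our assumptions, and the language list is MGSM's eleven languages:

```python
import random

# The 11 MGSM languages (English plus ten manually translated languages)
LANGUAGES = ["en", "es", "fr", "de", "ru", "zh", "ja", "th", "sw", "bn", "te"]

def sample_cot_language(rng=random):
    """Pick the language in which the chain-of-thought is written for one example."""
    return rng.choice(LANGUAGES)

def build_training_example(question, cots_by_lang, rng=random):
    """Pair a question with a CoT in a randomly sampled language.

    Illustrative only: this is a plausible reading of "Random Online CoT"
    sampling, not the paper's actual Algorithm 1.
    """
    lang = sample_cot_language(rng)
    return {"question": question, "cot_language": lang, "cot": cots_by_lang[lang]}
```

With a seeded `random.Random`, the sampling is reproducible, which is convenient when rebuilding a training corpus deterministically.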
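The reported setup (3 epochs, cosine scheduler, learning rate 2e-5, 3% warmup, distillation weight β = 0.3) can be sketched as plain Python. This is a minimal illustration of a cosine schedule with linear warmup and a cross-entropy loss plus β-weighted KL distillation term; it is our reading of the paper's description, not the authors' code, and all function names are assumptions:

```python
import math

# Hyperparameters as reported in the paper (variable names are ours)
EPOCHS = 3
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.03   # "3% warm up"
BETA = 0.3            # weight of the distillation loss

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target_idx):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target_idx])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def xcot_loss(student_logits, teacher_logits, target_idx, beta=BETA):
    """Task loss plus a beta-weighted distillation term (a plausible reading
    of the paper's cross-lingual distillation, not its exact formulation)."""
    ce = cross_entropy(student_logits, target_idx)
    kd = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return ce + beta * kd

def lr_at_step(step, total_steps, lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Linear warmup for the first warmup_ratio of steps, then cosine decay."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * lr * (1 + math.cos(math.pi * progress))
```

When teacher and student agree, the KL term vanishes and the loss reduces to plain cross-entropy; the schedule rises linearly to the peak learning rate over the first 3% of steps and then decays to zero by the final step.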