XCOT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning

Authors: Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xinnian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, Zhoujun Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the superior performance of XCOT in reducing the gap among different languages. Extensive experiments evaluate XCOT on the multilingual benchmarks MGSM (11 languages) and MSVAMP (10 languages).
Researcher Affiliation | Academia | (1) The State Key Laboratory of Complex & Critical Software Environment, Beihang University; (2) Cyberspace Institute of Advanced Technology, Guangzhou University; (3) Beijing Information Science and Technology University; (4) School of New Media and Communication, Tianjin University.
Pseudocode | Yes | Algorithm 1: Random Online CoT.
Open Source Code | No | The paper contains no explicit statement about releasing source code for the described method, nor does it provide a link to a code repository.
Open Datasets | Yes | To comprehensively assess the cross-lingual proficiency of XCOT, the method is evaluated on the MGSM benchmark (Shi et al. 2023), which extends the English GSM8K dataset (Cobbe et al. 2021) into ten typologically varied languages through manual translation of the problems. The method is also evaluated on MSVAMP (Chen et al. 2023), which originates from the SVAMP dataset (Patel, Bhattamishra, and Goyal 2021).
Dataset Splits | Yes | The authors create a new multilingual instruction dataset (XCOT-INSTRUCT) for cross-lingual chain-of-thought reasoning, which serves as the training corpus for multilingual benchmarks such as MGSM (Shi et al. 2023) and MSVAMP (Chen et al. 2023). The instruction dataset for each language contains 7.4K samples, and about 22K in-context demonstration samples are constructed per language.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) for its experiments; it mentions only the base models Llama-2-7B and Bloom-7b1.
Software Dependencies | No | The paper mentions building on "Llama-2-7B, Llama-2-13B, and Bloom-7b1" but does not specify versions of ancillary software such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | The models are finetuned for 3 epochs with a cosine scheduler, a learning rate of 2e-5, and a 3% warmup. For cross-lingual distillation, the weight β of the distillation loss is set to 0.3.
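The paper's Algorithm 1 ("Random Online CoT") is not reproduced in this review. As a loose illustration of the general idea the name suggests, and nothing more, the sketch below randomly samples the language in which the chain-of-thought is written for each training example. All function names are our assumptions, and the language list is MGSM's eleven languages:

```python
import random

# The 11 MGSM languages (English plus ten manually translated languages)
LANGUAGES = ["en", "es", "fr", "de", "ru", "zh", "ja", "th", "sw", "bn", "te"]

def sample_cot_language(rng=random):
    """Pick the language in which the chain-of-thought is written for one example."""
    return rng.choice(LANGUAGES)

def build_training_example(question, cots_by_lang, rng=random):
    """Pair a question with a CoT in a randomly sampled language.

    Illustrative only: this is a plausible reading of "Random Online CoT"
    sampling, not the paper's actual Algorithm 1.
    """
    lang = sample_cot_language(rng)
    return {"question": question, "cot_language": lang, "cot": cots_by_lang[lang]}
```

With a seeded `random.Random`, the sampling is reproducible, which is convenient when rebuilding a training corpus deterministically.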
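The reported setup (3 epochs, cosine scheduler, learning rate 2e-5, 3% warmup, distillation weight β = 0.3) can be sketched as plain Python. This is a minimal illustration of a cosine schedule with linear warmup and a cross-entropy loss plus β-weighted KL distillation term; it is our reading of the paper's description, not the authors' code, and all function names are assumptions:

```python
import math

# Hyperparameters as reported in the paper (variable names are ours)
EPOCHS = 3
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.03   # "3% warm up"
BETA = 0.3            # weight of the distillation loss

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target_idx):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target_idx])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def xcot_loss(student_logits, teacher_logits, target_idx, beta=BETA):
    """Task loss plus a beta-weighted distillation term (a plausible reading
    of the paper's cross-lingual distillation, not its exact formulation)."""
    ce = cross_entropy(student_logits, target_idx)
    kd = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return ce + beta * kd

def lr_at_step(step, total_steps, lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Linear warmup for the first warmup_ratio of steps, then cosine decay."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * lr * (1 + math.cos(math.pi * progress))
```

When teacher and student agree, the KL term vanishes and the loss reduces to plain cross-entropy; the schedule rises linearly to the peak learning rate over the first 3% of steps and then decays to zero by the final step.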