Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Language Imbalance Driven Rewarding for Multilingual Self-improving
Authors: Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. |
| Researcher Affiliation | Academia | 1 School of Artificial Intelligence, University of Chinese Academy of Sciences; 2 Institute of Automation, Chinese Academy of Sciences; 3 Wuhan AI Research; 4 Shanghai Artificial Intelligence Laboratory, Shanghai, China |
| Pseudocode | No | The paper only describes steps in regular paragraph text without structured formatting. There are mathematical formulas and diagrams but no pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/ZNLP/Language-Imbalance-Driven-Rewarding |
| Open Datasets | Yes | The Alpagasus dataset (Chen et al., 2023a) includes 9,000 high-quality instruction-following examples filtered from the original 52,000 in the Alpaca dataset (Taori et al., 2023). We sample 1,000 prompts from the Alpagasus dataset and translate them into other languages using the Google Translate API to obtain multilingual prompts. The arithmetic reasoning task is conducted on Llama-3-8B-Instruct, starting with multilingual GSM8K (Cobbe et al., 2021) prompts. Performance was measured using the MGSM benchmark (Shi et al., 2022), which consists of 250 manually translated GSM8K problems in ten languages. Multilingual versions of the MMLU (Hendrycks et al., 2020), HellaSwag (Zellers et al., 2019), ARC Challenge (Clark et al., 2018), and TruthfulQA (Lin et al., 2021) benchmarks are also used. |
| Dataset Splits | No | We sample 1,000 prompts from the Alpagasus dataset and translate them into other languages using the Google Translate API to obtain multilingual prompts. Head-to-head performance is evaluated between the base model and the iterative models using GPT-4 Turbo as an evaluator (Liu et al., 2023) over 805 test prompts in X-AlpacaEval (Zhang et al., 2023). We utilize the instructions from the 7,473 training examples and translate them into multiple languages using the Google Translate API to construct the multilingual GSM8K instructions. The paper does not provide specific train/validation/test splits for the data used in their iterative DPO training. |
| Hardware Specification | Yes | All experiments were conducted on Ubuntu 22.04 equipped with 8 NVIDIA A100 GPUs. |
| Software Dependencies | Yes | Our code mainly depends on Python 3.10 and PyTorch 2.3.0. |
| Experiment Setup | Yes | All models are optimized using AdamW (Kingma & Ba, 2014), with a cosine learning rate scheduler that includes a warm-up phase constituting 3% of the total training duration. DPO+NLL runs are trained with KL-penalty β = 0.1. The coefficient α is set to 1 for all experiments in the paper. The details of hyperparameters are shown in Table 16 (LR = learning rate, BS = batch size): Language Imbalance Driven Rewarding, general instruction-following task: LR 5e-7, BS 16, 1 epoch; Language Imbalance Driven Rewarding, arithmetic reasoning task: LR 5e-6, BS 64, 1 epoch; multilingual alignment, all tasks: LR 1e-5, BS 128, 3 epochs. |
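The DPO+NLL objective referenced in the setup row (KL-penalty β = 0.1, NLL coefficient α = 1) can be sketched per-example in plain Python. This is a minimal illustration assuming summed sequence log-probabilities as inputs; the helper name `dpo_nll_loss` and its signature are hypothetical, not the authors' implementation:

```python
import math

def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, alpha=1.0, n_tokens_w=1):
    """Per-example DPO loss plus an NLL term on the chosen response.

    Inputs are summed log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model.
    Hypothetical helper for illustration only.
    """
    # Implicit reward margin relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Standard DPO term: -log sigmoid(margin), written stably.
    dpo_term = math.log(1.0 + math.exp(-margin))
    # NLL regularizer on the chosen response, normalized per token.
    nll_term = -logp_w / n_tokens_w
    return dpo_term + alpha * nll_term

# Example with illustrative (made-up) log-prob values:
loss = dpo_nll_loss(-10.0, -20.0, -12.0, -18.0, n_tokens_w=10)
```

A larger policy/reference margin in favor of the chosen response shrinks the DPO term, while the α-weighted NLL term keeps the chosen response's likelihood from degrading during preference optimization.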