Boosting Multi-Domain Fine-Tuning of Large Language Models through Evolving Interactions between Samples
Authors: Xize Liang, Lin Yang, Jie Wang, Yiyang Lu, Runyu Wu, Hanzhu Chen, Jianye Hao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on a mixed dataset containing 182,166 samples covering the domains of mathematical reasoning, code generation, and general instruction following with Mistral-7B, Llama-3.1-8B, and Qwen2.5-14B. The evaluation results on GSM8K, HumanEval, and AlpacaEval 2.0 show that EVIC outperforms all baselines across diverse capabilities. |
| Researcher Affiliation | Collaboration | 1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 2Noah's Ark Lab, Huawei Technologies; 3College of Intelligence and Computing, Tianjin University. |
| Pseudocode | Yes | Algorithm 1: The EVIC framework. 1: Input: dataset D = {s^(i)}_{i=1}^N, number of iterations M, base model π_base. 2: Warm-Up and Initialization: ... |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their source code for the EVIC methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use three datasets, one for each domain: Code Alpaca (Chaudhary, 2023) for code generation, GSM8K-RFT (Cobbe et al., 2021; Yuan et al., 2023) for mathematical reasoning, and Alpaca-GPT4 (Peng et al., 2023) for general instruction following. For details of the datasets, see Appendix A.1. Code Alpaca (Chaudhary, 2023) is designed for training code generation models... GSM8K-RFT (Cobbe et al., 2021) is employed to evaluate mathematical reasoning... GPT4-Alpaca (Peng et al., 2023) is utilized for instruction following... |
| Dataset Splits | Yes | Table 1 gives per-domain statistics of the training datasets and test benchmarks. Code: Code Alpaca (20,022 train) / HumanEval (164 test); Math: GSM8K-RFT (110,142 train) / GSM8K-test (1,319 test); General: Alpaca-GPT4 (52,002 train) / AlpacaEval 2.0 (805 test). |
| Hardware Specification | Yes | We run all experiments on 8 NVIDIA A100 GPUs (80GB). |
| Software Dependencies | No | The paper mentions using the LLaMA-Factory framework (Zheng et al., 2024) and LoRA (Hu et al., 2021) but does not provide specific version numbers for these or other software components such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use a learning rate of 2e-5, the cosine learning rate scheduler, and a batch size of 128 for all methods. For LoRA, we use a rank of 128, α = 512, a dropout ratio of 0.1, and learn LoRA parameters for all attention matrices. For MTL, DMT, and MoS, we align their training process to three epochs... |
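The reported hyperparameters can be collected into a minimal sketch. This is an illustrative Python config, not the authors' code (none is released): the dict keys mirror common Hugging Face `peft`/`transformers` argument names, the `target_modules` names assume a LLaMA-style architecture, and the cosine schedule is the standard decay-to-zero variant.

```python
import math

# LoRA settings reported in the paper (rank 128, alpha 512, dropout 0.1,
# applied to all attention matrices). Module names are an assumption.
lora_config = {
    "r": 128,
    "lora_alpha": 512,
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Optimization settings reported in the paper; the per-device split across
# the 8 A100 GPUs is not specified, only the global batch size of 128.
train_config = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "global_batch_size": 128,
}

# Effective LoRA update scale is alpha / r.
scaling = lora_config["lora_alpha"] / lora_config["r"]  # 4.0

def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Standard cosine decay from peak_lr at step 0 to 0 at total_steps."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

Note the large alpha-to-rank ratio (512/128 = 4): LoRA updates are scaled up fourfold relative to the common alpha = r convention.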