Boosting Multi-Domain Fine-Tuning of Large Language Models through Evolving Interactions between Samples
Authors: Xize Liang, Lin Yang, Jie Wang, Yiyang Lu, Runyu Wu, Hanzhu Chen, Jianye Hao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on a mixed dataset containing 182,166 samples covering the domains of mathematical reasoning, code generation, and general instruction following with Mistral-7B, Llama-3.1-8B, and Qwen2.5-14B. The evaluation results on GSM8K, HumanEval, and AlpacaEval 2.0 show that EVIC outperforms all baselines across diverse capabilities. |
| Researcher Affiliation | Collaboration | 1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 2Noah's Ark Lab, Huawei Technologies; 3College of Intelligence and Computing, Tianjin University. |
| Pseudocode | Yes | Algorithm 1: The EVIC framework. 1: Input: dataset D = {s^(i)}_{i=1}^N, number of iterations M, base model π_base. 2: Warm-Up and Initialization: ... |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their source code for the EVIC methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use three datasets, one for each domain: Code Alpaca (Chaudhary, 2023) for code generation, GSM8K-RFT (Cobbe et al., 2021; Yuan et al., 2023) for mathematical reasoning, and Alpaca-GPT4 (Peng et al., 2023) for general instruction following. For details of the datasets, see Appendix A.1. Code Alpaca (Chaudhary, 2023) is designed for training code generation models... GSM8K-RFT (Cobbe et al., 2021) is employed to evaluate mathematical reasoning... GPT4-Alpaca (Peng et al., 2023) is utilized for instruction following... |
| Dataset Splits | Yes | Table 1 gives per-domain statistics of the training datasets and test benchmarks. Code: Code Alpaca (20,022 train) / HumanEval (164 test); Math: GSM8K-RFT (110,142 train) / GSM8K-test (1,319 test); General: Alpaca-GPT4 (52,002 train) / AlpacaEval 2.0 (805 test). |
| Hardware Specification | Yes | We run all experiments on 8 NVIDIA A100 GPUs (80GB). |
| Software Dependencies | No | The paper mentions using the LLaMA-Factory framework (Zheng et al., 2024) and LoRA (Hu et al., 2021) but does not provide specific version numbers for these or other software components such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use a learning rate of 2e-5, the cosine learning rate scheduler, and a batch size of 128 for all methods. For LoRA, we use a rank of 128, α = 512, a dropout ratio of 0.1, and learn LoRA parameters for all attention matrices. For MTL, DMT, and MoS, we align their training process to three epochs... |
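The reported hyperparameters can be collected into a minimal sketch. This is an illustrative Python config, not the authors' code (none is released): the dict keys mirror common Hugging Face `peft`/`transformers` argument names, the `target_modules` names assume a LLaMA-style architecture, and the cosine schedule is the standard decay-to-zero variant.

```python
import math

# LoRA settings reported in the paper (rank 128, alpha 512, dropout 0.1,
# applied to all attention matrices). Module names are an assumption.
lora_config = {
    "r": 128,
    "lora_alpha": 512,
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Optimization settings reported in the paper; the per-device split across
# the 8 A100 GPUs is not specified, only the global batch size of 128.
train_config = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "global_batch_size": 128,
}

# Effective LoRA update scale is alpha / r.
scaling = lora_config["lora_alpha"] / lora_config["r"]  # 4.0

def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Standard cosine decay from peak_lr at step 0 to 0 at total_steps."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

Note the large alpha-to-rank ratio (512/128 = 4): LoRA updates are scaled up fourfold relative to the common alpha = r convention.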