Capability Instruction Tuning

Authors: Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments: This section begins by detailing the construction of training and test instructions in capability instruction tuning. It then presents different zoo setups for testing and concludes with an analysis of results and ablation studies. ... Table 1: A comprehensive performance evaluation covering smaller-scale, high-performance giant LLMs, and a mixed LLM zoo of small, medium, and large levels. Model-SAT performs instruction-level model selection, consistently maintaining efficient and precise results that outperform the optimal one in the LLM zoo.
Researcher Affiliation | Academia | Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye*; School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University
Pseudocode | No | The paper describes the Model-SAT framework and its components, and outlines the tuning recipe and deployment, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/Now-Join-Us/CIT-LLM-Routing.
Open Datasets | Yes | The model's capability representation is formed from 50 distinct tasks across various categories from the MMLU dataset, with each task being 20-shot. ... Datasets including MMLU (Hendrycks et al. 2021) (5-shot) and WinoGrande (Sakaguchi et al. 2020) (5-shot) ... On the other hand, datasets such as ARC-Challenge (Bhakthavatsalam et al. 2021) (25-shot), TruthfulQA (6-shot), and BoolQ (Clark et al. 2019) (1-shot), with MRPC (1-shot) and MNLI (1-shot) in the GLUE (Wang et al. 2019) benchmark ... We consider the evaluation datasets MMMU-VAL (Hendrycks et al. 2020), AI2D-TEST (Kembhavi et al. 2016), and CCBench (Liu et al. 2024) in the multimodal scenario.
Dataset Splits | No | The paper describes how core tasks are sampled and how positive and negative instructions are generated for training, but it does not specify explicit training/validation/test splits (e.g., percentages or exact sample counts) for the overall Model-SAT evaluation.
Hardware Specification | No | The paper discusses computational resources in general terms but does not specify the particular hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | The paper names the LLMs used as components (E5-Large, Phi-3-Mini) and notes that Model-SAT is built on the Wings training architecture, but it does not list general software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | We apply Homogeneous In-Batch Negative Sampling (Karpukhin et al. 2020; Zhang et al. 2023a) for each capability representation c_m with its well-performed and poorly-performed instructions to enhance discriminability during training. Typically, a k-shot training batch Z = {z_i}_{i=1}^{k} contains 1 positive instruction and k-1 negative ones. Loss Design: We denote the position of the positive instruction in the training batch Z as y_pos ... We employ the cross-entropy loss to optimize this in one batch Z ... In the second stage, we fine-tune all model parameters. We apply a larger learning rate on the encoder and connector to enhance capability alignment with instruction semantics.
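The in-batch negative-sampling objective quoted in the Experiment Setup row can be sketched as a cross-entropy loss over one k-shot batch: the capability representation c_m is scored against every instruction in the batch, and the loss is the negative log-softmax probability at the positive position y_pos. This is a minimal sketch under assumptions not stated in the excerpt (dot-product scoring and a temperature parameter), not the paper's exact implementation.

```python
import math

def in_batch_negative_loss(c_m, batch, y_pos, temperature=1.0):
    """Cross-entropy over one k-shot batch Z (1 positive instruction,
    k-1 in-batch negatives). Dot-product scoring and the temperature
    are assumptions of this sketch."""
    # score the capability representation c_m against every instruction z_i
    scores = [sum(c * z for c, z in zip(c_m, z_i)) / temperature
              for z_i in batch]
    # numerically stable log-sum-exp; loss = -log softmax(scores)[y_pos]
    m = max(scores)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denom - scores[y_pos]

# toy batch: the positive instruction (index 1) aligns with c_m,
# the two in-batch negatives do not
c_m = [1.0, 0.0]
batch = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
loss = in_batch_negative_loss(c_m, batch, y_pos=1)
print(round(loss, 3))  # ≈ 0.408
```

As expected, pointing y_pos at a negative instruction (e.g. index 0) yields a strictly larger loss, which is what drives the discriminative training described above.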
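The instruction-level model selection that Model-SAT performs at deployment (per the Research Type row above) can likewise be sketched as scoring an instruction embedding against each model's capability representation and routing to the best match. Cosine similarity and the model names below are assumptions of this illustration, not details from the paper.

```python
import math

def route_instruction(instr_emb, capability_reps):
    """Return the name of the model in the zoo whose capability
    representation best matches the instruction embedding.
    Cosine similarity is an assumed scoring choice for this sketch."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    return max(capability_reps,
               key=lambda name: cosine(instr_emb, capability_reps[name]))

# hypothetical two-model zoo with 2-d capability representations
zoo = {"small-llm": [1.0, 0.0], "large-llm": [0.0, 1.0]}
print(route_instruction([0.1, 0.9], zoo))  # selects "large-llm"
```

Because routing happens per instruction, the selected model can vary across a benchmark, which is how a router can exceed the accuracy of any single model in the zoo.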