Capability Instruction Tuning

Authors: Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments: This section begins by detailing the construction of training and test instructions in capability instruction tuning. It then presents different zoo setups for testing and concludes with an analysis of results and ablation studies. ... Table 1: A comprehensive performance evaluation covering smaller-scale, high-performance giant LLMs, and a mixed LLM zoo of small, medium, and large levels. Model-SAT performs instruction-level model selection, consistently maintaining efficient and precise results that outperform the optimal one in the LLM zoo.
Researcher Affiliation | Academia | Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye*; School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University
Pseudocode | No | The paper describes the Model-SAT framework and its components, and outlines the tuning recipe and deployment, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/Now-Join-Us/CIT-LLM-Routing.
Open Datasets | Yes | The model's capability representation is formed from 50 distinct tasks across various categories from the MMLU dataset, with each task being 20-shot. ... Datasets including MMLU (Hendrycks et al. 2021) (5-shot) and WinoGrande (Sakaguchi et al. 2020) (5-shot) ... On the other hand, datasets such as ARC-Challenge (Bhakthavatsalam et al. 2021) (25-shot), TruthfulQA (6-shot), and BoolQ (Clark et al. 2019) (1-shot), with MRPC (1-shot) and MNLI (1-shot) in the GLUE (Wang et al. 2019) benchmark ... We consider the evaluation datasets MMMU-VAL (Hendrycks et al. 2020), AI2D-TEST (Kembhavi et al. 2016), and CCBench (Liu et al. 2024) in the multimodal scenario.
Dataset Splits | No | The paper describes how core tasks are sampled and how positive and negative instructions are generated for training, but it does not specify explicit training/validation/test splits (e.g., percentages or exact sample counts) for the overall Model-SAT evaluation.
Hardware Specification | No | The paper discusses computational resources in general terms but does not specify the particular hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | The paper names the LLMs used as components (E5-Large, Phi-3-Mini) and notes that Model-SAT is built on the Wings training architecture, but it does not list general software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | We apply Homogeneous In-Batch Negative Sampling (Karpukhin et al. 2020; Zhang et al. 2023a) for each capability representation c_m with its well-performed and poorly-performed instructions to enhance discriminability during training. Typically, a k-shot training batch Z = {z_i}_{i=1}^{k} contains 1 positive instruction and k-1 negative ones. Loss Design: We denote the position of the positive instruction in the training batch Z as y_pos ... We employ the cross-entropy loss to optimize this in one batch Z ... In the second stage, we fine-tune all model parameters. We apply a larger learning rate on the encoder and connector to enhance capability alignment with instruction semantics.
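The in-batch negative-sampling objective quoted in the Experiment Setup row can be sketched as a cross-entropy loss over one k-shot batch: the capability representation c_m is scored against every instruction in the batch, and the loss is the negative log-softmax probability at the positive position y_pos. This is a minimal sketch under assumptions not stated in the excerpt (dot-product scoring and a temperature parameter), not the paper's exact implementation.

```python
import math

def in_batch_negative_loss(c_m, batch, y_pos, temperature=1.0):
    """Cross-entropy over one k-shot batch Z (1 positive instruction,
    k-1 in-batch negatives). Dot-product scoring and the temperature
    are assumptions of this sketch."""
    # score the capability representation c_m against every instruction z_i
    scores = [sum(c * z for c, z in zip(c_m, z_i)) / temperature
              for z_i in batch]
    # numerically stable log-sum-exp; loss = -log softmax(scores)[y_pos]
    m = max(scores)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denom - scores[y_pos]

# toy batch: the positive instruction (index 1) aligns with c_m,
# the two in-batch negatives do not
c_m = [1.0, 0.0]
batch = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
loss = in_batch_negative_loss(c_m, batch, y_pos=1)
print(round(loss, 3))  # ≈ 0.408
```

As expected, pointing y_pos at a negative instruction (e.g. index 0) yields a strictly larger loss, which is what drives the discriminative training described above.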
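The instruction-level model selection that Model-SAT performs at deployment (per the Research Type row above) can likewise be sketched as scoring an instruction embedding against each model's capability representation and routing to the best match. Cosine similarity and the model names below are assumptions of this illustration, not details from the paper.

```python
import math

def route_instruction(instr_emb, capability_reps):
    """Return the name of the model in the zoo whose capability
    representation best matches the instruction embedding.
    Cosine similarity is an assumed scoring choice for this sketch."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    return max(capability_reps,
               key=lambda name: cosine(instr_emb, capability_reps[name]))

# hypothetical two-model zoo with 2-d capability representations
zoo = {"small-llm": [1.0, 0.0], "large-llm": [0.0, 1.0]}
print(route_instruction([0.1, 0.9], zoo))  # selects "large-llm"
```

Because routing happens per instruction, the selected model can vary across a benchmark, which is how a router can exceed the accuracy of any single model in the zoo.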