KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models

Authors: Fan Wang, Juyong Jiang, Chansung Park, Sunghun Kim, Jing Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments across a range of LLMs on tasks spanning natural language understanding (NLU), generation (NLG), instruction following, and commonsense reasoning. The experimental results demonstrate that KaSA consistently outperforms FFT and 14 popular PEFT baselines across 16 benchmarks and 4 synthetic datasets, underscoring our method's efficacy and adaptability."
Researcher Affiliation | Academia | The Hong Kong University of Science and Technology (Guangzhou); Electronics and Telecommunications Research Institute; The Hong Kong University of Science and Technology
Pseudocode | Yes | "We present the PyTorch-style pseudocode for KaSA and its training objective in Appendix A."
Open Source Code | Yes | "The source code of our method is available at https://github.com/juyongjiang/KaSA."
Open Datasets | Yes | "We conduct extensive experiments to fine-tune LLMs of varying sizes and architectures across various tasks, including natural language understanding (NLU), natural language generation (NLG), instruction following, and commonsense reasoning. ... on 16 benchmarks and 4 synthetic datasets ... We make all high-quality synthetic instruction-following datasets generated by GPT-4o publicly available (https://huggingface.co/llama-duo), enabling the community to enhance the functionality of PEFT and support future research endeavors. ... For NLU tasks, we evaluate KaSA with RoBERTa (Liu et al., 2021) and DeBERTaV3 (He et al., 2022b) on the GLUE (Wang et al., 2018) benchmark. ... For NLG tasks, we assess our method with GPT-2 (Radford et al., 2019) on the E2E NLG Challenge (Novikova et al., 2017) benchmark. We further assess instruction-following performance using well-known LLMs, including LLaMA3 8B (Meta, 2024), Mistral 7B (Jiang et al., 2023), Gemma 7B (Gemma Team, 2024), and LLaMA2 13B (Touvron et al., 2023b). ... Additionally, we fine-tune using the Alpaca dataset (Taori et al., 2023b) and report evaluation results on MT-Bench, with GPT-4 serving as the judge, yielding scores within 10. Additionally, we substantiate KaSA's generality by fine-tuning LLaMA2 7B and LLaMA3 8B models on the Commonsense170K dataset (Hu et al., 2023), which includes training sets from eight commonsense reasoning datasets, and evaluating them on the individual test sets of these constituent datasets."
Dataset Splits | Yes | "The GLUE benchmark encompasses a wide array of datasets designed to test various aspects of NLU, including question answering, natural language inference, sentiment analysis, and textual entailment. In this context, our evaluation is conducted across 6 datasets from GLUE: SST-2, MRPC, CoLA, QNLI, RTE, and STS-B. Detailed statistical information about the GLUE benchmark can be found in Appendix C.1. ... For natural language generation (NLG), we utilize the E2E (End-to-End) NLG Challenge dataset (Novikova et al., 2017), which is commonly used for the evaluation of natural language generation models. This dataset includes approximately 42k training samples, 4.6k validation samples, and 4.6k test samples from the restaurant domain. ... Table 5 presents the volume of data samples and token-level statistical information for these task-specific synthetic subsets. ... The Commonsense170K dataset (Hu et al., 2023) includes training sets from eight commonsense reasoning datasets; models are evaluated on the individual test sets of these constituent datasets."
Hardware Specification | Yes | "All experiments are conducted on NVIDIA A100-SXM4 (80GB) GPUs, except for the NLU experiments, which are conducted on NVIDIA GeForce RTX 3090 (24GB) GPUs."
Software Dependencies | No | The paper provides PyTorch-style pseudocode in Appendix A, indicating the use of PyTorch. However, it does not specify version numbers for PyTorch or any other software dependencies, which are required for reproducibility.
Experiment Setup | Yes | "Implementation Details. Basically, we follow the experimental setup applied in (Hu et al., 2021; Zhang et al., 2022) to ensure a fair comparison. We randomly initialize the knowledge-aware singular values without bias, which only introduces negligible r coefficients in each layer. For all evaluated datasets in GLUE, we meticulously tune the hyperparameters, including the learning rate lr ∈ [1E-5, 1E-3], the rank of SVD truncation k ∈ {1, 2, 4, 8, 16, 32, 64, 128}, and two trade-off loss coefficients β ∈ [1E-5, 1] and γ ∈ [1E-5, 1]. The results we present are the median outcomes from 5 runs, each conducted with a distinct random seed. To maintain comparable trainable parameter counts, we fine-tune the query and value weights in each Transformer block and set a rank r = 8 across all datasets. More detailed hyperparameters are presented in Appendix E.1. ... Detailed hyperparameter configurations are provided in Table 9. ... The specific hyperparameter configurations used are shown in Table 10."
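The paper's own PyTorch-style pseudocode lives in Appendix A and its code repository; as a hedged illustration only (not the authors' implementation; the class name `KaSALinearSketch`, the initialization scales, and the truncation rule are assumptions), a KaSA-style adapter that zeroes out the smallest singular values of a frozen base weight and learns a low-rank update with trainable singular values could be sketched as:

```python
import torch
import torch.nn as nn


class KaSALinearSketch(nn.Module):
    """Hypothetical sketch of a KaSA-style adapted linear layer.

    The frozen base weight has its k smallest singular values truncated
    (treated as task-irrelevant noise), and a low-rank update with
    trainable "knowledge-aware" singular values is added on top.
    """

    def __init__(self, weight: torch.Tensor, k: int = 8, r: int = 8):
        super().__init__()
        # SVD of the pretrained weight; zero the k smallest singular values.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        S = S.clone()
        S[-k:] = 0.0
        self.register_buffer("base", U @ torch.diag(S) @ Vh)  # frozen
        out_f, in_f = weight.shape
        # Low-rank trainable factors; dS starts at zero so the update is
        # initially inactive (LoRA-style zero initialization).
        self.dU = nn.Parameter(torch.randn(out_f, r) * 0.01)
        self.dV = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.dS = nn.Parameter(torch.zeros(r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.dU @ torch.diag(self.dS) @ self.dV
        return x @ (self.base + delta).T
```

At initialization the trainable singular values `dS` are zero, so the layer reproduces the truncated base weight exactly; only `dU`, `dV`, and `dS` receive gradients, which matches the paper's claim of adding just r extra coefficients per layer beyond the low-rank factors.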
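To make the quoted hyperparameter search ranges concrete, a minimal sketch of enumerating a grid over them (only the interval endpoints and the set of k values come from the paper; the intermediate grid points for lr, β, and γ are assumptions):

```python
import itertools

# Search space mirroring the ranges quoted in the Experiment Setup row.
# Interval endpoints are from the paper; intermediate points are assumed.
search_space = {
    "lr": [1e-5, 1e-4, 1e-3],                  # lr in [1E-5, 1E-3]
    "svd_rank_k": [1, 2, 4, 8, 16, 32, 64, 128],  # k values from the paper
    "beta": [1e-5, 1e-3, 1e-1, 1.0],           # beta in [1E-5, 1]
    "gamma": [1e-5, 1e-3, 1e-1, 1.0],          # gamma in [1E-5, 1]
}

# Cartesian product over all axes yields one config dict per combination.
configs = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
```

Each resulting config could then drive one fine-tuning run, with the median over 5 seeds reported per config as the paper describes.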