KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models

Authors: Fan Wang, Juyong Jiang, Chansung Park, Sunghun Kim, Jing Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments across a range of LLMs on tasks spanning natural language understanding (NLU), generation (NLG), instruction following, and commonsense reasoning. The experimental results demonstrate that KaSA consistently outperforms FFT and 14 popular PEFT baselines across 16 benchmarks and 4 synthetic datasets, underscoring our method's efficacy and adaptability."
Researcher Affiliation | Academia | The Hong Kong University of Science and Technology (Guangzhou); Electronics and Telecommunications Research Institute; The Hong Kong University of Science and Technology
Pseudocode | Yes | "We present the PyTorch-style pseudocode for KaSA and its training objective in Appendix A."
Open Source Code | Yes | "The source code of our method is available at https://github.com/juyongjiang/KaSA."
Open Datasets | Yes | "We conduct extensive experiments to fine-tune LLMs of varying sizes and architectures across various tasks, including natural language understanding (NLU), natural language generation (NLG), instruction following, and commonsense reasoning. ... on 16 benchmarks and 4 synthetic datasets ... We make all high-quality synthetic instruction-following datasets generated by GPT-4o publicly available (https://huggingface.co/llama-duo), enabling the community to enhance the functionality of PEFT and support future research endeavors. ... For NLU tasks, we evaluate KaSA with RoBERTa (Liu et al., 2021) and DeBERTaV3 (He et al., 2022b) on the GLUE (Wang et al., 2018) benchmark. ... For NLG tasks, we assess our method with GPT-2 (Radford et al., 2019) on the E2E NLG Challenge (Novikova et al., 2017) benchmark. We further assess instruction-following performance using well-known LLMs, including LLaMA3 8B (Meta, 2024), Mistral 7B (Jiang et al., 2023), Gemma 7B (Gemma Team, 2024), and LLaMA2 13B (Touvron et al., 2023b). ... Additionally, we fine-tune using the Alpaca dataset (Taori et al., 2023b) and report evaluation results on MT-Bench, with GPT-4 serving as the judge, yielding scores within 10. Additionally, we substantiate KaSA's generality by fine-tuning LLaMA2 7B and LLaMA3 8B models on the Commonsense170K dataset (Hu et al., 2023), which includes training sets from eight commonsense reasoning datasets, and evaluating them on the individual test sets of these constituent datasets."
Dataset Splits | Yes | "The GLUE benchmark encompasses a wide array of datasets designed to test various aspects of NLU, including question answering, natural language inference, sentiment analysis, and textual entailment. In this context, our evaluation is conducted across 6 datasets from GLUE: SST-2, MRPC, CoLA, QNLI, RTE, and STS-B. Detailed statistical information about the GLUE benchmark can be found in Appendix C.1. ... For natural language generation (NLG), we utilize the E2E (End-to-End) NLG Challenge dataset (Novikova et al., 2017), which is commonly used for the evaluation of natural language generation models. This dataset includes approximately 42k training samples, 4.6k validation samples, and 4.6k test samples from the restaurant domain. ... Table 5 presents the volume of data samples and token-level statistical information for these task-specific synthetic subsets. ... The Commonsense170K dataset (Hu et al., 2023) includes training sets from eight commonsense reasoning datasets; models are evaluated on the individual test sets of these constituent datasets."
Hardware Specification | Yes | "All experiments are conducted on NVIDIA A100-SXM4 (80GB) GPUs, except for the NLU experiments, which are conducted on NVIDIA GeForce RTX 3090 (24GB) GPUs."
Software Dependencies | No | The paper provides PyTorch-style pseudocode in Appendix A, indicating the use of PyTorch. However, it does not specify version numbers for PyTorch or any other software dependencies, which are required for reproducibility.
Experiment Setup | Yes | "Implementation Details. Basically, we follow the experimental setup applied in (Hu et al., 2021; Zhang et al., 2022) to ensure a fair comparison. We randomly initialize the knowledge-aware singular values without bias, which only introduces negligible r coefficients in each layer. For all evaluated datasets in GLUE, we meticulously tune the hyperparameters, including the learning rate lr ∈ [1E-5, 1E-3], the rank of SVD truncation k ∈ {1, 2, 4, 8, 16, 32, 64, 128}, and two trade-off loss coefficients β ∈ [1E-5, 1] and γ ∈ [1E-5, 1]. The results we present are the median outcomes from 5 runs, each conducted with a distinct random seed. To maintain comparable trainable parameter counts, we fine-tune the query and value weights in each Transformer block and set a rank r = 8 across all datasets. More detailed hyperparameters are presented in Appendix E.1. ... Detailed hyperparameter configurations are provided in Table 9. ... The specific hyperparameter configurations used are shown in Table 10."
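The paper's own PyTorch-style pseudocode lives in Appendix A and its code repository; as a hedged illustration only (not the authors' implementation; the class name `KaSALinearSketch`, the initialization scales, and the truncation rule are assumptions), a KaSA-style adapter that zeroes out the smallest singular values of a frozen base weight and learns a low-rank update with trainable singular values could be sketched as:

```python
import torch
import torch.nn as nn


class KaSALinearSketch(nn.Module):
    """Hypothetical sketch of a KaSA-style adapted linear layer.

    The frozen base weight has its k smallest singular values truncated
    (treated as task-irrelevant noise), and a low-rank update with
    trainable "knowledge-aware" singular values is added on top.
    """

    def __init__(self, weight: torch.Tensor, k: int = 8, r: int = 8):
        super().__init__()
        # SVD of the pretrained weight; zero the k smallest singular values.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        S = S.clone()
        S[-k:] = 0.0
        self.register_buffer("base", U @ torch.diag(S) @ Vh)  # frozen
        out_f, in_f = weight.shape
        # Low-rank trainable factors; dS starts at zero so the update is
        # initially inactive (LoRA-style zero initialization).
        self.dU = nn.Parameter(torch.randn(out_f, r) * 0.01)
        self.dV = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.dS = nn.Parameter(torch.zeros(r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.dU @ torch.diag(self.dS) @ self.dV
        return x @ (self.base + delta).T
```

At initialization the trainable singular values `dS` are zero, so the layer reproduces the truncated base weight exactly; only `dU`, `dV`, and `dS` receive gradients, which matches the paper's claim of adding just r extra coefficients per layer beyond the low-rank factors.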
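To make the quoted hyperparameter search ranges concrete, a minimal sketch of enumerating a grid over them (only the interval endpoints and the set of k values come from the paper; the intermediate grid points for lr, β, and γ are assumptions):

```python
import itertools

# Search space mirroring the ranges quoted in the Experiment Setup row.
# Interval endpoints are from the paper; intermediate points are assumed.
search_space = {
    "lr": [1e-5, 1e-4, 1e-3],                  # lr in [1E-5, 1E-3]
    "svd_rank_k": [1, 2, 4, 8, 16, 32, 64, 128],  # k values from the paper
    "beta": [1e-5, 1e-3, 1e-1, 1.0],           # beta in [1E-5, 1]
    "gamma": [1e-5, 1e-3, 1e-1, 1.0],          # gamma in [1E-5, 1]
}

# Cartesian product over all axes yields one config dict per combination.
configs = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
```

Each resulting config could then drive one fine-tuning run, with the median over 5 seeds reported per config as the paper describes.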