LoRA-Gen: Specializing Large Language Model via Online LoRA Generation
Authors: Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge, Xiu Li, Ying Shan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to validate the effectiveness of LoRA-Gen on various commonsense reasoning tasks as well as an agent benchmark. The results demonstrate that our method balances both performance and efficiency, showing significant advantages across eight language datasets. For the edge-side model of TinyLLaMA-1.1B, LoRA-Gen outperforms vanilla LoRA fine-tuning by a remarkable margin with only 16% sequence length, +1.3% on the harmonic mean of accuracy, and a 2.1x speedup. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2ARC Lab, Tencent PCG 3The University of Hong Kong 4Xi'an Jiaotong University. Correspondence to: Xiu Li <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using text and mathematical equations (e.g., equations 1-8 in Section 3.1 and 3.2), but does not include a distinct pseudocode block or algorithm box. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Following (Dou et al., 2024; Li et al., 2024a), we select eight widely-used benchmarks to assess the reasoning ability of LoRA-Gen across various knowledge domains ranging from natural science to daily life. One classification task: BoolQ (Clark et al., 2019). Five question-answering tasks: ARC-c (Clark et al., 2018), ARC-e (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020) and SocialIQA (Sap et al., 2019). One sentence-completion task: HellaSwag (Zellers et al., 2019) and a fill-in-the-blank task: WinoGrande (Sakaguchi et al., 2020). We utilize GPT4Tools (Yang et al., 2024a), which provides a benchmark to evaluate the ability of LLMs to use tools... |
| Dataset Splits | Yes | We divide eight commonly used datasets into two parts: one as the multi-task learning set, including ARC-c, ARC-e, OpenBookQA, BoolQ, and SocialIQA, and the other as an unseen test set, including HellaSwag, WinoGrande, and PIQA. We randomly sample to construct multi-shot training data. ... Table 12 outlines the data scale (train/test examples) for each reasoning task: ARC-c 1120/1171, ARC-e 2250/2380, OBQA 4957/500, BoolQ 9427/3270, SIQA 33410/1954, HellaSwag 39905/10042, WinoGrande 9248/1267, PIQA 16100/1838. |
| Hardware Specification | Yes | All the latencies are measured on the same GPU with 40GB of memory. ... The models are trained with eight NPUs (64GB memory per device) by default. ... Latency is measured on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions the use of an optimizer (AdamW) and a project (lm-evaluation-harness) but does not provide specific version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | We deploy LLaMA3-8B (Grattafiori et al., 2024) as the cloud-side LM during online task-specific LoRA parameter generation. We finetune the q and v projection layers of the LLM with a LoRA adapter. The number of experts is 8 and we set K in the routing function TOP-K to 2 by default. The coefficient α for the auxiliary loss Lcv is set to 0.01. ... The models are trained with eight NPUs (64GB memory per device) by default. We set the betas and momentum of the AdamW optimizer to (0.9, 0.999) and 0.9, respectively. During training, we utilize a cosine scheduler with an initial learning rate of 2e-5 and weight decay of 0.1. The details are shown in Table 10: optimizer AdamW; learning rate 2e-5; warm-up steps 50; weight decay 0.1; optimizer momentum β1, β2 = 0.9, 0.999; batch size 64; epochs 4; max length 2048; LoRA attention dimension (r) 16; LoRA scaling alpha (α) 16; LoRA dropout 0.05. |
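The adapter configuration above (rank r = 16, scaling α = 16 on the q and v projections) follows the standard LoRA forward pass, W x + (α/r)·B A x. A minimal, dependency-free sketch under that assumption — the function name, matrix contents, and toy dimensions here are illustrative, not taken from the paper:

```python
def lora_forward(x, W, A, B, r, alpha):
    """Frozen projection W plus a low-rank LoRA update: W @ x + (alpha/r) * B @ A @ x.

    Shapes: W is (out, in), A is (r, in), B is (out, r).
    """
    base = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]
    ax = [sum(A[k][j] * x[j] for j in range(len(x))) for k in range(r)]
    delta = [(alpha / r) * sum(B[i][k] * ax[k] for k in range(r)) for i in range(len(B))]
    return [b + d for b, d in zip(base, delta)]

# Toy example with r = 2 (the paper uses r = 16). B is zero-initialized,
# so the adapter starts as a no-op, as in standard LoRA initialization.
out = lora_forward([1.0, 2.0],
                   W=[[1.0, 0.0], [0.0, 1.0]],
                   A=[[1.0, 1.0], [1.0, 1.0]],
                   B=[[0.0, 0.0], [0.0, 0.0]],
                   r=2, alpha=2)
```

With B zero-initialized the output equals the frozen projection; training then moves only A and B, leaving W untouched.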
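The quoted setup also fixes the expert routing to TOP-K with 8 experts and K = 2, but does not spell out the gating formula. A dependency-free sketch of one common formulation — keep the K largest gating logits and renormalize them with a softmax — where the function name `top_k_routing` and the renormalization choice are assumptions, not taken from the paper:

```python
import math

def top_k_routing(logits, k=2):
    """Select the top-k experts and softmax-renormalize their gate weights."""
    # Indices of the k largest gating logits (one logit per expert).
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (max-subtracted for stability).
    m = max(logits[i] for i in topk)
    exps = {i: math.exp(logits[i] - m) for i in topk}
    z = sum(exps.values())
    return {i: exps[i] / z for i in topk}

# Example: 8 experts, K = 2, matching the reported configuration.
gates = top_k_routing([0.1, 2.0, -0.5, 1.5, 0.0, 0.3, -1.0, 0.7], k=2)
```

The returned dict maps the two selected expert indices to gate weights that sum to 1; the remaining six experts receive zero weight and are skipped entirely.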