Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation
Authors: Vaibhav Seth, Ayan Sengupta, Arinjay Pathak, Aastha A K Verma, Natraj Raman, Sriram Gopalakrishnan, Niladri Chatterjee, Tanmoy Chakraborty
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform thorough empirical analysis with five natural language understanding (NLU) and six natural language generation (NLG) tasks with three pre-trained LLMs RoBERTa-base (Liu et al., 2019), LLaMA-1-7B (Touvron et al., 2023), and LLaMA-3.2-3B-Instruct (Grattafiori et al., 2024a). Empirical results on NLU tasks suggest that MonteCLoRA is more stable, where the average spread of accuracy distribution is 10% lower than LoRA and 50% lower than full fine-tuning. |
| Researcher Affiliation | Collaboration | Ayan Sengupta, Indian Institute of Technology Delhi, India; Natraj Raman, JPMorgan AI Research |
| Pseudocode | Yes | Algorithm 1: MonteCLoRA Estimation of LoRA Parameters |
| Open Source Code | Yes | The source code of MonteCLoRA is made available at https://github.com/LCS2-IIITD/MonteCLoRA. |
| Open Datasets | Yes | For NLU, we use five tasks from the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks, namely MRPC (Dolan & Brockett, 2005), CoLA (Warstadt et al., 2019), RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), WiC (Pilehvar & Camacho-Collados, 2018) and BoolQ (Clark et al., 2019). ... On commonsense NLG, we consider six commonsense reasoning tasks PiQA (Bisk et al., 2020), SocialIQA (Sap et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy, ARC-challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018). ... For these tasks, we use the Commonsense15K dataset (Hu et al., 2023). ... For evaluation on mathematical reasoning tasks, we consider the GSM8k (Cobbe et al., 2021) dataset. For code generation, we construct a dataset from the publicly available Magicoder-OSS-Instruct-75K corpus (Wei et al., 2024). |
| Dataset Splits | Yes | The gold-standard datasets for these tasks contain separate train and dev (also called validation) splits, where we use the train split to fine-tune LLMs and the dev split for evaluation. For these tasks, we consider the intrinsic evaluation metric negative log-likelihood (NLL) and extrinsic evaluation metric accuracy. ... We use the official test split of GSM8k for math reasoning and the HumanEval benchmark (Chen et al., 2021), a widely-used dataset for functional code synthesis performance. |
| Hardware Specification | Yes | All our experiments were conducted on NVIDIA A100-80GB GPUs that had access to the CUDA 12.5 environment. |
| Software Dependencies | Yes | All our experiments were conducted on NVIDIA A100-80GB GPUs that had access to the CUDA 12.5 environment. |
| Experiment Setup | Yes | Table 3 contains the static and tunable hyperparameters used for all the fine-tuning methods with RoBERTa and LLaMA models. Static hyperparameters are used for all the model-specific training tasks and are the same for all the fine-tuning strategies. We tune the optimizer learning rate and the training batch size for the robustness studies for different strategies. Table 4 reports the hyperparameters specific to MonteCLoRA. |
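As context for the methods compared above, the following is a minimal sketch of the standard low-rank adaptation (LoRA) update that MonteCLoRA reparameterizes. It follows the well-known LoRA formulation W' = W0 + (alpha/r) * B @ A; all variable names and dimensions are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 4, 16

W0 = rng.normal(size=(d_out, d_in))    # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero-initialized)

def lora_forward(x):
    """Adapted layer output: W0 @ x plus the scaled low-rank correction."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
y = lora_forward(x)

# Because B starts at zero, the adapter contributes nothing before training,
# so the adapted output initially matches the frozen layer exactly.
assert np.allclose(y, W0 @ x)
```

Only A and B (r*(d_in + d_out) parameters) are trained, which is why LoRA-style methods are far cheaper than full fine-tuning; MonteCLoRA's contribution, per the paper, is a Bayesian reparameterization of these low-rank factors to stabilize fine-tuning.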