BiDoRA: Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation

Authors: Peijia Qin, Ruiyi Zhang, Pengtao Xie

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation of BiDoRA on diverse tasks spanning natural language understanding, generation, token classification, and extremely small biomedical datasets reveals that it consistently outperforms DoRA and a wide range of leading PEFT methods. This improvement is statistically significant: on the GLUE benchmark, BiDoRA surpasses DoRA with a p-value of 2.4 × 10⁻⁴ under the Wilcoxon signed-rank test. Extensive experiments on various downstream tasks highlight the superior performance of BiDoRA.
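The significance claim above rests on a Wilcoxon signed-rank test over paired per-task scores. As an illustration of how such a test is computed, the exact one-sided variant can be written in pure Python; the scores below are hypothetical placeholders, not the paper's numbers.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact one-sided Wilcoxon signed-rank test (H1: x tends to exceed y).

    Suitable for small samples such as a handful of benchmark tasks.
    Zero differences are dropped; tied |differences| get average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    abs_vals = sorted(abs(d) for d in diffs)

    def rank(v):
        # Average rank over tied absolute values (1-based ranks).
        idx = [i + 1 for i, a in enumerate(abs_vals) if a == v]
        return sum(idx) / len(idx)

    ranks = [rank(abs(d)) for d in diffs]
    # Test statistic: sum of ranks of the positive differences.
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Exact null distribution: every sign pattern is equally likely.
    count = sum(
        1
        for signs in product([0, 1], repeat=n)
        if sum(r for s, r in zip(signs, ranks) if s) >= w_plus
    )
    return w_plus, count / 2 ** n

# Hypothetical per-task scores for two methods (8 paired tasks):
method_a = [90.1, 94.2, 89.5, 92.3, 63.1, 91.8, 79.4, 85.0]
method_b = [89.6, 93.8, 89.1, 91.9, 62.4, 91.2, 78.8, 84.3]
w, p = wilcoxon_signed_rank(method_a, method_b)
print(w, p)  # all 8 differences positive -> p = 1 / 2**8 = 0.00390625
```

With only 8 paired observations, even a uniformly positive difference yields p = 1/256 ≈ 0.004, which is why the test is a sensible choice for comparing methods across a small task suite.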
Researcher Affiliation | Academia | Peijia Qin (EMAIL), University of California, San Diego; Ruiyi Zhang (EMAIL), University of California, San Diego; Pengtao Xie (EMAIL), University of California, San Diego
Pseudocode | Yes | Algorithm 1: BiDoRA. Input: training dataset Dtr and validation dataset Dval
Open Source Code | Yes | The code for BiDoRA is available at https://github.com/t2ance/BiDoRA.
Open Datasets | Yes | The GLUE Benchmark (Wang et al., 2019) comprises a diverse array of tasks that are widely employed for evaluation in natural language understanding. The Reuters-21578 (Padmanabhan et al., 2016) dataset is one of the most widely used data collections for text categorization research. In our experiments on natural language generation, we use the E2E (Novikova et al., 2017) dataset. For token classification, we fine-tune the RoBERTa-base and RoBERTa-large models on the BioNLP dataset (Collier et al., 2004) and the CoNLL2003 dataset (Tjong Kim Sang, 2002). We fine-tune the ESM model using the ProteinAligner checkpoint (Zhang et al., 2024a) on two classification tasks, thermostability prediction (Chen et al. (2023)...) and blood-brain barrier peptide prediction (BBP, Dai et al. (2021)...), and one regression task, minimum inhibitory concentration prediction (MIC, Ledesma-Fernandez et al. (2023)...)
Dataset Splits | Yes | We create the validation set for upper-level optimization by splitting the original training set with an 8:2 ratio for all tasks. Detailed descriptions of these baseline methods are provided in Appendix C. The GLUE Benchmark (Wang et al., 2019)... We summarize the statistics for all datasets within the GLUE Benchmark in Table 6. Following existing practice, the development set is used as the test data in GLUE, since the actual test set is not publicly available.
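The 8:2 split of the original training set described above can be reproduced with a seeded shuffle. A minimal sketch (function name and seed are illustrative, not taken from the authors' code):

```python
import random

def split_train_val(examples, val_ratio=0.2, seed=0):
    """Shuffle indices deterministically, then carve off a validation slice.

    The first (1 - val_ratio) of the shuffled order becomes the new
    training set (lower level); the rest becomes the validation set
    used for upper-level optimization.
    """
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)  # seeded -> reproducible split
    n_val = int(len(idx) * val_ratio)
    val = [examples[i] for i in idx[:n_val]]
    train = [examples[i] for i in idx[n_val:]]
    return train, val

train, val = split_train_val(list(range(1000)))
print(len(train), len(val))  # 800 200
```

Seeding the shuffle matters for reproducibility: without a fixed seed, each run would optimize the upper level against a different validation subset.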
Hardware Specification | Yes | For a fair comparison, all methods were benchmarked on a single NVIDIA A100 GPU.
Software Dependencies | No | Our implementation is based on the Huggingface Transformers library (Wolf et al., 2019) and the Betty library (Choe et al., 2023b).
Experiment Setup | Yes | Table 7: The hyperparameters used for RoBERTa on the GLUE benchmark (Wang et al., 2019), the Reuters-21578 dataset (Padmanabhan et al., 2016), the BioNLP dataset (Collier et al., 2004), and the CoNLL2003 dataset (Tjong Kim Sang, 2002).