DP-MemArc: Differential Privacy Transfer Learning for Memory Efficient Language Models

Authors: Yanming Liu, Xinyue Peng, Yuwei Zhang, Xiaolan Ke, Songhang Deng, Jiannan Cao, Chen Ma, Mengchen Fu, Xuhong Zhang, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu Du

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have demonstrated that DP-MemArc effectively provides differential privacy-efficient fine-tuning across different task scenarios.
Researcher Affiliation | Academia | Zhejiang University; Southeast University; Harvard University; University of California, Los Angeles; Massachusetts Institute of Technology; Renmin University of China; The University of Tokyo; Tongji University
Pseudocode | No | The paper describes methods and designs with explanations and mathematical formulations, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an unambiguous statement about releasing code for the described methodology, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | We conduct experiments on five datasets. Four are from the GLUE benchmark (Wang et al. 2018) and cover different NLP tasks. MNLI: the Multi-Genre Natural Language Inference Corpus. QQP: the Quora Question Pairs dataset. QNLI: the Stanford Question Answering dataset. SST-2: the Stanford Sentiment Treebank dataset. We also select an NLG task, the E2E dataset (Dušek, Novikova, and Rieser 2019).
Dataset Splits | Yes | To standardize the training process, we partition each dataset as follows: each text classification dataset includes 50k samples for training, 1k samples for validation, and the remaining data for testing. The E2E dataset includes 42,061 samples for training and 4,672 samples for validation.
Hardware Specification | No | The paper mentions training in FP16 but does not specify any particular GPU models, CPU models, or other hardware components used for the experiments.
Software Dependencies | No | The paper mentions using "opacus DP (Yousefpour et al. 2021)", the "DP-Adam optimizer", and the "DP-SGD optimizer", but does not provide specific version numbers for these or other software libraries and dependencies.
Experiment Setup | Yes | We chose a learning rate of 5e-4 and used the DP-Adam optimizer as the default optimizer for the model, while the DP-SGD optimizer is employed for Prompt-DPSGD. For evaluation metrics, we utilize a profiler to track the model's training memory usage, evaluating the mean memory consumption during training. Default LoRA and Adapter ranks are set to r = 64. For text classification tasks, we compare accuracy. For generation tasks, we employed perplexity, BLEU (Papineni et al. 2002), and ROUGE-L (Lin 2004) as evaluation metrics to comprehensively assess generation quality. In our experiments, we conduct training with a batch size of 32 and a sequence length of 128 in FP16.
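The DP-SGD/DP-Adam optimizers cited in the rows above share one core mechanism: each example's gradient is clipped to a fixed norm, then Gaussian noise proportional to that norm is added to the batch average. A minimal NumPy sketch of one such update step (hyperparameter values here are illustrative defaults, not taken from the paper):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=5e-4, max_grad_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One DP-SGD update: clip each example's gradient to max_grad_norm,
    sum, add Gaussian noise with std noise_multiplier * max_grad_norm,
    then average over the batch and take a gradient step."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # scale down (never up) so the per-example norm is at most max_grad_norm
        clipped.append(g * min(1.0, max_grad_norm / (norm + 1e-12)))
    batch = len(clipped)
    noise = rng.normal(0.0, noise_multiplier * max_grad_norm, size=params.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / batch
    return params - lr * noisy_mean

# usage: two per-example gradients, one exceeding the clip norm
params = np.zeros(4)
grads = [np.ones(4) * 3.0, np.ones(4) * 0.5]
new_params = dp_sgd_step(params, grads)
```

In a real run this clipping and noising is handled by a library such as Opacus (which the paper cites); the sketch only shows the arithmetic of a single step.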
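The setup row also fixes the LoRA and Adapter rank at r = 64. For readers unfamiliar with the technique, a LoRA layer keeps the pretrained weight frozen and trains only a low-rank update, which is what makes the fine-tuning memory-efficient. A minimal NumPy sketch (dimensions, init scale, and the alpha scaling convention are illustrative assumptions, not details from the paper):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A of rank r,
    scaled by alpha / r. Only A and B would receive gradients during training."""
    def __init__(self, in_dim, out_dim, r=64, alpha=64, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W = rng.normal(0.0, 0.02, (out_dim, in_dim))  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.02, (r, in_dim))        # trainable down-projection
        self.B = np.zeros((out_dim, r))                    # trainable up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # base projection plus the scaled low-rank correction
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)
```

Because B is zero-initialized, the layer reproduces the frozen model exactly at the start of fine-tuning; only the small A and B matrices (rank 64 here) add trainable parameters.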