SAP: Privacy-Preserving Fine-Tuning on Language Models with Split-and-Privatize Framework

Authors: Xicong Shen, Yang Liu, Yi Liu, Peiran Wang, Huiqi Liu, Jue Hong, Bing Duan, Zirui Huang, Yunlong Mao, Ye Wu, Sheng Zhong

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "the proposed framework is comprehensively evaluated, demonstrating a 65% improvement in empirical privacy with only a 1% degradation in model performance on the Stanford Sentiment Treebank dataset, outperforming existing state-of-the-art baselines." (Section 5, Experimental Results)
Researcher Affiliation | Collaboration | Xicong Shen (1), Yang Liu (1), Yi Liu (2), Peiran Wang (3), Huiqi Liu (1), Jue Hong (1), Bing Duan (1), Zirui Huang (4), Yunlong Mao (4), Ye Wu (1), Sheng Zhong (4); (1) Bytedance, (2) City University of Hong Kong, (3) Tsinghua University, (4) Nanjing University
Pseudocode | No | The paper describes the methods in narrative text and uses a system diagram (Figure 1) to illustrate the workflow, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper cites https://huggingface.co/ as the source of pre-trained models, but it provides no statement or link indicating that code for the proposed framework has been open-sourced.
Open Datasets | Yes | "For the classification task, we use the classic sentiment analysis datasets, i.e., Financial Phrasebank (FP) [Malo et al., 2014] and Stanford Sentiment Treebank (SST) [Wang et al., 2019], and the topic classification dataset, i.e., Blog [Lyu et al., 2020]. For the text generation task, we use the question-answering dataset SQuAD [Rajpurkar et al., 2016]."
Dataset Splits | No | The paper lists the benchmark datasets (Financial Phrasebank, Stanford Sentiment Treebank, Blog, SQuAD) but gives no percentages, sample counts, or explicit methodology for splitting them into training, validation, and test sets, nor does it cite predefined splits for these datasets.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, or cloud computing instances) used to run the experiments.
Software Dependencies | No | "Our experiments are implemented based on the Transformers library and the PEFT library of Huggingface. Specifically, the LoRA method is adopted to fine-tune the PLM, and the AdamW optimizer with a linear learning rate scheduler is used during fine-tuning, where the initial learning rate is set to 3e-4." The paper names these software components but gives no version numbers, which limits reproducibility.
Experiment Setup | Yes | "Our experiments are implemented based on the Transformers library and the PEFT library of Huggingface. Specifically, the LoRA method is adopted to fine-tune the PLM, and the AdamW optimizer with a linear learning rate scheduler is used during fine-tuning, where the initial learning rate is set to 3e-4. Empirically, the constant c0 in Eq. (11) is set to (max(UIm) + min(UIm))/2. For the Roberta model, η0 is set to {40, 45, 50, 55, 60, 65, 70}, while for Llama, it is set to {500, 550, 600, 650, 700, 750, 800}."
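The quoted setup combines LoRA fine-tuning with a linearly decaying learning rate. As a minimal sketch of what those two pieces compute, the snippet below implements the LoRA weight update W + (alpha/r) * B @ A and a linear schedule in plain Python. This is not the authors' code (they use Hugging Face's Transformers and PEFT libraries); all matrices, shapes, and the alpha/r values are illustrative assumptions, with only the initial learning rate 3e-4 taken from the paper.

```python
# Sketch of the LoRA update applied by PEFT during fine-tuning (illustrative,
# not the paper's implementation).

def matmul(X, Y):
    """Multiply two matrices represented as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def lora_merged_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the effective weight after LoRA.

    W: frozen pretrained weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r), initialized to zeros
    """
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

def linear_lr(step, total_steps, lr0=3e-4):
    """Linear learning-rate decay from lr0 (the paper's 3e-4) down to zero."""
    return lr0 * max(0.0, 1.0 - step / total_steps)

# Tiny example: rank-1 update of a 2x2 identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # r = 1, d_in = 2
B = [[0.5], [0.25]]         # d_out = 2, r = 1
merged = lora_merged_weight(W, A, B, alpha=2, r=1)
print(merged)               # [[2.0, 2.0], [0.5, 2.0]]
print(linear_lr(0, 100))    # 0.0003
```

In PEFT this merge happens implicitly (the low-rank branch is added to the frozen layer's output), so only the small A and B matrices are trained, which is what makes LoRA parameter-efficient.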