ScaleOT: Privacy-utility-scalable Offsite-tuning with Dynamic LayerReplace and Selective Rank Compression
Authors: Kai Yao, Zhaorui Tan, Tiandi Ye, Lichun Li, Yuan Zhao, Wenyan Liu, Wei Wang, Jianke Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show that ScaleOT can achieve nearly lossless offsite-tuning performance compared with full fine-tuning while obtaining better model privacy. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, Hangzhou, China; 2Ant Group; 3University of Liverpool, Liverpool, United Kingdom; 4East China Normal University, Shanghai, China |
| Pseudocode | No | The paper describes methods like 'Dynamic Layer Replace' and 'Selective Rank Compression' in prose, explaining the steps and rationale, but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions '1https://github.com/EleutherAI/lm-evaluation-harness', which is a tool used for evaluation, not the open-source code for the ScaleOT methodology described in this paper. There is no explicit statement about releasing the code for the authors' proposed method. |
| Open Datasets | Yes | Models and Datasets. We evaluate our method on large language models, including GPT-2-XL (Radford et al. 2019), OPT-1.3B (Zhang et al. 2023b), OPT-6.7B (Zhang et al. 2023b) and LLaMA (Touvron et al. 2023). We validate our method across one generation task WikiText (Merity et al. 2017), and eight question answering benchmarks: OBQA (Mihaylov et al. 2018), PIQA (Bisk et al. 2020), ARC (Clark et al. 2018), HellaSwag (Zellers et al. 2019), SciQ (Welbl, Liu, and Gardner 2017), WebQuestions (Berant et al. 2013) and RACE (Lai et al. 2017). In the training of the Dynamic Layer Replace, we utilize the Pile corpus (Gao et al. 2020) datasets for language. |
| Dataset Splits | No | The paper mentions using "WikiText" and various question answering benchmarks, and states "For a fair comparison, we adopt the same evaluation metric used in previous studies (Xiao, Lin, and Han 2023)." It also references `lm-evaluation-harness` for evaluation. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for dataset splits for any of the datasets used, nor does it cite a source that defines the exact splits needed for reproduction. |
| Hardware Specification | Yes | All experiments are conducted on a workstation with 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'lm-evaluation-harness' for language model evaluation and the 'AdamW optimizer,' but it does not specify any software libraries or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For Dynamic Layer Replace, Nc = 3 and Ng = 4 are set empirically. For the harmonizer, we utilize a simple low-rank FFN with ReLU activation and a rank of 64 and 256 for medium- and large-size LLMs, respectively. For the construction of emulators, we set α = 0.25 and β = 0.8 by default to balance the privacy-utility trade-off, unless otherwise specified. For fair comparison, Na is set to be consistent with OT (Xiao, Lin, and Han 2023), meaning that only about 10% of the parameters are tuned, as opposed to full fine-tuning. For the offsite tuning phase, we employ the AdamW optimizer, experimenting with a range of learning rates: [2e-5, 5e-5, 1e-4, 2e-4, 3e-4]. |
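The two components named in the setup row can be illustrated with a short sketch. The paper does not release code, so the snippet below is a minimal NumPy stand-in, not the authors' implementation: `low_rank_ffn` mirrors the described harmonizer (a rank-r bottleneck FFN with ReLU, r = 64 or 256), and `compress_rank` approximates Selective Rank Compression by keeping the top β fraction of singular values (β = 0.8 by default). All function and variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_ffn(x, w_down, w_up):
    """Harmonizer sketch: rank-r bottleneck FFN with ReLU.

    w_down: (d, r) projection into the rank-r subspace,
    w_up:   (r, d) projection back to model width d.
    """
    return np.maximum(x @ w_down, 0.0) @ w_up

def compress_rank(w, beta):
    """Selective Rank Compression sketch: truncated SVD keeping
    the top `beta` fraction of singular values of a weight matrix."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    k = max(1, int(beta * len(s)))  # e.g. beta=0.8 keeps 80% of ranks
    return (u[:, :k] * s[:k]) @ vt[:k]
```

For a hidden width d and rank r, the harmonizer adds only 2·d·r parameters versus d² for a full FFN layer (e.g. d = 2048, r = 64 gives roughly a 16x reduction), which is consistent with the "only about 10% of the parameters are tuned" constraint.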