ScaleOT: Privacy-utility-scalable Offsite-tuning with Dynamic LayerReplace and Selective Rank Compression
Authors: Kai Yao, Zhaorui Tan, Tiandi Ye, Lichun Li, Yuan Zhao, Wenyan Liu, Wei Wang, Jianke Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show that ScaleOT can achieve nearly lossless offsite-tuning performance compared with full fine-tuning while obtaining better model privacy. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, Hangzhou, China; 2Ant Group; 3University of Liverpool, Liverpool, United Kingdom; 4East China Normal University, Shanghai, China |
| Pseudocode | No | The paper describes methods like 'Dynamic Layer Replace' and 'Selective Rank Compression' in prose, explaining the steps and rationale, but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions '1https://github.com/EleutherAI/lm-evaluation-harness', which is a tool used for evaluation, not the open-source code for the ScaleOT methodology described in this paper. There is no explicit statement about releasing the code for the authors' proposed method. |
| Open Datasets | Yes | Models and Datasets. We evaluate our method on large language models, including GPT-2-XL (Radford et al. 2019), OPT-1.3B (Zhang et al. 2023b), OPT-6.7B (Zhang et al. 2023b) and LLaMA (Touvron et al. 2023). We validate our method across one generation task WikiText (Merity et al. 2017), and eight question answering benchmarks: OBQA (Mihaylov et al. 2018), PIQA (Bisk et al. 2020), ARC (Clark et al. 2018), HellaSwag (Zellers et al. 2019), SciQ (Welbl, Liu, and Gardner 2017), WebQuestions (Berant et al. 2013) and RACE (Lai et al. 2017). In the training of the Dynamic Layer Replace, we utilize the Pile corpus (Gao et al. 2020) datasets for language. |
| Dataset Splits | No | The paper mentions using "WikiText" and various question answering benchmarks, and states "For a fair comparison, we adopt the same evaluation metric used in previous studies (Xiao, Lin, and Han 2023)." It also references `lm-evaluation-harness` for evaluation. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for dataset splits for any of the datasets used, nor does it cite a source that defines the exact splits needed for reproduction. |
| Hardware Specification | Yes | All experiments are conducted on a workstation with 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'lm-evaluation-harness' for language model evaluation and the 'AdamW optimizer,' but it does not specify any software libraries or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For Dynamic Layer Replace, Nc = 3 and Ng = 4 are set empirically. For the harmonizer, we utilize a simple low-rank FFN with ReLU activation and a rank of 64 and 256 for medium- and large-size LLMs, respectively. For the construction of emulators, we set α = 0.25 and β = 0.8 by default to balance the privacy-utility trade-off, unless otherwise specified. For fair comparison, Na is set to be consistent with OT (Xiao, Lin, and Han 2023), meaning that only about 10% of the parameters are tuned, as opposed to full fine-tuning. For the offsite tuning phase, we employ the AdamW optimizer, experimenting with a range of learning rates: [2e-5, 5e-5, 1e-4, 2e-4, 3e-4]. |
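The two components named in the setup row can be illustrated with a short sketch. The paper does not release code, so the snippet below is a minimal NumPy stand-in, not the authors' implementation: `low_rank_ffn` mirrors the described harmonizer (a rank-r bottleneck FFN with ReLU, r = 64 or 256), and `compress_rank` approximates Selective Rank Compression by keeping the top β fraction of singular values (β = 0.8 by default). All function and variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_ffn(x, w_down, w_up):
    """Harmonizer sketch: rank-r bottleneck FFN with ReLU.

    w_down: (d, r) projection into the rank-r subspace,
    w_up:   (r, d) projection back to model width d.
    """
    return np.maximum(x @ w_down, 0.0) @ w_up

def compress_rank(w, beta):
    """Selective Rank Compression sketch: truncated SVD keeping
    the top `beta` fraction of singular values of a weight matrix."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    k = max(1, int(beta * len(s)))  # e.g. beta=0.8 keeps 80% of ranks
    return (u[:, :k] * s[:k]) @ vt[:k]
```

For a hidden width d and rank r, the harmonizer adds only 2·d·r parameters versus d² for a full FFN layer (e.g. d = 2048, r = 64 gives roughly a 16x reduction), which is consistent with the "only about 10% of the parameters are tuned" constraint.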