Characterizing the Training Dynamics of Private Fine-tuning with Langevin diffusion

Authors: Shuqi Ke, Charlie Hou, Sewoong Oh, Giulia Fanti

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features, based on both theoretical and empirical results. We identify the cause of the distortion as misalignment between the pre-trained backbone and the randomly initialized linear head. We prove that a sequential fine-tuning strategy can mitigate the feature distortion: first-linear-probing-then-fine-tuning (DP-LP-FFT). A new approximation scheme allows us to derive approximate upper and lower bounds on the training loss of DP-LP and DP-FFT in a simple but canonical setting: 2-layer neural networks with ReLU activation. Experiments on real-world datasets and architectures are consistent with our theoretical insights.
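To make the two-stage strategy concrete, here is a minimal NumPy sketch (not the authors' code) of DP-LP-FFT on a toy 2-layer ReLU network matching the paper's canonical setting: phase 1 trains only the linear head with DP-SGD on frozen backbone features (DP-LP), phase 2 fine-tunes all parameters (DP-FFT). All hyperparameters, dimensions, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU net f(x) = relu(x @ W1) @ w2, squared loss.
W1 = rng.normal(size=(4, 8)) * 0.5   # "pre-trained" backbone weights
w2 = rng.normal(size=8) * 0.5        # randomly initialized linear head

def loss(X, Y):
    pred = np.maximum(X @ W1, 0.0) @ w2
    return 0.5 * np.mean((pred - Y) ** 2)

def dp_sgd_step(X, Y, train_backbone, lr=0.1, clip=1.0, sigma=0.5):
    """One DP-SGD step: clip each per-example gradient to norm `clip`,
    sum, add Gaussian noise of scale sigma*clip, average, descend."""
    global W1, w2
    sum_W1, sum_w2 = np.zeros_like(W1), np.zeros_like(w2)
    for x, y in zip(X, Y):
        h_pre = x @ W1
        h = np.maximum(h_pre, 0.0)
        err = h @ w2 - y                       # scalar residual
        g_w2 = err * h                          # head gradient
        g_W1 = (np.outer(x, err * w2 * (h_pre > 0))
                if train_backbone else np.zeros_like(W1))
        norm = np.sqrt((g_W1 ** 2).sum() + (g_w2 ** 2).sum())
        scale = min(1.0, clip / (norm + 1e-12))  # per-example clipping
        sum_W1 += scale * g_W1
        sum_w2 += scale * g_w2
    n = len(X)
    w2 = w2 - lr * (sum_w2 + sigma * clip * rng.normal(size=w2.shape)) / n
    if train_backbone:
        W1 = W1 - lr * (sum_W1 + sigma * clip * rng.normal(size=W1.shape)) / n

X = rng.normal(size=(32, 4))
Y = np.maximum(X @ rng.normal(size=(4, 8)), 0.0) @ rng.normal(size=8)

# Phase 1 (DP-LP): train only the head on frozen backbone features.
for _ in range(20):
    dp_sgd_step(X, Y, train_backbone=False)
# Phase 2 (DP-FFT): fine-tune backbone and head together.
for _ in range(20):
    dp_sgd_step(X, Y, train_backbone=True)

print(f"final loss: {loss(X, Y):.3f}")
```

The design point of the paper is visible in the structure: phase 1 aligns the head with the frozen features before phase 2's noisy full-parameter updates can distort the backbone.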
Researcher Affiliation | Academia | Shuqi Ke (EMAIL), Carnegie Mellon University; Charlie Hou (EMAIL), Carnegie Mellon University; Sewoong Oh (EMAIL), University of Washington; Giulia Fanti (EMAIL), Carnegie Mellon University
Pseudocode | No | The paper describes methodologies and theoretical approaches, but it does not include any explicitly labeled pseudocode or algorithm blocks. Procedures are described within the main text without structured code-like formatting.
Open Source Code | No | Reproducibility Statement: "We have included full proofs for all theoretical results and sufficient experimental details in appendices to reproduce our results. We will also release our code under a permissive open-source license upon acceptance."
Open Datasets | Yes | "We pre-train Vision Transformer (ViT) and ResNet-50 backbones on ImageNet-1K using Self-Supervised Learning methods... Then we fine-tune the backbone with a linear classification head on CIFAR-10 and STL-10 using DP-SGD." ImageNet-1K: Russakovsky et al. (2015); STL-10: Coates et al. (2011); CIFAR-10: Krizhevsky (2009).
Dataset Splits | Yes | "We pre-train Vision Transformer (ViT) and ResNet-50 backbones on ImageNet-1K... Then we fine-tune the backbone... on CIFAR-10 and STL-10... The training subset of STL-10 only contains 500 images."
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models. It describes experimental configurations but omits hardware specifications.
Software Dependencies | No | The paper implicitly relies on common deep learning frameworks and libraries through its methodology (e.g., training neural networks), but it does not name any software with version numbers (e.g., Python, PyTorch, or CUDA versions) needed to replicate the experiments.
Experiment Setup | Yes | "For experiments in Table 1 and Table 2, we use clipping thresholds C=0.1 and C=1, use batch size 1000 and sweep over learning rates {9, 5, 1, 0.5, 0.2, 0.15, 0.1, 0.05, 0.025}. ... We conduct public pre-training for 100 epochs with a batch size of 256. Following this, we implement DP-SGD ... for 30 epochs. Each DP fine-tuning process is repeated with 5 random seeds and a batch size of 1000. ... our LoRA rank is set to 8."
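The quoted setup can be restated as a hyperparameter grid; the sketch below is a hypothetical restatement for clarity (the key names are illustrative, not from the paper's code) and enumerates the (clipping threshold, learning rate) sweep it implies.

```python
from itertools import product

# Hypothetical config restating the paper's reported fine-tuning setup.
dp_finetune_grid = {
    "clipping_threshold": [0.1, 1.0],
    "learning_rates": [9, 5, 1, 0.5, 0.2, 0.15, 0.1, 0.05, 0.025],
    "dp_batch_size": 1000,
    "dp_epochs": 30,
    "pretrain_epochs": 100,
    "pretrain_batch_size": 256,
    "num_seeds": 5,
    "lora_rank": 8,
}

# The sweep pairs each clipping threshold with each learning rate.
runs = list(product(dp_finetune_grid["clipping_threshold"],
                    dp_finetune_grid["learning_rates"]))
print(len(runs))  # 2 thresholds x 9 learning rates = 18 combinations
```

Each of the 18 (C, lr) combinations would then be repeated over the 5 random seeds per model/dataset pair.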