Towards Large Scale Transfer Learning for Differentially Private Image Classification
Authors: Harsh Mehta, Abhradeep Guha Thakurta, Alexey Kurakin, Ashok Cutkosky
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we zoom in on the ImageNet dataset and demonstrate that, similar to the non-private case, pre-training over-parameterized models on a large public dataset can lead to substantial gains when the models are finetuned privately. Moreover, by systematically comparing private and non-private models across a range of large batch sizes, we find that, similar to the non-private setting, the choice of optimizer can further improve performance substantially with DP. By using the LAMB optimizer, we saw improvements of up to 20 percentage points (absolute). We also show that finetuning just the last layer for a single step in the full batch setting, combined with extremely small-scale (near-zero) initialization, leads to both SOTA results of 81.7% under a wide privacy budget range of ε ∈ [4, 10] and δ = 10⁻⁶ while minimizing the computational overhead substantially. Finally, we present additional results on CIFAR-10 and CIFAR-100, surpassing previous state of the art by leveraging transfer learning with our recommendations. |
| Researcher Affiliation | Collaboration | Harsh Mehta (Google Research), Abhradeep Thakurta (Google Research), Alexey Kurakin (Google Research), Ashok Cutkosky (Boston University) |
| Pseudocode | Yes | A Algorithmic details. We present below a generalized version of DP-SGD where the gradients are processed in the traditional DP-SGD fashion and are then passed to a first-order optimizer as an input. This lets us instantiate DP versions of well-known optimizers like SGD, Momentum, Adam, and LAMB. We prepend the optimizer's name with DP to denote that the gradients were first processed as shown in Algorithm 1 and then passed to the said optimizer. Algorithm 1: Generalized First Order Differentially Private Algorithm |
| Open Source Code | Yes | Code: https://github.com/google-research/google-research/tree/master/dp_transfer |
| Open Datasets | Yes | Datasets. We use the ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1k classes and 1.3M images (we refer to it as ImageNet in what follows) as our final evaluation dataset. However, we provide supplementary results in Section F where we evaluate on 2 additional datasets, namely CIFAR-10 and CIFAR-100. |
| Dataset Splits | No | We finetune on the ImageNet train split and present the Top-1 accuracies we obtain from the official test split. The paper implies standard splits but does not provide explicit percentages, counts, or a citation detailing the splits used. |
| Hardware Specification | Yes | Finally, we conduct our experiments on TPUv4 architecture. All our models were pre-trained using TPUv4 hardware with exact amounts depending on the model. All models were trained using 64 TPUv4 cores. |
| Software Dependencies | No | Our implementation relies on the Tensorflow Privacy codebase for conversion of (ε, δ) and clipping norm C to/from noise multiplier σ. We conduct all our experiments using the Scenic library (Dehghani et al., 2021) for high-quality reproducible implementations of both ResNet (BiT) and Vision Transformers. Scenic, in turn, uses Flax (Heek et al., 2020) for many of the layer definitions. For the privacy accounting, we rely on the default Rényi accountant implementation already open-sourced as part of the Tensorflow Privacy library. No specific version numbers are provided for these software dependencies (Tensorflow Privacy, Jax, Scenic, Flax). |
| Experiment Setup | Yes | At the pre-training stage, we stick with the common practice of employing the Adam optimizer (even for ResNet) (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.999, with a batch size of 4096 and high weight decay of 0.1 unless mentioned otherwise. We train with sigmoid cross-entropy loss and use linear learning rate warmup until 10k steps, followed by linear decay until the end of training. For our private finetuning experiments, we stick with a reasonably stringent privacy guarantee of ε = 10 and δ = 10⁻⁶, unless specified otherwise. We use DP-SGD privacy analysis to compute the noise multiplier. To limit other confounding factors we set the clipping norm C to 1. Also, since training with DP-SGD is computationally expensive, we finetune on ImageNet for at most 10 epochs. Finally, when training the last layer with DP we found it crucial to initialize the last layer weights to zero (or a small value). (Section 4, Training details) Additionally, Tables 7 and 8 provide detailed finetuning hyperparameters, and Section E.3 describes the setup for Figure 1b, including single-step, full-batch training, zero initialization, and specific input resolutions. |
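The generalized DP step the paper describes (Algorithm 1: clip per-example gradients, average, add Gaussian noise, then hand the result to any first-order optimizer such as SGD, Momentum, Adam, or LAMB) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation; the function name and the `rng` parameter are our own.

```python
import numpy as np

def dp_process_gradients(per_example_grads, clip_norm=1.0,
                         noise_multiplier=1.0, rng=None):
    """Sketch of the generalized DP-SGD gradient step.

    Each per-example gradient is clipped to L2 norm `clip_norm`,
    the clipped gradients are averaged, and Gaussian noise with
    std = noise_multiplier * clip_norm / batch_size is added.
    The returned vector can be fed to any first-order optimizer
    (SGD, Momentum, Adam, LAMB) to get its DP-prefixed variant.
    """
    rng = rng or np.random.default_rng(0)
    batch = per_example_grads.shape[0]
    # Per-example L2 norms, flattening any parameter shape.
    norms = np.linalg.norm(per_example_grads.reshape(batch, -1), axis=1)
    # Scale factor min(1, C / ||g_i||) clips without changing direction.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale.reshape(batch, *([1] * (per_example_grads.ndim - 1)))
    mean_grad = clipped.mean(axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / batch,
                       size=mean_grad.shape)
    return mean_grad + noise
```

With `noise_multiplier=0` the function reduces to plain clipped-gradient averaging, which makes the clipping behavior easy to check in isolation.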
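The paper's key recipe (privately finetuning only the last layer for a single step on the full batch, with the layer initialized to zero) can also be sketched. The code below is an illustrative toy under our own assumptions (softmax cross-entropy on frozen features, function and argument names are hypothetical), not the paper's code: at zero init the logits are zero and the predicted distribution is uniform, so the per-example gradient with respect to the weights is simply xᵢ(pᵢ − yᵢ)ᵀ, which is then clipped, averaged, noised, and applied in one step.

```python
import numpy as np

def private_linear_probe_step(features, labels_onehot, clip_norm=1.0,
                              noise_multiplier=0.0, lr=1.0, rng=None):
    """One DP full-batch step on a zero-initialized last layer.

    `features` are frozen pre-trained representations (n, d);
    `labels_onehot` is (n, k). Returns the weight matrix after a
    single noisy clipped-gradient update.
    """
    rng = rng or np.random.default_rng(0)
    n, d = features.shape
    k = labels_onehot.shape[1]
    W = np.zeros((d, k))                      # near-zero init, as recommended
    logits = features @ W                     # all zeros at init
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Per-example gradient of softmax cross-entropy w.r.t. W: x_i (p_i - y_i)^T
    per_ex = features[:, :, None] * (probs - labels_onehot)[:, None, :]
    norms = np.linalg.norm(per_ex.reshape(n, -1), axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_ex * scale[:, None, None]
    g = clipped.mean(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm / n, size=(d, k))
    return W - lr * g
```

Because the batch is the full dataset and only one step is taken, only a single noisy gradient release is needed, which is what keeps the computational (and privacy-accounting) overhead small in this regime.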