Private Regression via Data-Dependent Sufficient Statistic Perturbation

Authors: Cecilia Ferrando, Daniel Sheldon

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that DD-SSP outperforms the state-of-the-art data-independent SSP method AdaSSP for linear regression, and that for logistic regression tasks DD-SSP achieves better results than the widely used objective perturbation baseline. We also compare DD-SSP with DP-SGD (Abadi et al., 2016), known to achieve excellent performance when hyperparameters are properly fine-tuned. Our results show that the proposed method is competitive with DP-SGD when the privacy cost of hyperparameter tuning is taken into account. In our experiments, we evaluate the effectiveness of DD-SSP on both linear and logistic regression tasks. Figure 3 shows that DD-SSP and AIM-Synth have nearly identical performance and both improve significantly upon AdaSSP on all datasets except ACSIncome, where performance is similar.
Researcher Affiliation | Academia | Cecilia Ferrando (EMAIL), Manning College of Information and Computer Sciences, University of Massachusetts Amherst; Daniel Sheldon (EMAIL), Manning College of Information and Computer Sciences, University of Massachusetts Amherst
Pseudocode | Yes | Algorithm 1 (DD-SSP) outlines how to retrieve the approximate sufficient statistics X^T X and X^T y from marginals privately estimated by AIM. Algorithm 5 (AdaSSP) outlines the AdaSSP method for linear regression (Wang, 2018). Algorithm 6: Generalized Objective Perturbation Mechanism (ObjPert) (Kifer et al., 2012). Algorithm 2: AIM (McKenna et al., 2022). Algorithm 3: Initialize p_t (subroutine of Algorithm 2). Algorithm 4: Budget Annealing (subroutine of Algorithm 2).
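The core idea behind Algorithm 1, recovering entries of X^T X (and analogously X^T y) as expectations under privately estimated pairwise marginals of discretized features, can be illustrated with a minimal sketch. The helper name and the discrete-feature setup are our assumptions for illustration, not the paper's code:

```python
import numpy as np

def second_moment_from_marginal(marginal, vals_i, vals_j):
    """Estimate sum_n x_ni * x_nj from a (privately estimated) pairwise
    marginal, where marginal[a, b] is the estimated count of records
    with (x_i, x_j) = (vals_i[a], vals_j[b]).

    Each entry of X^T X is a weighted sum of marginal cells, so noisy
    marginals released by AIM directly yield noisy sufficient statistics.
    """
    return float(vals_i @ marginal @ vals_j)
```

With exact (noise-free) marginals this recovers the entry of X^T X exactly; with AIM's private marginals it yields the data-dependent approximation used by DD-SSP.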
Open Source Code | Yes | All experiment code is available at https://github.com/ceciliaferrando/DD-SSP.
Open Datasets | Yes | We use the following datasets: Adult (Becker and Kohavi, 1996): the target variable is num-education (number of education years) for linear regression and income>50K for logistic regression. Fire (Ridgeway et al., 2021): the target variable is Priority (of the call). Taxi (Grégoire et al., 2021): the target variable is totalamount (total fare amount). ACS datasets (Ding et al., 2021): data is queried for California (2018) and includes binary classification tasks for PINCP (income above $50k), MIG (mobility), ESR (employment), and PUBCOV (public coverage). ACSIncome is also used for linear regression with the target variable PINCP (income) discretized into 20 bins. The ACS data is sourced from https://github.com/socialfoundations/folktables; all other datasets are sourced from https://github.com/ryan112358/hd-datasets.
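The 20-bin discretization of PINCP for the linear-regression variant of ACSIncome could be implemented along these lines; the paper does not state the binning scheme, so the quantile-based choice and function name here are assumptions:

```python
import numpy as np

def discretize_target(y, n_bins=20):
    """Discretize a continuous target (e.g., PINCP income) into n_bins
    quantile bins, returning integer bin labels in [0, n_bins - 1]."""
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
    labels = np.searchsorted(edges, y, side="right") - 1
    return np.clip(labels, 0, n_bins - 1)
```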
Dataset Splits | Yes | Data is shuffled and split into 1,000 test points and up to 50,000 training points.
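The split described above can be sketched as follows; the function name, seed handling, and NumPy-array interface are our assumptions:

```python
import numpy as np

def train_test_split_dp(X, y, n_test=1000, n_train_max=50_000, seed=0):
    """Shuffle the data, hold out n_test test points, and cap the
    training set at n_train_max points, as described in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    test_idx = idx[:n_test]
    train_idx = idx[n_test:n_test + n_train_max]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```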
Hardware Specification | Yes | All experiments were conducted on an internal cluster equipped with Xeon Gold 6240 CPUs @ 2.60GHz, 192GB RAM, and 240GB of local SSD storage.
Software Dependencies | No | The paper mentions several algorithms and methods, such as AIM (McKenna et al., 2022) and DP-SGD (Abadi et al., 2016), but does not specify the versions of the software or libraries used to implement them (e.g., Python, PyTorch, TensorFlow, or scikit-learn with version numbers).
Experiment Setup | Yes | DP-SGD's hyperparameters are fine-tuned via a grid search over the following values: batch size: [n, 1024, 256]; gradient clipping norm: [0.01, 0.1, 0.2]; number of epochs: [1, 10, 20]; learning rate: [0.001, 0.01, 0.1, 1.0]. AIM training: AIM is trained with a model size of 200MB, a maximum of 1,000 iterations, and a workload of all pairwise marginals. We compare the Mean Squared Error (MSE) of the DP query-based methods DD-SSP and AIM-Synth against the DP baseline AdaSSP and the public baseline for ϵ ∈ {0.05, 0.1, 0.5, 1.0, 2.0}, with fixed δ = 10⁻⁵.
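The DP-SGD search space above can be enumerated with a short sketch; this illustrates the size of the grid (relevant because tuning on private data itself consumes privacy budget), and is not the authors' tuning code:

```python
from itertools import product

n = 50_000  # hypothetical training-set size (full-batch option)
batch_sizes = [n, 1024, 256]
clip_norms = [0.01, 0.1, 0.2]
epochs = [1, 10, 20]
learning_rates = [0.001, 0.01, 0.1, 1.0]

# 3 * 3 * 3 * 4 = 108 candidate configurations, each of which adds to
# the privacy cost if hyperparameters are tuned on the private data.
grid = list(product(batch_sizes, clip_norms, epochs, learning_rates))
```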