Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generalized Linear Models in Non-interactive Local Differential Privacy with Public Data

Authors: Di Wang, Lijie Hu, Huanyu Zhang, Marco Gaboardi, Jinhui Xu

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets."
Researcher Affiliation | Collaboration | Di Wang (CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia); Lijie Hu (CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia); Huanyu Zhang (Meta, New York, NY, USA); Marco Gaboardi (Department of Computer Science, Boston University, Boston, MA 02215, USA); Jinhui Xu (Department of Computer Science and Engineering, University at Buffalo, SUNY, Buffalo, NY 14260, USA)
Pseudocode | Yes | Algorithm 1: Non-interactive LDP for smooth GLMs with public data (Gaussian); Algorithm 2: Non-interactive LDP for smooth GLMs with public data (General); Algorithm 3: Non-interactive LDP for smooth non-linear regression with public data (Gaussian); Algorithm 4: Non-interactive LDP for smooth non-linear regression with public data (General); Algorithm 5: 2-round LDP for smooth GLMs with public data (Gaussian)
Open Source Code | No | The paper mentions using the "Logistic Regression classifier in the scikit-learn library (Pedregosa et al., 2011)" and "standard gradient descent as the baseline method", but does not provide access to the authors' own implementation of the described methodology.
Open Datasets | Yes | "We conduct experiments on binary logistic regression for GLMs on the Covertype dataset (Dua and Graff, 2017), the SUSY dataset (Baldi et al., 2014) and the Skin Segmentation dataset (Dua and Graff, 2017)."
Dataset Splits | Yes | "We divide the data into training data and test data, where n_training = 350,000 and n_testing = 200,000 (other data will be used as the public unlabeled data)... For the SUSY dataset... n_training = 450,000 and n_testing = 30,000... For the Skin Segmentation dataset... n_training = 180,000 and n_testing = 5,000."
Hardware Specification | No | The paper does not provide hardware details (e.g., GPU/CPU models, memory) for its experiments; it only discusses experimental settings and software.
Software Dependencies | No | The paper mentions the "Logistic Regression classifier in the scikit-learn library (Pedregosa et al., 2011)" but does not specify its version or any other versioned software dependencies.
Experiment Setup | Yes | "For privacy parameters, we will choose ϵ between 4 and 20 and set δ = 1/n^1.1. For dimension p, we choose from the set {5, 10, 15, 20, 25, 30, 40, 50, 60}. For different experiments, we will vary the private sample size n. However, we will always set the size of the public unlabeled data m to be smaller than n; unless otherwise specified, we set m = n/p²."
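The quoted setup above can be sketched as a small configuration helper. This is a minimal illustration, assuming the reconstructed formulas δ = 1/n^1.1 and m = n/p² from the quoted text; the function and variable names are hypothetical and not taken from the authors' (unreleased) code.

```python
def experiment_config(n_total, n_training, n_testing, p, epsilon=4):
    """Split sizes and privacy parameters for one run, per the quoted setup.

    Illustrative only: names and structure are assumptions, not the paper's code.
    """
    n = n_training                                # private (labeled) sample size
    delta = 1.0 / n ** 1.1                        # delta = 1/n^1.1
    m = n // p ** 2                               # public unlabeled size m = n/p^2 < n
    n_public_pool = n_total - n_training - n_testing  # remaining rows form the public pool
    assert m <= n_public_pool, "public pool must be large enough to draw m points"
    return {"n": n, "delta": delta, "m": m, "epsilon": epsilon}

# Covertype-style setting: 581,012 rows total, 350,000 train / 200,000 test
cfg = experiment_config(n_total=581_012, n_training=350_000,
                        n_testing=200_000, p=10)
print(cfg["m"])   # 3500 public unlabeled points when p = 10
```

Note that m = n/p² shrinks quickly with the dimension, so the public unlabeled sample is always far smaller than the private training set, consistent with the constraint m < n stated in the quote.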