Differentially Private Federated $k$-Means Clustering with Server-Side Data

Authors: Jonathan Scott, Christoph H. Lampert, David Saulpic

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyds algorithm, we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. Our code can be found at https://github.com/jonnyascott/fed-dp-kmeans. We report on experiments for synthetic as well as real datasets in two settings: when we wish to preserve individual data-point privacy, as is common in cross-silo federated learning settings (Li et al., 2020), and client-level privacy, as is typically used in cross-device learning settings (McMahan et al., 2017). 5. Experiments We now present our empirical evaluation of Fed DP-KMeans, which we implemented using the pfl-research framework (Granqvist et al., 2024).
Researcher Affiliation Academia Jonathan Scott 1, Christoph H. Lampert 1, David Saulpic 2. *Equal contribution. 1 Institute of Science and Technology Austria (ISTA). 2 CNRS & Université Paris Cité, Paris, France. Correspondence to: Jonathan Scott <EMAIL>.
Pseudocode Yes Algorithm 1 Fed DP-Init. 1: Input: client datasets P_1, . . . , P_m, number of clusters k, privacy parameters ε1, ε2, ε3G, ε3L, δ. Algorithm 2 Fed DP-Lloyds. 1: Input: initial centers ν_1^0, . . . , ν_k^0, P, number of steps T, privacy parameters εG, εL, δ.
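Based only on the inputs listed above, one iteration of a federated DP-Lloyds update might look like the following sketch. This is an illustrative reconstruction, not the paper's released code: the function and parameter names are ours, the Gaussian noise scale uses the standard analytic Gaussian-mechanism bound, and the Laplace scale assumes unit count sensitivity for simplicity.

```python
import numpy as np

def dp_lloyds_step(client_datasets, centers, eps_g, eps_l, delta, clip_b, rng):
    """One illustrative federated DP-Lloyds iteration (sketch, not the
    authors' implementation).

    Each client assigns its points to the nearest center and reports
    clipped per-cluster sums and counts; the server adds Gaussian noise
    to the sums and Laplace noise to the counts, then recomputes centers.
    """
    k, d = centers.shape
    agg_sums = np.zeros((k, d))
    agg_counts = np.zeros(k)
    for points in client_datasets:
        # Nearest-center assignment for this client's points.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            s = points[labels == j].sum(axis=0)
            # Clip the per-cluster sum to norm at most clip_b, bounding
            # the sensitivity of the aggregation query by clip_b.
            norm = np.linalg.norm(s)
            if norm > clip_b:
                s = s * (clip_b / norm)
            agg_sums[j] += s
            agg_counts[j] += (labels == j).sum()
    # Gaussian mechanism on the sums (standard analytic calibration);
    # Laplace mechanism on the counts (unit sensitivity assumed here).
    sigma = clip_b * np.sqrt(2.0 * np.log(1.25 / delta)) / eps_g
    noisy_sums = agg_sums + rng.normal(0.0, sigma, size=agg_sums.shape)
    noisy_counts = agg_counts + rng.laplace(0.0, 1.0 / eps_l, size=k)
    # Guard against non-positive noisy counts before dividing.
    return noisy_sums / np.maximum(noisy_counts, 1.0)[:, None]
```

In a real federated deployment the clipping would happen on each client before secure aggregation; the loop above simulates that server-side, which is how simulation frameworks such as pfl-research typically structure experiments.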
Open Source Code Yes Our code can be found at https://github.com/jonnyascott/fed-dp-kmeans.
Open Datasets Yes We additionally evaluate on US census data using the folktables (Ding et al., 2021) package. We also use the Stack Overflow dataset from Tensorflow Federated.
Dataset Splits No The paper describes how the data is generated and distributed among clients for the clustering task (e.g., "100 clients, with each client having 1000 datapoints"), but it does not specify explicit train/test/validation splits for evaluating a model's performance in a supervised learning context, which is typically what this question refers to. For k-means clustering, the entire dataset is generally used for clustering and then evaluated, rather than splitting into train/test sets.
Hardware Specification No The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running its experiments. It mentions using a framework for simulation but no underlying hardware.
Software Dependencies No The paper mentions several software components like the "pfl-research framework (Granqvist et al., 2024)", "folktables (Ding et al., 2021) package", "Tensorflow Federated", "Google's dp accounting library", and a "pre-trained sentence embedding model (Reimers & Gurevych, 2019)". However, it does not provide specific version numbers for any of these software dependencies.
Experiment Setup Yes Mixture of Gaussians Datasets We generate a mixture of Gaussians in the following way. We set the data dimension to d = 100 and generate k = 10 mixture components by sampling k means {µ1, . . . , µk} uniformly at random from [0, 1]^d. Each component has diagonal covariance matrix Σi = 0.5·Id and equal mixture weight wi = 1/k. The server data is generated by combining samples from the true mixture distribution with additional data sampled uniformly at random from [0, 1]^d, representing related but out-of-distribution data. We sample 20 points from each mixture component, for a total of 20·k = 200 in-distribution points, and sample an additional 100 uniform points. H.1. Setting hyperparameters of Fed DP-KMeans In this section we analyze the hyperparameter settings of Fed DP-KMeans that produced the Pareto-optimal results shown in the figures in Sections 5.1 and 5.2. These analyses give us some insight into the best ways to set the hyperparameters when using Fed DP-KMeans in practice. Distributing the privacy budget The most important parameters to set are the values of epsilon in Parts 1-3 of Algorithm 1. Here we discuss how to set these. Let ε1, ε2, ε3G, and ε3L denote the epsilon we allow for Part 1, Part 2, the Gaussian query in Part 3, and the Laplace query in Part 3, respectively. We let εinit = ε1 + ε2 + ε3G + ε3L. Specifically, suppose vj is a vector quantity owned by client j, and the server wishes to compute the aggregate v = Σj vj. Prior to aggregation, each client vector is clipped to have norm at most B: v̂j = (B/‖vj‖)·vj if ‖vj‖ > B, and v̂j = vj otherwise. The aggregate is then computed as v̂ = Σj v̂j. This query has sensitivity B, and noise can be added accordingly. Each step of our algorithms can be expressed as such an aggregation over client statistics, so the value of B for each step becomes a hyperparameter of the algorithm.
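The clipped-aggregation query described above can be sketched in a few lines. This is an illustrative version: `clip_to_norm` and `noisy_aggregate` are our own names, and the Gaussian-mechanism calibration shown is one standard choice, not necessarily the accounting the paper actually uses.

```python
import numpy as np

def clip_to_norm(v, b):
    """Rescale v so that ||v|| <= b: v_hat = (b / ||v||) * v if ||v|| > b,
    and v_hat = v otherwise (the clipping rule described in the text)."""
    norm = np.linalg.norm(v)
    return v * (b / norm) if norm > b else v

def noisy_aggregate(client_vectors, b, eps, delta, rng):
    """Clip each client vector to norm b and sum the results. The sum
    then has sensitivity b, so Gaussian noise with the standard scale
    sigma = b * sqrt(2 ln(1.25/delta)) / eps (an illustrative choice)
    suffices for an (eps, delta)-DP release."""
    total = sum(clip_to_norm(v, b) for v in client_vectors)
    sigma = b * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return total + rng.normal(0.0, sigma, size=total.shape)
```

This makes the role of B concrete: a smaller B means less noise but more aggressive distortion of large client contributions, which is why B is a per-step hyperparameter.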
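The Mixture of Gaussians recipe quoted above (d = 100, k = 10, diagonal covariance 0.5·Id, 20 in-distribution points per component plus 100 uniform out-of-distribution points for the server) can be reproduced with a short NumPy sketch; the variable names here are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 10  # data dimension and number of mixture components

# Mixture means sampled uniformly at random from [0, 1]^d.
means = rng.uniform(0.0, 1.0, size=(k, d))

# Server data, part 1: 20 in-distribution points per component. The
# covariance is 0.5 * I_d, i.e. per-coordinate standard deviation sqrt(0.5).
in_dist = np.concatenate(
    [mu + rng.normal(0.0, np.sqrt(0.5), size=(20, d)) for mu in means]
)

# Server data, part 2: 100 uniform points standing in for related but
# out-of-distribution data.
ood = rng.uniform(0.0, 1.0, size=(100, d))
server_data = np.concatenate([in_dist, ood])  # 200 + 100 = 300 points
```

Client data (e.g. the "100 clients, with each client having 1000 datapoints" mentioned in the Dataset Splits row) would be drawn from the same mixture with equal weights wi = 1/k.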