Breaking the Quadratic Barrier: Robust Cardinality Sketches for Adaptive Queries

Authors: Edith Cohen, Mihir Singhal, Uri Stemmer

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the effectiveness of our fine-grained approach by comparing the number of queries that can be guaranteed compared with the baseline per-query analysis. We use synthetically generated query sets sampled from Uniform and Pareto distributions with α {1.5, 2} and xm = 1, support size of 10^6 and query set size of 5 * 10^3. For each sketch size k, we match a value of the parameter r = 0.002k^2 (with Robust Est and TRobust Est) and respectively t = 0.002k^2 with baseline analysis. Figure 1 reports the number of queries with the baseline and fine-grained estimators. The respective gain factor is measured by the ratio of the number of queries that can be effectively answered with per-key analysis to the baseline. We observe gains of two orders of magnitude for uniformly sampled query sets. This hold even without tracking using Robust Est. For Pareto query sets, we only TRobust Est, as tracking is necessary because some keys appear in many query sketches. We observe gains of 12 for the very skewed α = 1.5 and 40 with α = 2.
Researcher Affiliation Collaboration 1Google Research 2School of Computer Science, Tel Aviv University, Israel 3School of Computer Science, UC Berkeley, Berkeley, CA, USA. Correspondence to: Edith Cohen <EMAIL>, Mihir Singhal <EMAIL>, Uri Stemmer <EMAIL>.
Pseudocode Yes Algorithm 1: Linear Queries with Individual Privacy Charging Algorithm 2: Bottom-k Cardinality Sketch and Standard Estimator Algorithm 3: Basic Robust Cardinality Estimator Algorithm 4: Tracking Robust Estimator
Open Source Code No The paper does not contain any explicit statements or links indicating that the source code for the methodology described in this paper is publicly available.
Open Datasets No The paper mentions "synthetically generated query sets sampled from Uniform and Pareto distributions". This describes the method of data generation and the types of distributions used, but does not refer to a specific, publicly accessible dataset nor provides any access information (link, DOI, repository, or citation) for a dataset.
Dataset Splits No The paper mentions "synthetically generated query sets sampled from Uniform and Pareto distributions with α {1.5, 2} and xm = 1, support size of 10^6 and query set size of 5 * 10^3". This describes characteristics of the generated data, but it does not specify any training, validation, or test splits for this data, nor does it refer to any standard dataset splits.
Hardware Specification No The paper does not provide any specific details about the hardware used for running its experiments, such as GPU models, CPU models, or other computing resources.
Software Dependencies No The paper does not specify any software dependencies with version numbers that would be needed to replicate the experiments.
Experiment Setup Yes We use synthetically generated query sets sampled from Uniform and Pareto distributions with α {1.5, 2} and xm = 1, support size of 10^6 and query set size of 5 * 10^3. For each sketch size k, we match a value of the parameter r = 0.002k^2 (with Robust Est and TRobust Est) and respectively t = 0.002k^2 with baseline analysis.