On Fine-Grained Distinct Element Estimation

Authors: Ilias Diakonikolas, Daniel Kane, Jasper C.H. Lee, Thanasis Pittas, David Woodruff, Samson Zhou

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we describe our empirical evaluations for evaluating our distributed protocol for distinct element estimation. We used the CAIDA dataset (CAIDA, 2016)... Our empirical evaluations were performed with Python 3.11.5 on a 64-bit operating system on an Intel(R) Core(TM) i7-3770 CPU, with 16GB RAM and 4 cores with base clock 3.4GHz.
Researcher Affiliation Academia (1) University of Wisconsin-Madison; (2) University of California, San Diego; (3) University of California, Davis; (4) Carnegie Mellon University; (5) Texas A&M University. Correspondence to: Ilias Diakonikolas <EMAIL>, Daniel M. Kane <EMAIL>, Jasper C.H. Lee <EMAIL>, Thanasis Pittas <EMAIL>, David P. Woodruff <EMAIL>, Samson Zhou <EMAIL>.
Pseudocode Yes Algorithm 1 (1 + ε)-approximation to F0; Algorithm 2 (1 + ε)-approximation to F0, given an upper bound on the number of collisions; Algorithm 3 Parameterized streaming algorithm for distinct element estimation using robust statistics; Algorithm 4 Parameterized distinct element estimation over two-pass streams; Algorithm 5 Parameterized distinct element estimation over two-pass streams
Open Source Code Yes The code is publicly available at https://github.com/samsonzhou/DKLPWZ25.
Open Datasets Yes We used the CAIDA dataset (CAIDA, 2016), which consists of anonymized passive traffic traces collected from the high-speed monitor at the equinix-nyc data center.
Dataset Splits No The paper describes extracting 1 million events from a larger dataset for analysis and partitioning data across receiver IP addresses, but it does not specify explicit training, validation, or test splits for experimental reproduction.
Hardware Specification Yes Our empirical evaluations were performed with Python 3.11.5 on a 64-bit operating system on an Intel(R) Core(TM) i7-3770 CPU, with 16GB RAM and 4 cores with base clock 3.4GHz.
Software Dependencies Yes Our empirical evaluations were performed with Python 3.11.5 on a 64-bit operating system...
Experiment Setup Yes Correspondingly, we set our algorithm to also have accuracy O(ε) and compare the communication across various values of ε = 1/2^p, with p ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. These results appear in Figure 2a. Finally, we studied the accuracy of our distributed protocol. We evaluated the output of our algorithm for ε = 1/2^p, across p ∈ {0, 1, 2, 3, 4, 5}, and computed the error with respect to the true number of unique sender IP addresses, which totaled 42200.
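
The parameter sweep described in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the helper names (epsilon_grid, relative_error, within_guarantee) are hypothetical, and the check assumes the common reading of a (1 + ε)-approximation as |estimate - F0| <= ε * F0. The only value taken from the source is the true count of 42200 unique sender IP addresses.

```python
# Illustrative sketch only; helper names are hypothetical and not from the paper's repository.

TRUE_DISTINCT = 42200  # true number of unique sender IP addresses, as reported in the setup


def epsilon_grid(exponents):
    """Map each exponent p to the accuracy parameter eps = 1 / 2^p."""
    return {p: 2.0 ** -p for p in exponents}


def relative_error(estimate, truth=TRUE_DISTINCT):
    """Relative error of an estimate with respect to the true distinct count."""
    return abs(estimate - truth) / truth


def within_guarantee(estimate, eps, truth=TRUE_DISTINCT):
    """Assumed (1 + eps)-approximation check: |estimate - truth| <= eps * truth."""
    return relative_error(estimate, truth) <= eps


# Grid used for the communication comparison: p in {2, ..., 11}.
communication_eps = epsilon_grid(range(2, 12))

# Grid used for the accuracy study: p in {0, ..., 5}.
accuracy_eps = epsilon_grid(range(0, 6))

if __name__ == "__main__":
    for p, eps in accuracy_eps.items():
        print(f"p={p}: eps={eps}")
```

Given an estimate produced by any implementation of the protocol, within_guarantee(estimate, accuracy_eps[p]) reports whether the estimate falls inside the assumed error band for that p.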