On Fine-Grained Distinct Element Estimation
Authors: Ilias Diakonikolas, Daniel Kane, Jasper C.H. Lee, Thanasis Pittas, David Woodruff, Samson Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we describe our empirical evaluations for evaluating our distributed protocol for distinct element estimation. We used the CAIDA dataset (CAIDA, 2016)... Our empirical evaluations were performed with Python 3.11.5 on a 64-bit operating system on an Intel(R) Core(TM) i7-3770 CPU, with 16GB RAM and 4 cores with base clock 3.4GHz. |
| Researcher Affiliation | Academia | ¹University of Wisconsin-Madison; ²University of California, San Diego; ³University of California, Davis; ⁴Carnegie Mellon University; ⁵Texas A&M University. Correspondence to: Ilias Diakonikolas <EMAIL>, Daniel M. Kane <EMAIL>, Jasper C.H. Lee <EMAIL>, Thanasis Pittas <EMAIL>, David P. Woodruff <EMAIL>, Samson Zhou <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (1 + ε)-approximation to F0; Algorithm 2 (1 + ε)-approximation to F0, given an upper bound on the number of collisions; Algorithm 3 Parameterized streaming algorithm for distinct element estimation using robust statistics; Algorithm 4 Parameterized distinct element estimation over two-pass streams; Algorithm 5 Parameterized distinct element estimation over two-pass streams |
| Open Source Code | Yes | The code is publicly available at https://github.com/samsonzhou/DKLPWZ25. |
| Open Datasets | Yes | We used the CAIDA dataset (CAIDA, 2016), which consists of anonymized passive traffic traces collected from the high-speed monitor at the equinix-nyc data center. |
| Dataset Splits | No | The paper describes extracting 1 million events from a larger dataset for analysis and partitioning data across receiver IP addresses, but it does not specify explicit training, validation, or test splits for experimental reproduction. |
| Hardware Specification | Yes | Our empirical evaluations were performed with Python 3.11.5 on a 64-bit operating system on an Intel(R) Core(TM) i7-3770 CPU, with 16GB RAM and 4 cores with base clock 3.4GHz. |
| Software Dependencies | Yes | Our empirical evaluations were performed with Python 3.11.5 on a 64-bit operating system... |
| Experiment Setup | Yes | Correspondingly, we set our algorithm to also have accuracy O(ε) and compare the communication, across various values of ε = 1/2^p, with p ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. These results appear in Figure 2a. Finally, we studied the accuracy of our distributed protocol. We evaluated the output of our algorithm for ε = 1/2^p, across p ∈ {0, 1, 2, 3, 4, 5} and computed the error with respect to the true number of unique sender IP addresses, which totaled 42200. |
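
The parameter grids in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`epsilon_grid`, `relative_error`, `within_guarantee`) are hypothetical, and the check assumes one common reading of a (1 + ε)-approximation, namely that the estimate lies within a relative error of ε of the true count. Only the exponent ranges and the true count of 42200 come from the excerpt above.

```python
# True number of unique sender IP addresses, as reported in the setup.
TRUE_DISTINCT = 42200


def epsilon_grid(exponents):
    """Return eps = 1/2^p for each exponent p (exact in floating point)."""
    return [2.0 ** -p for p in exponents]


def relative_error(estimate, truth=TRUE_DISTINCT):
    """Relative error |estimate - truth| / truth of a distinct-count estimate."""
    return abs(estimate - truth) / truth


def within_guarantee(estimate, eps, truth=TRUE_DISTINCT):
    """Assumed reading of the guarantee: relative error is at most eps."""
    return relative_error(estimate, truth) <= eps


# Grid used for the communication comparison: p in {2, ..., 11}.
communication_eps = epsilon_grid(range(2, 12))

# Grid used for the accuracy study: p in {0, ..., 5}.
accuracy_eps = epsilon_grid(range(0, 6))
```

Given an estimate produced by any distinct-element protocol, `within_guarantee(estimate, eps)` reports whether it meets the assumed accuracy target for a given ε from either grid.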