Learning-augmented count-min sketches via Bayesian nonparametrics
Authors: Emanuele Dolera, Stefano Favaro, Stefano Peluchetti
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed to deal with the estimation of low-frequency tokens. An extension of our BNP approach to more general queries, such as range queries, is also discussed. |
| Researcher Affiliation | Collaboration | Emanuele Dolera EMAIL University of Pavia, Italy; Stefano Favaro EMAIL University of Torino and Collegio Carlo Alberto, Italy; Stefano Peluchetti EMAIL Cogent Labs, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1: Sampling K_l^c for l = 0, ..., c |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their own code. It mentions using the AX library, but this is a third-party tool: 'The optimization procedure is based on Letham et al. (2019), as implemented by the AX library. See https://ax.dev/ for details.' |
| Open Datasets | Yes | We conclude by presenting an application of the CMS-PYP to some textual datasets, for which the distribution of words is typically a power-law distribution (Clauset et al., 2009). In particular, we consider 4 textual datasets of increasing corpora size: the 20 Newsgroups dataset1, the Enron dataset2, the Wiki Text-103 dataset3 and the 1 Billion Word Language Model Benchmark (1BWLMB) dataset4. ... 1. http://qwone.com/~jason/20Newsgroups/ 2. https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/ 3. https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/ 4. https://www.statmt.org/lm-benchmark/ |
| Dataset Splits | No | The paper describes the total number of tokens and distinct tokens for the datasets used (e.g., 'The 20 Newsgroups dataset consists of m = 2765300 tokens with 53975 distinct tokens'), and parameters for generating synthetic data, but it does not specify training, testing, or validation splits for these datasets. |
| Hardware Specification | No | The paper describes experimental parameters related to hashing functions and dataset sizes but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'AX library' for optimization but does not provide a specific version number. It describes the mathematical form of hash functions but does not list any versioned software libraries or tools used for their implementation or other parts of the methodology. |
| Experiment Setup | Yes | For each dataset, the estimation of the prior's parameters (α, θ) is performed by means of (32) and (33) with m = 100000 for R = 25. The stochastic objective function (33) is evaluated a total of 50 times for each dataset. ... With regards to synthetic data, we consider datasets generated from Zipf's distributions with exponent c = 1.3, 1.6, 1.9, 2.2. Each dataset consists of m = 500000 tokens. We make use of a 2-universal hash family, and then assume the following pairs of hashing parameters: i) J = 320 and N = 2; ii) J = 160 and N = 4. ... Taking into account the increased corpora sizes we consider the following hashing parameters: i) J = 50000 and N = 2; ii) J = 35000 and N = 4 for Wiki Text-103; i) J = 140000 and N = 2; ii) J = 100000 and N = 4 for 1BWLMB. |
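To make the quoted setup concrete, the sketch below is a minimal illustration (not the authors' code) of the classical count-min sketch with a 2-universal hash family of the form h(x) = ((a·x + b) mod p) mod J, fed with tokens drawn from a truncated Zipf distribution. The vocabulary size, seeds, and the choice of Mersenne prime p are assumptions for the example; the paper's parameters J = 320, N = 2 and Zipf exponent c are taken from the setup row above.

```python
import random

def zipf_tokens(m, c, vocab=50000, seed=0):
    """Draw m token ids from a truncated Zipf distribution with exponent c.
    vocab and seed are illustrative choices, not from the paper."""
    rng = random.Random(seed)
    weights = [k ** (-c) for k in range(1, vocab + 1)]
    return rng.choices(range(1, vocab + 1), weights=weights, k=m)

class CountMinSketch:
    """Count-min sketch with N rows of J buckets, hashed with a
    2-universal family h(x) = ((a*x + b) mod p) mod J."""
    P = 2_147_483_647  # Mersenne prime, larger than any token id here

    def __init__(self, J, N, seed=0):
        rng = random.Random(seed)
        self.J, self.N = J, N
        # one (a, b) pair per row, with a != 0
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(N)]
        self.table = [[0] * J for _ in range(N)]

    def _bucket(self, row, x):
        a, b = self.params[row]
        return ((a * x + b) % self.P) % self.J

    def update(self, x):
        for n in range(self.N):
            self.table[n][self._bucket(n, x)] += 1

    def query(self, x):
        # classical CMS estimate: the min over rows never underestimates
        return min(self.table[n][self._bucket(n, x)] for n in range(self.N))

# Example mirroring the synthetic setting: J = 320, N = 2, Zipf exponent 1.3
cms = CountMinSketch(J=320, N=2, seed=1)
for token in zipf_tokens(m=10000, c=1.3):
    cms.update(token)
```

The CMS-PYP of the paper replaces the plain min-query above with a Bayesian nonparametric posterior estimate under a Pitman-Yor process prior, which is what improves estimation of the low-frequency tokens.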