Learning-augmented count-min sketches via Bayesian nonparametrics
Authors: Emanuele Dolera, Stefano Favaro, Stefano Peluchetti
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed to deal with the estimation of low-frequency tokens. An extension of our BNP approach to more general queries, such as range queries, is also discussed. |
| Researcher Affiliation | Collaboration | Emanuele Dolera EMAIL University of Pavia, Italy; Stefano Favaro EMAIL University of Torino and Collegio Carlo Alberto, Italy; Stefano Peluchetti EMAIL Cogent Labs, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1: Sampling K_l^c for l = 0, ..., c |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their own code. It mentions using the AX library, but this is a third-party tool: 'The optimization procedure is based on Letham et al. (2019), as implemented by the AX library. See https://ax.dev/ for details.' |
| Open Datasets | Yes | We conclude by presenting an application of the CMS-PYP to some textual datasets, for which the distribution of words is typically a power-law distribution (Clauset et al., 2009). In particular, we consider 4 textual datasets of increasing corpora size: the 20 Newsgroups dataset1, the Enron dataset2, the Wiki Text-103 dataset3 and the 1 Billion Word Language Model Benchmark (1BWLMB) dataset4. ... 1. http://qwone.com/~jason/20Newsgroups/ 2. https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/ 3. https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/ 4. https://www.statmt.org/lm-benchmark/ |
| Dataset Splits | No | The paper describes the total number of tokens and distinct tokens for the datasets used (e.g., 'The 20 Newsgroups dataset consists of m = 2765300 tokens with 53975 distinct tokens'), and parameters for generating synthetic data, but it does not specify training, testing, or validation splits for these datasets. |
| Hardware Specification | No | The paper describes experimental parameters related to hashing functions and dataset sizes but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'AX library' for optimization but does not provide a specific version number. It describes the mathematical form of hash functions but does not list any versioned software libraries or tools used for their implementation or other parts of the methodology. |
| Experiment Setup | Yes | For each dataset, the estimation of the prior's parameters (α, θ) is performed by means of (32) and (33) with m = 100000 for R = 25. The stochastic objective function (33) is evaluated a total of 50 times for each dataset. ... With regards to synthetic data, we consider datasets generated from Zipf's distributions with exponent c = 1.3, 1.6, 1.9, 2.2. Each dataset consists of m = 500000 tokens. We make use of a 2-universal hash family, and then assume the following pairs of hashing parameters: i) J = 320 and N = 2; ii) J = 160 and N = 4. ... Taking into account the increased corpora sizes we consider the following hashing parameters: i) J = 50000 and N = 2; ii) J = 35000 and N = 4 for Wiki Text-103; i) J = 140000 and N = 2; ii) J = 100000 and N = 4 for 1BWLMB. |
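To make the quoted setup concrete, the sketch below is a minimal illustration (not the authors' code) of the classical count-min sketch with a 2-universal hash family of the form h(x) = ((a·x + b) mod p) mod J, fed with tokens drawn from a truncated Zipf distribution. The vocabulary size, seeds, and the choice of Mersenne prime p are assumptions for the example; the paper's parameters J = 320, N = 2 and Zipf exponent c are taken from the setup row above.

```python
import random

def zipf_tokens(m, c, vocab=50000, seed=0):
    """Draw m token ids from a truncated Zipf distribution with exponent c.
    vocab and seed are illustrative choices, not from the paper."""
    rng = random.Random(seed)
    weights = [k ** (-c) for k in range(1, vocab + 1)]
    return rng.choices(range(1, vocab + 1), weights=weights, k=m)

class CountMinSketch:
    """Count-min sketch with N rows of J buckets, hashed with a
    2-universal family h(x) = ((a*x + b) mod p) mod J."""
    P = 2_147_483_647  # Mersenne prime, larger than any token id here

    def __init__(self, J, N, seed=0):
        rng = random.Random(seed)
        self.J, self.N = J, N
        # one (a, b) pair per row, with a != 0
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(N)]
        self.table = [[0] * J for _ in range(N)]

    def _bucket(self, row, x):
        a, b = self.params[row]
        return ((a * x + b) % self.P) % self.J

    def update(self, x):
        for n in range(self.N):
            self.table[n][self._bucket(n, x)] += 1

    def query(self, x):
        # classical CMS estimate: the min over rows never underestimates
        return min(self.table[n][self._bucket(n, x)] for n in range(self.N))

# Example mirroring the synthetic setting: J = 320, N = 2, Zipf exponent 1.3
cms = CountMinSketch(J=320, N=2, seed=1)
for token in zipf_tokens(m=10000, c=1.3):
    cms.update(token)
```

The CMS-PYP of the paper replaces the plain min-query above with a Bayesian nonparametric posterior estimate under a Pitman-Yor process prior, which is what improves estimation of the low-frequency tokens.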