Partitioned Learned Bloom Filters
Authors: Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results from both simulated and real-world datasets show significant performance improvements from our optimization approach over both the original learned Bloom filter constructions and previously proposed heuristic improvements. |
| Researcher Affiliation | Academia | Kapil Vaidya, CSAIL, MIT; Eric Knorr, School of Engineering and Applied Sciences, Harvard University; Tim Kraska, CSAIL, MIT; Michael Mitzenmacher, School of Engineering and Applied Sciences, Harvard University |
| Pseudocode | Yes | Algorithm 1 Finding optimal fpr s given thresholds; Algorithm 2 Using relaxed solution for the general problem; Algorithm 3 Solving the general problem |
| Open Source Code | No | The paper mentions using a third-party 'bloom-filter python package' but does not provide or state the release of its own source code for the described methodology. |
| Open Datasets | Yes | URLs: As in previous papers [Kraska et al. (2018), Dai & Shrivastava (2019)], we used the URL data set... EMBER: ...Ember (Endgame Malware Benchmark for Research) [Anderson & Roth (2018)] is an open source collection of 1.1M sha256 file hashes... |
| Dataset Splits | No | The paper states: 'We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F.' While tuning implies a validation process, it does not explicitly define a 'validation set' with specific proportions or a clear train/validation/test split. |
| Hardware Specification | Yes | All the runtime experiments in this subsection and the next subsection are measured using a quad-core Intel Core i7 CPU @ 2.80GHz with 16GB of memory. |
| Software Dependencies | No | The paper mentions using 'random forest classifier from sklearn' and the 'bloom-filter python package' and that algorithms are 'implemented in Python', but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | We use PLBF Alg.3 with DP algorithm discretization(N) set to 1000. We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F. ...We use five regions (k = 5) for both PLBF and Ada BF... |
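The construction the table refers to, a partitioned learned Bloom filter, divides the classifier's score range into k regions and backs each region with its own Bloom filter at a region-specific false-positive rate. The sketch below is a minimal illustration, not the authors' implementation: the `score_fn`, thresholds, and per-region FPRs are assumed to be given, whereas the paper chooses the thresholds and FPRs with a dynamic-programming optimization (Algorithms 1–3).

```python
import hashlib
import math

class BloomFilter:
    """Standard Bloom filter sized for a target false-positive rate."""
    def __init__(self, n_items, fpr):
        n_items = max(n_items, 1)
        # Optimal bit-array size m and hash count k for the target FPR.
        self.m = max(1, math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        # Derive k hash indexes by salting a single SHA-256 digest.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

class PLBF:
    """Partition the score range into regions; each region gets its own
    backup Bloom filter with a region-specific FPR (here fixed by hand,
    not optimized as in the paper)."""
    def __init__(self, keys, score_fn, thresholds, region_fprs):
        # thresholds: ascending region boundaries; region_fprs must have
        # len(thresholds) + 1 entries, one per region.
        self.score_fn = score_fn
        self.thresholds = thresholds
        buckets = [[] for _ in region_fprs]
        for key in keys:
            buckets[self._region(score_fn(key))].append(key)
        self.filters = []
        for bucket, fpr in zip(buckets, region_fprs):
            bf = BloomFilter(len(bucket), fpr)
            for key in bucket:
                bf.add(key)
            self.filters.append(bf)

    def _region(self, score):
        return sum(score >= t for t in self.thresholds)

    def __contains__(self, item):
        # Route the query to its region's backup filter.
        return item in self.filters[self._region(self.score_fn(item))]
```

Like any Bloom-filter construction, this guarantees no false negatives: every inserted key lands in exactly one region and is stored in that region's filter, so lookups for keys always succeed.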