Partitioned Learned Bloom Filters
Authors: Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results from both simulated and real-world datasets show significant performance improvements from our optimization approach over both the original learned Bloom filter constructions and previously proposed heuristic improvements. |
| Researcher Affiliation | Academia | Kapil Vaidya, CSAIL, MIT; Eric Knorr, School of Engineering and Applied Sciences, Harvard University; Tim Kraska, CSAIL, MIT; Michael Mitzenmacher, School of Engineering and Applied Sciences, Harvard University |
| Pseudocode | Yes | Algorithm 1 Finding optimal fpr s given thresholds; Algorithm 2 Using relaxed solution for the general problem; Algorithm 3 Solving the general problem |
| Open Source Code | No | The paper mentions using a third-party 'bloom-filter python package' but does not provide or state the release of its own source code for the described methodology. |
| Open Datasets | Yes | URLs: As in previous papers [Kraska et al. (2018), Dai & Shrivastava (2019)], we used the URL data set... EMBER: ...Ember (Endgame Malware Benchmark for Research) [Anderson & Roth (2018)] is an open source collection of 1.1M sha256 file hashes... |
| Dataset Splits | No | The paper states: 'We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F.' While tuning implies a validation process, it does not explicitly define a 'validation set' with specific proportions or a clear train/validation/test split. |
| Hardware Specification | Yes | All the runtime experiments in this subsection and the next subsection are measured using a quad-core Intel Core i7 CPU @ 2.80GHz with 16GB of memory. |
| Software Dependencies | No | The paper mentions using 'random forest classifier from sklearn' and the 'bloom-filter python package' and that algorithms are 'implemented in Python', but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | We use PLBF Alg.3 with DP algorithm discretization(N) set to 1000. We train the model on the entire key set and 40% of the non-key set. The thresholds and backup Bloom filters are then tuned using this model with the aim of achieving the fixed target F. ...We use five regions (k = 5) for both PLBF and Ada BF... |
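The construction the table refers to, a partitioned learned Bloom filter, divides the classifier's score range into k regions and backs each region with its own Bloom filter at a region-specific false-positive rate. The sketch below is a minimal illustration, not the authors' implementation: the `score_fn`, thresholds, and per-region FPRs are assumed to be given, whereas the paper chooses the thresholds and FPRs with a dynamic-programming optimization (Algorithms 1–3).

```python
import hashlib
import math

class BloomFilter:
    """Standard Bloom filter sized for a target false-positive rate."""
    def __init__(self, n_items, fpr):
        n_items = max(n_items, 1)
        # Optimal bit-array size m and hash count k for the target FPR.
        self.m = max(1, math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        # Derive k hash indexes by salting a single SHA-256 digest.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

class PLBF:
    """Partition the score range into regions; each region gets its own
    backup Bloom filter with a region-specific FPR (here fixed by hand,
    not optimized as in the paper)."""
    def __init__(self, keys, score_fn, thresholds, region_fprs):
        # thresholds: ascending region boundaries; region_fprs must have
        # len(thresholds) + 1 entries, one per region.
        self.score_fn = score_fn
        self.thresholds = thresholds
        buckets = [[] for _ in region_fprs]
        for key in keys:
            buckets[self._region(score_fn(key))].append(key)
        self.filters = []
        for bucket, fpr in zip(buckets, region_fprs):
            bf = BloomFilter(len(bucket), fpr)
            for key in bucket:
                bf.add(key)
            self.filters.append(bf)

    def _region(self, score):
        return sum(score >= t for t in self.thresholds)

    def __contains__(self, item):
        # Route the query to its region's backup filter.
        return item in self.filters[self._region(self.score_fn(item))]
```

Like any Bloom-filter construction, this guarantees no false negatives: every inserted key lands in exactly one region and is stored in that region's filter, so lookups for keys always succeed.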