Learning-Augmented Search Data Structures
Authors: Chunkai Fu, Brandon G. Nguyen, Jung Seo, Ryan Zesch, Samson Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically show that our learning-augmented search data structures outperform their corresponding traditional analogs on both synthetic and real-world datasets. |
| Researcher Affiliation | Academia | Equal contribution; all authors affiliated with Texas A&M University (email addresses redacted in the extraction). |
| Pseudocode | Yes | Algorithm 1 Learning-augmented skip list |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the methodology described. |
| Open Datasets | Yes | Skip lists on the CAIDA dataset. In the CAIDA datasets (CAIDA, 2016), the receiver IP addresses from one minute of internet flow data are extracted for testing, which contains over 650k unique IP addresses among the 30 million queries. The AOL dataset (G. Pass, 2006) features around 20M web queries collected from 650k users over three months. |
| Dataset Splits | Yes | For each dataset, we split the overall observation time into an early period, which serves as the training set for the oracle, and a later period, which serves as the query set for the skip list. The indices on the x-axis in Figure 3a mean: 10\|2: the first 10 minutes of data are used to create the reference (i.e., oracle) and the last 2 minutes are used to build and test the total query time using the former as reference; 2\|2: the 9th and 10th minutes of data are used as reference and the last 2 minutes are used for testing; 3\|3: the 1st, 2nd, and 3rd minutes of data are used to create the reference and the 4th, 5th, and 6th minutes are used for testing; 6\|6: the first 6 minutes are used to create the reference and the last 6 minutes are used for testing. |
| Hardware Specification | Yes | The computer used for benchmarking is a Lenovo ThinkPad P15 with an Intel Core i7-11800H @ 2.3GHz, 64GB RAM, and a 1TB solid-state drive. The tests were conducted under Ubuntu 22.04.3 LTS, GNOME version 42.9. The computer used for KD tree benchmarking is a desktop machine with an Intel Core i9-14900KF @ 3200MHz, 64GB RAM, and a 2TB solid-state drive. The tests were conducted in Windows 10 Enterprise, version 10.0.19045 Build 19045. |
| Software Dependencies | Yes | The tests were conducted under Ubuntu 22.04.3 LTS, GNOME version 42.9. The KD tree tests were conducted in Windows 10 Enterprise, version 10.0.19045 Build 19045. |
| Experiment Setup | Yes | In the synthetic datasets, both the classic and augmented skip lists are tested against different element counts and α values. In terms of the distribution of the synthetic datasets, the uniform distribution and a Zipfian distribution with α between 1.01 and 2, with query counts up to 4 million, are evaluated. We first generate datasets of 2^12 points in 3-dimensional space, with frequencies given by a fixed Zipfian distribution with parameters a = 5, b = 2, parameters at which our method greatly outperforms a standard KD tree. ... We query the tree 2^14 times, with point queries selected by the ground truth Zipfian distribution. We repeat this process 32 times, and report the median of the average query depth across all runs in Figure 4. |
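The paper's Algorithm 1 augments a skip list with a learned frequency oracle so that elements predicted to be queried often receive taller towers and are therefore found in fewer comparisons. The sketch below illustrates that general idea only; the `LearnedSkipList` class, the `oracle` callable, and the height formula `MAX_HEIGHT + floor(log2(p))` are illustrative assumptions, not the paper's exact construction.

```python
import math
import random

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # forward pointers, one per level

class LearnedSkipList:
    """Skip list whose tower heights are biased by a frequency oracle.

    A key with predicted query probability p gets a tower of height
    roughly MAX_HEIGHT + log2(p), so frequently queried keys sit near
    the top levels and are reached in fewer steps. Keys without a
    prediction fall back to the classic geometric (coin-flip) height.
    This is a hedged sketch of the learning-augmented idea, not the
    paper's exact algorithm.
    """

    MAX_HEIGHT = 32

    def __init__(self, oracle=None):
        self.oracle = oracle  # hypothetical: key -> predicted probability
        self.head = Node(None, self.MAX_HEIGHT)

    def _height(self, key):
        p = self.oracle(key) if self.oracle else None
        if p is None or p <= 0:
            # classic randomized height: fair coin flips
            h = 1
            while h < self.MAX_HEIGHT and random.random() < 0.5:
                h += 1
            return h
        # p in (0, 1]: larger p -> taller tower (log2(p) <= 0)
        return min(self.MAX_HEIGHT,
                   max(1, self.MAX_HEIGHT + int(math.floor(math.log2(p)))))

    def insert(self, key):
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for lvl in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node
        new = Node(key, self._height(key))
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def search(self, key):
        node = self.head
        for lvl in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
        node = node.next[0]
        return node is not None and node.key == key
```

In the paper's synthetic experiments the query frequencies follow a Zipfian distribution, so a natural (illustrative) oracle is a normalized `1/(rank+1)` weighting over the key universe; under such skewed queries the predicted-hot keys are answered high in the structure while rare keys still cost the usual logarithmic search.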