Learning-Augmented Frequent Directions
Authors: Anders Aamand, Justin Chen, Siddharth Gollapudi, Sandeep Silwal, Hao Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally verify the performance of our learning-augmented algorithms on real data. Following prior work, we consider datasets containing numerous problem instances in a temporal order (each instance being either a sequence of items for frequency estimation or a sequence of matrices for matrix streaming). Using predictions trained on past data, we demonstrate the power of incorporating learned structure in our algorithms, achieving state-of-the-art performance in both settings. |
| Researcher Affiliation | Academia | Anders Aamand, University of Copenhagen; Justin Y. Chen, MIT; Siddharth Gollapudi, Independent; Sandeep Silwal, UW-Madison; Hao Wu, University of Waterloo |
| Pseudocode | Yes | Algorithm 1: Frequent Directions (AFD); Algorithm 2: Learning-Augmented Frequent Directions (ALFD); Algorithm 3: Robust Learning-Based Frequent Directions (ARLFD) |
| Open Source Code | No | "Both implementations are based on an existing implementation of Frequent Directions" (https://github.com/edoliberty/frequent-directions). The paper refers to an existing, third-party implementation that was used as a base, but does not explicitly state that the authors' own code for the described methodology is open-sourced. |
| Open Datasets | Yes | We use datasets from Indyk et al. (2019) and Indyk et al. (2021), prior works on learning-based low rank approximation not in the streaming setting. The Hyper dataset (Imamoglu et al., 2018) contains a sequence of hyperspectral images of natural scenes. We test our algorithm and baselines on the CAIDA (CAIDA, 2016) and AOL (Pass et al., 2006) datasets used in prior work (Hsu et al., 2019; Aamand et al., 2023). |
| Dataset Splits | No | The paper discusses evaluating performance on sequences of streams or matrices over time, but does not provide traditional training/test/validation splits with percentages or sample counts. For example: "The CAIDA dataset contains 50 minutes of internet traffic data, with a stream corresponding to the IP addresses associated with packets passing through an ISP over a minute of data." and "The AOL dataset contains 80 days of internet search query data with each stream corresponding to a day." |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. It only mentions funding sources for the authors. |
| Software Dependencies | No | The paper mentions various algorithms like Count Min, Count Sketch, and Misra-Gries, and states that recurrent neural networks were used as predictors, but it does not specify any software libraries or packages with version numbers used for implementation. |
| Experiment Setup | Yes | For both Frequent Directions and our learned variant, the space used by the streaming algorithm is twice the rank: this is a choice made in the Frequent Directions implementation to avoid running SVD on every insertion and thus improve the update time. We use half of the space for the learned projection component and half for the orthogonal component in our algorithm. For both datasets, we compare the learning-augmented algorithms by plotting the tradeoff between median error and space as well as error across the sequence of streams for a fixed space of 750 (see Figure 2). Randomized algorithms are averaged across 10 trials and one standard deviation is shaded. |
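For context on the base algorithm the rows above reference: a minimal NumPy sketch of plain Frequent Directions (Liberty's algorithm), using the same doubled-space trick the Experiment Setup row describes (allocating twice the sketch rank so that an SVD is not run on every insertion). This is an illustrative reconstruction, not the authors' code; the function name `frequent_directions` and parameter `ell` are our own, and the learned projection/orthogonal split of the paper's variant is not shown.

```python
import numpy as np

def frequent_directions(rows, ell):
    """Maintain a 2*ell x d sketch B of a row stream A so that
    B^T B approximates A^T A. Doubling the space to 2*ell means the
    SVD "shrink" step runs only once per ell-ish insertions, which is
    the update-time optimization mentioned in the experiment setup.
    Assumes d >= 2*ell for simplicity."""
    d = rows.shape[1]
    B = np.zeros((2 * ell, d))
    next_zero = 0  # index of the first all-zero row in B
    for row in rows:
        if next_zero == 2 * ell:
            # Sketch is full: shrink by subtracting the ell-th squared
            # singular value from all squared singular values.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s2 = np.maximum(s ** 2 - s[ell - 1] ** 2, 0.0)
            B = np.diag(np.sqrt(s2)) @ Vt
            next_zero = ell - 1  # rows ell-1 .. 2*ell-1 are now zero
        B[next_zero] = row
        next_zero += 1
    return B
```

With sketch parameter `ell`, this variant guarantees the covariance error bound `||A^T A - B^T B||_2 <= ||A||_F^2 / ell`, which is the quantity the paper's learned variant improves on by spending half the rows on a learned projection.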