Data Summarization via Bilevel Optimization

Authors: Zalán Borsos, Mojmír Mutný, Marco Tagliasacchi, Andreas Krause

JMLR 2024

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the advantage of our framework over other data summarization techniques in extensive experimental studies, over a wide range of models and resource-constrained settings, such as continual learning, streaming and batch active learning and dictionary selection for compressed sensing. In this section, we demonstrate the flexibility and effectiveness of our framework for a wide range of models and various settings. We start by evaluating the practical variants of Algorithm 1 proposed in Section 3.5, and we compare our method to model-specific coreset constructions and other data summarization strategies in Section 5.2. We then study our approach in the memory-constrained settings of continual learning and streaming in Sections 5.3 and 5.4, of dictionary selection in Section 5.6, and the human-resource-constrained setting of batch active learning in Section 5.5.
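The framework the excerpt refers to selects a small weighted subset by solving a cardinality-constrained bilevel problem. A sketch of the standard bilevel coreset formulation (notation assumed here, not quoted from the report):

```latex
\min_{w \ge 0,\ \|w\|_0 \le m} \; \mathcal{L}\bigl(\theta^*(w)\bigr)
\qquad \text{s.t.} \qquad
\theta^*(w) \in \arg\min_{\theta} \sum_{i=1}^{n} w_i\, \ell\bigl(f_\theta(x_i), y_i\bigr) + \lambda \|\theta\|_2^2,
```

where the outer objective \(\mathcal{L}\) is the loss of the inner solution evaluated on the full data set, and \(\lambda\) is the inner regularizer.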
Researcher Affiliation | Collaboration | Zalán Borsos (Department of Computer Science, ETH Zurich); Mojmír Mutný (Department of Computer Science, ETH Zurich); Marco Tagliasacchi (Google Research); Andreas Krause (Department of Computer Science, ETH Zurich)
Pseudocode | Yes | Algorithm 1: Bilevel Coreset (BiCo) ... Algorithm 2: Bilevel Coreset via Regularization ... Algorithm 3: Streaming BiCo with Merge-Reduce Buffer
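The paper's algorithms themselves are not reproduced here; as a minimal library-free illustration of the bilevel coreset idea, the sketch below runs forward-greedy selection with a closed-form ridge-regression inner solver (the function names, greedy strategy, and data are illustrative, not the authors' implementation):

```python
import numpy as np

def ridge_fit(X, y, lam=1e-7):
    """Inner problem: regularized least squares on the selected subset."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def outer_loss(theta, X, y):
    """Outer objective: loss of the inner solution on the FULL data set."""
    r = X @ theta - y
    return float(r @ r) / len(y)

def greedy_bilevel_coreset(X, y, m, lam=1e-7):
    """Forward-greedy sketch of bilevel coreset selection: at each step,
    add the point whose inclusion most reduces the full-data loss of
    the inner solver refit on the enlarged subset."""
    selected = []
    for _ in range(m):
        best_j, best_val = None, np.inf
        for j in range(len(X)):
            if j in selected:
                continue
            idx = selected + [j]
            theta = ridge_fit(X[idx], y[idx], lam)
            val = outer_loss(theta, X, y)
            if val < best_val:
                best_j, best_val = j, val
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=60)
S = greedy_bilevel_coreset(X, y, m=5)
```

The greedy loop refits the inner problem from scratch for every candidate; the paper's practical variants avoid this cost via implicit differentiation and incremental updates.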
Open Source Code | No | The paper does not explicitly state that the authors release their source code, nor does it provide a direct link to a code repository for the described methodology. It mentions using the library of Novak et al. (2020) and a GitHub link for one of the data sets, but not for the authors' own implementation.
Open Datasets | Yes | We choose four standard binary classification data sets (Dua and Graff, 2017; Uzilov et al., 2006) from the LIBSVM library... For MNIST, we use... For CIFAR-10, we use... For SVHN we only use the train split... The Spoken Digit data set (Jackson, 2016) (2700 utterances, 10 classes) and Speech Commands V2 (Warden, 2018) (85000 utterances, 35 classes) data sets...
Dataset Splits | Yes | We split CIFAR-10 into a train and validation set, where the validation set is a randomly chosen 10% of the original training set... for SVHN we only use the train split, containing approximately 73000 images... For PMNIST, we use a fully connected net... For SMNIST and SCIFAR-10, we use a CNN... We fix the replay memory size m = 100 for tasks derived from MNIST. For SCIFAR-10, we then set the memory size to m = 200... The starting labeled pools are guaranteed to contain at least one sample from each class.
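The 90/10 train/validation split quoted in this row is straightforward to reproduce; a minimal sketch (the seed and function name are illustrative, not taken from the paper):

```python
import numpy as np

def train_val_split(n, val_frac=0.1, seed=0):
    """Hold out a random val_frac of the training indices as validation,
    mirroring the 10% CIFAR-10 validation split described above."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_val = int(round(val_frac * n))
    return perm[n_val:], perm[:n_val]

train_idx, val_idx = train_val_split(50_000)  # CIFAR-10 train-set size
```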
Hardware Specification | Yes | We calculate the corresponding NTKs without batch normalization and pooling with the library of Novak et al. (2020) on a single GeForce GTX 1080 Ti GPU, whereas the coreset selection is performed on a single CPU.
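The paper computes NTKs with the library of Novak et al. (2020); as a dependency-free illustration of what an (empirical) NTK is, the sketch below computes K(x, x') = ⟨∇θ f(x), ∇θ f(x')⟩ for a tiny one-hidden-layer ReLU net with explicit gradients (the architecture and sizes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 16
W = rng.normal(size=(h, d)) / np.sqrt(d)  # hidden-layer weights
a = rng.normal(size=h) / np.sqrt(h)       # output weights

def param_grad(x):
    """Flattened gradient of f(x) = a . relu(W x) w.r.t. (W, a)."""
    z = W @ x
    act = np.maximum(z, 0.0)
    dz = (z > 0).astype(float)
    dW = np.outer(a * dz, x)  # df/dW_kj = a_k 1[z_k > 0] x_j
    da = act                  # df/da_k  = relu(z_k)
    return np.concatenate([dW.ravel(), da])

def empirical_ntk(X):
    """Gram matrix of parameter gradients: K[i, j] = <grad f(x_i), grad f(x_j)>."""
    J = np.stack([param_grad(x) for x in X])
    return J @ J.T

X = rng.normal(size=(8, d))
K = empirical_ntk(X)
```

Being a Gram matrix of gradients, K is symmetric positive semi-definite by construction.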
Software Dependencies | No | The paper mentions specific optimizers such as Adam and SGD and references the library of Novak et al. (2020), but it does not provide version numbers for these components or for general languages and frameworks such as Python or PyTorch.
Experiment Setup | Yes | All variants in Section 3.5 use a λ = 10⁻⁷ regularizer in the inner problem. The inner optimization is performed with Adam using a step size of 0.01 as follows: all variants start with an optimization phase on the initial point set with 5·10⁴ iterations; then, after each step, an additional 10⁴ GD iterations are performed... We use weight decay of 5·10⁻⁴ and an initial learning rate of 0.1 cosine-annealed to 0 over 300·n/m epochs, where n is the full data set size and m is the subset size. Additionally, we use dropout with a rate of 0.4 for SVHN. For CIFAR-10, we use the standard data augmentation pipeline of random cropping and horizontal flipping...
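The learning-rate schedule quoted in this row (initial rate 0.1 cosine-annealed to 0 over 300·n/m epochs) can be sketched directly; the function name and the n, m values below are illustrative:

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.1):
    """Cosine annealing from lr0 down to 0 over total_epochs."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))

n, m = 50_000, 500       # full-data and subset sizes (illustrative)
total = 300 * n // m     # 300 * n / m epochs, per the quoted setup
```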