Data Distillation: A Survey

Authors: Noveen Sachdeva, Julian McAuley

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this survey, we intend to provide a succinct overview of various data distillation frameworks across different data modalities. We start by presenting a formal data distillation framework in Section 2, and present technicalities of various existing techniques. We classify all data distillation techniques into four categories (see Figure 2 for a taxonomy) and provide a detailed empirical comparison of image distillation techniques in Table 1."
Researcher Affiliation | Academia | Noveen Sachdeva (EMAIL), Computer Science & Engineering, University of California, San Diego; Julian McAuley (EMAIL), Computer Science & Engineering, University of California, San Diego
Pseudocode | Yes | "Algorithm 1: Control-flow of data distillation using naïve meta-matching (Equation (4))"
Open Source Code | No | The paper provides an OpenReview link for peer review, but it makes no explicit statement about releasing code for the described methodology and includes no link to a code repository.
Open Datasets | Yes | Table 1: "Comparison of data distillation methods. Each method (1) synthesizes the data summary on the train-set; (2) unless mentioned, trains a 128-width ConvNet (Gidaris & Komodakis, 2018) on the data summary; and (3) evaluates it on the test-set. Confidence intervals are obtained by training at least 5 networks on the data summary." Datasets: MNIST, CIFAR-10, CIFAR-100, Tiny ImageNet. Also: "Textual data is available in large amounts from sources like websites, news articles, academic manuscripts, etc., and is also readily accessible with datasets like the Common Crawl (https://commoncrawl.org/the-data/), which sizes up to almost 541 TB."
Dataset Splits | No | The paper mentions synthesizing data summaries on the train-set and evaluating on the test-set for datasets such as MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet (Table 1). While these datasets have conventional standard splits, the paper does not explicitly state the percentages, sample counts, or split methodology used for its own training/validation/test splits.
Hardware Specification | No | The paper generally discusses
Software Dependencies | No | The paper does not name any specific software packages with version numbers used for its experiments.
Experiment Setup | No | Table 1 notes that models were trained using a "128-width ConvNet (Gidaris & Komodakis, 2018)". However, the paper does not report concrete hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings used for the experimental evaluation.
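The naïve meta-matching scheme referenced in the Pseudocode row is a bilevel optimization: an inner loop trains a model on the synthetic data summary, and an outer loop updates the summary so that the resulting model performs well on the real train-set. The sketch below is a minimal, hypothetical illustration of that control flow, not the paper's Algorithm 1: it uses toy linear-regression data, a closed-form ridge solver as the inner learner, and finite differences in place of backpropagating through the inner optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "real" dataset: noisy linear targets.
X_real = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_real = X_real @ w_true + 0.1 * rng.normal(size=200)


def inner_train(X_syn, y_syn, ridge=1e-3):
    """Inner loop: fit a model on the synthetic summary (closed-form
    ridge regression here, standing in for SGD on a neural network)."""
    d = X_syn.shape[1]
    return np.linalg.solve(X_syn.T @ X_syn + ridge * np.eye(d), X_syn.T @ y_syn)


def meta_loss(X_syn, y_syn):
    """Outer (meta-matching) objective: loss of the synthetically-trained
    model evaluated on the real train-set."""
    w = inner_train(X_syn, y_syn)
    return float(np.mean((X_real @ w - y_real) ** 2))


# Data summary: 10 synthetic points distilled from 200 real ones.
X_syn = rng.normal(size=(10, 5))
y_syn = rng.normal(size=10)

lr, eps = 0.05, 1e-5
init_loss = meta_loss(X_syn, y_syn)
for _ in range(300):
    base = meta_loss(X_syn, y_syn)
    # Finite-difference gradients w.r.t. the summary itself (a crude
    # stand-in for differentiating through the inner optimization).
    grad_X = np.zeros_like(X_syn)
    for i in range(X_syn.shape[0]):
        for j in range(X_syn.shape[1]):
            X_p = X_syn.copy()
            X_p[i, j] += eps
            grad_X[i, j] = (meta_loss(X_p, y_syn) - base) / eps
    grad_y = np.zeros_like(y_syn)
    for i in range(y_syn.shape[0]):
        y_p = y_syn.copy()
        y_p[i] += eps
        grad_y[i] = (meta_loss(X_syn, y_p) - base) / eps
    X_syn -= lr * grad_X
    y_syn -= lr * grad_y

final_loss = meta_loss(X_syn, y_syn)
print(f"meta-matching loss: {init_loss:.3f} -> {final_loss:.3f}")
```

The outer loss should decrease as the 10 synthetic points absorb the information in the 200 real ones; practical methods replace the finite differences with automatic differentiation through (truncated) inner training.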