Data Distillation: A Survey
Authors: Noveen Sachdeva, Julian McAuley
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this survey, we intend to provide a succinct overview of various data distillation frameworks across different data modalities. We start by presenting a formal data distillation framework in Section 2, and present technicalities of various existing techniques. We classify all data distillation techniques into four categories (see Figure 2 for a taxonomy) and provide a detailed empirical comparison of image distillation techniques in Table 1. |
| Researcher Affiliation | Academia | Noveen Sachdeva (EMAIL), Computer Science & Engineering, University of California, San Diego; Julian McAuley (EMAIL), Computer Science & Engineering, University of California, San Diego |
| Pseudocode | Yes | Algorithm 1: Control-flow of data distillation using naïve meta-matching (Equation (4)) |
| Open Source Code | No | The paper mentions an Open Review link for peer review but does not provide any explicit statements about releasing code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | Table 1: Comparison of data distillation methods. Each method (1) synthesizes the data summary on the train-set; (2) unless mentioned, trains a 128-width ConvNet (Gidaris & Komodakis, 2018) on the data summary; and (3) evaluates it on the test-set. Confidence intervals are obtained by training at least 5 networks on the data summary. Datasets: MNIST, CIFAR-10, CIFAR-100, Tiny ImageNet. ... Textual data is available in large amounts from sources like websites, news articles, academic manuscripts, etc., and is also readily accessible with datasets like the Common Crawl (https://commoncrawl.org/the-data/), which sizes up to almost 541TB. |
| Dataset Splits | No | The paper mentions synthesizing data summaries on the 'train-set' and evaluating on the 'test-set' for datasets like MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet (in Table 1). While these datasets typically have standard splits, the paper does not explicitly state the percentages, sample counts, or specific methodology used for its own training/test/validation splits. |
| Hardware Specification | No | The paper generally discusses data distillation frameworks but does not specify any hardware (e.g., GPU or CPU models, memory) used to run the compared experiments. |
| Software Dependencies | No | The paper does not provide any specific software names with version numbers used for its experiments. |
| Experiment Setup | No | Table 1 mentions that models were trained using a '128-width ConvNet (Gidaris & Komodakis, 2018)'. However, it does not provide concrete hyperparameter values such as learning rate, batch size, number of epochs, or specific optimizer settings used for the experimental evaluation. |
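The naïve meta-matching pseudocode flagged above (Algorithm 1 / Equation (4) of the survey) is a bilevel optimization: an inner loop trains a model on the synthetic summary, and an outer loop updates the summary so that the trained model performs well on the real data. The following is a minimal toy sketch of that control flow, not the survey's algorithm: it assumes a 1-D linear-regression "real" dataset, fixes the two synthetic inputs and distills only their labels, and approximates the outer gradient through the unrolled inner loop with finite differences (real implementations backpropagate through the inner loop instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" training set: y ≈ 3x plus noise (toy stand-in for a large dataset).
X_real = rng.uniform(-1.0, 1.0, size=200)
y_real = 3.0 * X_real + 0.1 * rng.normal(size=200)

# Synthetic summary: two fixed inputs; only their labels are distilled here.
X_syn = np.array([0.5, -0.5])

def inner_train(y_syn, steps=50, lr=0.5):
    """Inner loop: train a model (a single scalar weight w) on the summary."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * np.mean((w * X_syn - y_syn) * X_syn)
    return w

def meta_loss(y_syn):
    """Meta-objective: loss of the inner-trained model on the real data."""
    w = inner_train(y_syn)
    return np.mean((w * X_real - y_real) ** 2)

# Outer loop: gradient descent on the synthetic labels, with the gradient
# through the inner loop approximated by central finite differences.
y_syn, eps, lr_outer = np.zeros(2), 1e-5, 0.3
for _ in range(100):
    grad = np.array([
        (meta_loss(y_syn + eps * e) - meta_loss(y_syn - eps * e)) / (2 * eps)
        for e in np.eye(2)
    ])
    y_syn -= lr_outer * grad

w_final = inner_train(y_syn)
print(w_final)  # a model trained on only 2 synthetic points; should land near 3
```

Two synthetic points suffice here only because the model is a one-parameter linear fit; the survey's Table 1 applies the same matching principle with ConvNets and image summaries, where the inner loop is full network training.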