Data Distillation: A Survey

Authors: Noveen Sachdeva, Julian McAuley

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this survey, we intend to provide a succinct overview of various data distillation frameworks across different data modalities. We start by presenting a formal data distillation framework in Section 2, and present technicalities of various existing techniques. We classify all data distillation techniques into four categories (see Figure 2 for a taxonomy) and provide a detailed empirical comparison of image distillation techniques in Table 1."
Researcher Affiliation | Academia | Noveen Sachdeva (EMAIL), Computer Science & Engineering, University of California, San Diego; Julian McAuley (EMAIL), Computer Science & Engineering, University of California, San Diego
Pseudocode | Yes | "Algorithm 1: Control-flow of data distillation using naïve meta-matching (Equation (4))"
Open Source Code | No | The paper provides an OpenReview link for peer review, but it makes no explicit statement about releasing code for the described methodology and includes no link to a code repository.
Open Datasets | Yes | Table 1: "Comparison of data distillation methods. Each method (1) synthesizes the data summary on the train-set; (2) unless mentioned, trains a 128-width ConvNet (Gidaris & Komodakis, 2018) on the data summary; and (3) evaluates it on the test-set. Confidence intervals are obtained by training at least 5 networks on the data summary." Datasets: MNIST, CIFAR-10, CIFAR-100, Tiny ImageNet. Also: "Textual data is available in large amounts from sources like websites, news articles, academic manuscripts, etc., and is also readily accessible with datasets like the Common Crawl (https://commoncrawl.org/the-data/), which sizes up to almost 541 TB."
Dataset Splits | No | The paper mentions synthesizing data summaries on the train-set and evaluating on the test-set for datasets such as MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet (Table 1). While these datasets have conventional standard splits, the paper does not explicitly state the percentages, sample counts, or split methodology used for its own training/validation/test splits.
Hardware Specification | No | The paper generally discusses
Software Dependencies | No | The paper does not name any specific software packages with version numbers used for its experiments.
Experiment Setup | No | Table 1 notes that models were trained using a "128-width ConvNet (Gidaris & Komodakis, 2018)". However, the paper does not report concrete hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings used for the experimental evaluation.
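The naïve meta-matching scheme referenced in the Pseudocode row is a bilevel optimization: an inner loop trains a model on the synthetic data summary, and an outer loop updates the summary so that the resulting model performs well on the real train-set. The sketch below is a minimal, hypothetical illustration of that control flow, not the paper's Algorithm 1: it uses toy linear-regression data, a closed-form ridge solver as the inner learner, and finite differences in place of backpropagating through the inner optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "real" dataset: noisy linear targets.
X_real = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_real = X_real @ w_true + 0.1 * rng.normal(size=200)


def inner_train(X_syn, y_syn, ridge=1e-3):
    """Inner loop: fit a model on the synthetic summary (closed-form
    ridge regression here, standing in for SGD on a neural network)."""
    d = X_syn.shape[1]
    return np.linalg.solve(X_syn.T @ X_syn + ridge * np.eye(d), X_syn.T @ y_syn)


def meta_loss(X_syn, y_syn):
    """Outer (meta-matching) objective: loss of the synthetically-trained
    model evaluated on the real train-set."""
    w = inner_train(X_syn, y_syn)
    return float(np.mean((X_real @ w - y_real) ** 2))


# Data summary: 10 synthetic points distilled from 200 real ones.
X_syn = rng.normal(size=(10, 5))
y_syn = rng.normal(size=10)

lr, eps = 0.05, 1e-5
init_loss = meta_loss(X_syn, y_syn)
for _ in range(300):
    base = meta_loss(X_syn, y_syn)
    # Finite-difference gradients w.r.t. the summary itself (a crude
    # stand-in for differentiating through the inner optimization).
    grad_X = np.zeros_like(X_syn)
    for i in range(X_syn.shape[0]):
        for j in range(X_syn.shape[1]):
            X_p = X_syn.copy()
            X_p[i, j] += eps
            grad_X[i, j] = (meta_loss(X_p, y_syn) - base) / eps
    grad_y = np.zeros_like(y_syn)
    for i in range(y_syn.shape[0]):
        y_p = y_syn.copy()
        y_p[i] += eps
        grad_y[i] = (meta_loss(X_syn, y_p) - base) / eps
    X_syn -= lr * grad_X
    y_syn -= lr * grad_y

final_loss = meta_loss(X_syn, y_syn)
print(f"meta-matching loss: {init_loss:.3f} -> {final_loss:.3f}")
```

The outer loss should decrease as the 10 synthetic points absorb the information in the 200 real ones; practical methods replace the finite differences with automatic differentiation through (truncated) inner training.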