Group Distributionally Robust Dataset Distillation with Risk Minimization
Authors: Saeed Vahidian, Mingyu Wang, Jianyang Gu, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we present numerical results to validate the efficacy of the proposed robust dataset distillation method. Implementation details are listed in Sec. H. Results on Robustness Settings: We first demonstrate the notable advantage offered by our proposed method, namely its robustness against various domain shifts. This property is assessed through multiple protocols. First, as suggested before, we present validation results on different partitions of the testing set. A clustering process divides the original testing set into multiple sub-sets. We test the performance on each of them and report the worst accuracy among the sub-sets, denoted as Cluster-min in Tab. 1, to demonstrate the robustness of the distilled data. |
| Researcher Affiliation | Academia | Saeed Vahidian¹, Mingyu Wang², Jianyang Gu³, Vyacheslav Kungurtsev⁴, Wei Jiang², Yiran Chen¹ — ¹Duke University, ²Zhejiang University, ³The Ohio State University, ⁴Czech Technical University |
| Pseudocode | Yes | Algorithm 1 Robust Dataset Distillation. Input: real training set T, synthetic set S, network with parameters θ, distilling objective L(S). While not converged: Subsample the training set T' ⊆ T. Cluster T' by the distillation set, i.e., define, for all t ∈ [\|T'\|]: C(z_t) = arg min_i ‖z_t − [S]_i‖² and c_i = {z_t ∈ T' : C(z_t) = i}. Solve the optimization problem in equation 2 to obtain θ*. Optimize the synthetic set S with L(S) based on the optimized parameters θ*. |
| Open Source Code | Yes | Additionally, we have attached the adopted source code in the supplementary material, which will further help understand the proposed method. |
| Open Datasets | Yes | We evaluate our method on the following datasets: SVHN (Yuval, 2011) is a digit-recognition dataset cropped from pictures of house number plates that is widely used for validating image recognition models. CIFAR-10 (Krizhevsky et al., 2009) is a subset of the Tiny Images dataset and consists of 60,000 32x32 color images. ImageNet-10 is a subset of ImageNet-1K (Deng et al., 2009) containing 10 classes, where each class has approximately 1,200 images at a resolution of 128x128. |
| Dataset Splits | Yes | The experiments are conducted under the IPC setting of 10. SVHN comprises three subsets: a training set, a testing set, and an extra set of 530,000 less challenging images that can aid in the training process. CIFAR-10 (Krizhevsky et al., 2009) is a subset of the Tiny Images dataset and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. The ImageNet-10 split follows the configuration outlined in IDC (Kim et al., 2022). |
| Hardware Specification | Yes | All the experiments are conducted on a single 24GB RTX 4090 GPU. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are mentioned in the paper. |
| Experiment Setup | Yes | The CVaR loss and the Cluster-min metric calculation both involve clustering. For the CVaR loss, Euclidean distance is adopted to evaluate the sample relationships. The real samples are assigned to the synthetic sample with the smallest distance. Because the CVaR loss calculation involves an ample number of samples, the mini-batch size during model updating is increased to 256. In cases where the IPC setting is less than 10, the cluster number in Eq. 2 is set equal to IPC. For larger IPCs, the cluster number is fixed at 10, with 10 random synthesized samples chosen as the clustering centers. The ratio α in the CVaR loss is set to 0.8. ... During the network updates, the robust objective proposed in this paper is employed for training. The network training is restricted to an early stage, using only 4,000 images for IDC and 1,000 images for GLaD. 100 steps of synthetic image update together with the network training form an iteration for IDC, and 2 steps for GLaD. For CIFAR-10 and the ImageNet subsets, we adopt 2,000 and 500 iterations, respectively, to complete the distilling process. |
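The clustering step quoted from Algorithm 1 — assigning each real sample to the nearest synthetic sample by Euclidean distance — can be sketched as follows. This is a minimal illustration, not the paper's released code; the function name and array shapes are assumptions.

```python
import numpy as np

def assign_clusters(real_batch, synthetic_set):
    """Assign each real sample to its nearest synthetic sample by
    squared Euclidean distance, mirroring C(z_t) = argmin_i ||z_t - [S]_i||^2
    from the quoted Algorithm 1 (a sketch, not the authors' code)."""
    # Flatten images (or features) to vectors: (N, D) and (M, D).
    real = np.asarray(real_batch).reshape(len(real_batch), -1)
    syn = np.asarray(synthetic_set).reshape(len(synthetic_set), -1)
    # Pairwise squared Euclidean distances, shape (N, M).
    dists = ((real[:, None, :] - syn[None, :, :]) ** 2).sum(-1)
    # Index of the nearest synthetic sample for each real sample.
    return dists.argmin(axis=1)
```

The resulting indices partition the sampled real data into the clusters c_i over which the robust objective in Eq. 2 is then evaluated.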
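The CVaR objective with α = 0.8 mentioned in the setup can be illustrated with a minimal sketch, under the common convention that CVaR averages the worst (1 − α) fraction of per-cluster losses; the exact formulation in the paper's Eq. 2 may differ, and the function below is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

def cvar_loss(per_cluster_losses, alpha=0.8):
    """CVaR sketch: average the worst (1 - alpha) fraction of per-cluster
    losses, so training focuses on the hardest clusters (alpha = 0.8 as
    reported in the paper; the averaging convention is an assumption)."""
    losses = np.sort(np.asarray(per_cluster_losses, dtype=float))[::-1]
    # Number of worst clusters to keep; at least one.
    k = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return losses[:k].mean()
```

With 10 clusters and α = 0.8, this averages the 2 highest per-cluster losses, which is what makes the distilled data robust to the worst-performing partition (the Cluster-min metric).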