GMValuator: Similarity-based Data Valuation for Generative Models

Authors: Jiaxi Yang, Wenlong Deng, Benlin Liu, Yangsibo Huang, James Y Zou, Xiaoxiao Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | GMValuator is extensively evaluated on benchmark and high-resolution datasets and various mainstream generative architectures to demonstrate its effectiveness.
Researcher Affiliation | Academia | (1) University of British Columbia; (2) University of Washington; (3) Princeton University; (4) Stanford University; (5) Vector Institute
Pseudocode | Yes | A concise summary of key notations and Algorithm 1, detailing the pipeline of GMValuator in Sec. A and Sec. B.
Open Source Code | Yes | Our code is available at: https://github.com/ubc-tea/GMValuator.
Open Datasets | Yes | The generation tasks are conducted on benchmark datasets (i.e., MNIST LeCun et al. (1998) and CIFAR Krizhevsky et al. (2009)), face recognition dataset (i.e., CelebA Liu et al. (2018)), high-resolution image datasets with size 512×512 and 1024×1024 (i.e., AFHQ Choi et al. (2020), FFHQ Karras et al. (2019)), the large-scale dataset with 1,000 classes and 14,197,122 images (i.e., ImageNet Deng et al. (2009)), and text-to-image dataset (i.e., Naruto Cervenka (2022)).
Dataset Splits | Yes | We support this by partitioning a class of CIFAR-10 (the class is plane here) into two non-overlapped subsets, denoted as X_{v1} and X_{v2}. Next, we keep X_{v1} as non-training data and use X_{v2} as training data to train a BigGAN Brock et al. (2018) and generate dataset X̂. If our assumption holds, the generated data will be more similar to the training data X_{v2}.
Hardware Specification | Yes | GPU: One RTX 3080 (10GB); CPU: 12 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
Software Dependencies | No | The paper mentions using specific tools/libraries like CLIP, MANIQA, LPIPS, DreamSim, and Product Quantization, but does not provide specific version numbers for these or for the underlying programming languages/frameworks.
Experiment Setup | Yes | We report the averaged ρ over the generated datasets (the data size m=100) on different choices of k in Table 2.
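
The split described under Dataset Splits can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`split_disjoint`, `mean_nearest_distance`), the random feature vectors, the equal-halves split, and the Euclidean distance are all assumptions made for illustration. The "generated" points here are simply perturbed copies of the training subset, so the example only shows how the comparison could be set up and is not evidence for the paper's claim.

```python
import math
import random


def split_disjoint(items, seed=0):
    """Shuffle a copy of `items` and cut it into two non-overlapping halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def mean_nearest_distance(queries, references):
    """Average Euclidean distance from each query to its closest reference."""
    total = 0.0
    for q in queries:
        total += min(math.dist(q, r) for r in references)
    return total / len(queries)


# Toy stand-in for one class: 200 random 8-dimensional feature vectors.
rng = random.Random(1)
class_features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]

# x_v1 plays the held-out (non-training) subset, x_v2 the training subset.
x_v1, x_v2 = split_disjoint(class_features)

# Stand-in for generated data: small perturbations of training points.
# A real check would use samples drawn from the trained generative model.
x_hat = [[v + rng.gauss(0, 0.05) for v in p] for p in x_v2[:50]]

d_v1 = mean_nearest_distance(x_hat, x_v1)
d_v2 = mean_nearest_distance(x_hat, x_v2)
print(f"mean NN distance to X_v1: {d_v1:.3f}, to X_v2: {d_v2:.3f}")
```

A smaller mean distance to `x_v2` than to `x_v1` is the pattern the quoted assumption predicts. In practice the Euclidean distance would be replaced by whichever similarity measure the evaluation uses.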