A Unifying Information-theoretic Perspective on Evaluating Generative Models
Authors: Alexis Fox, Samarth Swarup, Abhijin Adiga
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics. We set k to 5, k = 15, and omit the subscripts from the metric abbreviations for brevity hereafter. We follow the recommendation of Stein et al. (2024) to embed the images with the DINOv2-ViT-L/14 encoder (Oquab et al. 2024), which they claim provides a richer representation space than the commonly used Inception network, which may unfairly punish diffusion models. This motivates our generalized abbreviation FD. Dataset Descriptions. We use both ImageNet (Deng et al. 2009) and CIFAR-10 (Krizhevsky, Hinton et al. 2009) image datasets for our analysis. |
| Researcher Affiliation | Academia | Alexis Fox1, Samarth Swarup2, Abhijin Adiga2 1Duke University 2University of Virginia EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes mathematical formulations and derivations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/NSSAC/PrecisionRecallMetric |
| Open Datasets | Yes | We use both ImageNet (Deng et al. 2009) and CIFAR-10 (Krizhevsky, Hinton et al. 2009) image datasets for our analysis. |
| Dataset Splits | No | The paper mentions the composition of the datasets (e.g., "sampled training set for ImageNet contains 1000 classes with 100 images each, while CIFAR-10 has 10 classes with 4500 images each"), but it does not specify explicit training/validation/test splits used for the experiments. |
| Hardware Specification | No | The paper mentions the use of specific models like the "DINOv2-ViT-L/14 encoder" and the "DiT-XL/2 model" for image embedding and generation, but does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used to run these processes or experiments. |
| Software Dependencies | No | The paper mentions software components like the "DINOv2-ViT-L/14 encoder" and names various models, but it does not provide specific version numbers for any key software libraries, frameworks, or environments (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | No | The paper describes experimental conditions related to model parameters (e.g., "image sets generated at five levels of the CFG parameter", "100 classes were dropped at a time"). However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or general system-level training settings typically found in an experimental setup description. |
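For context on the k-NN-based precision/recall family of metrics the review discusses (where the k values 5 and 15 are used), the following is a minimal sketch of the classic construction from Kynkäänniemi et al. (2019), which metrics of this kind build on: each point's k-th-nearest-neighbor distance defines a ball, and coverage of one sample's points by the other sample's balls gives precision and recall. The function names and the brute-force distance computation here are illustrative assumptions, not the authors' code, and the paper's own unified information-theoretic metric differs from this baseline.

```python
import numpy as np

def knn_radii(X, k):
    """Distance from each point in X to its k-th nearest neighbor within X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)          # column 0 is the self-distance (0.0)
    return d[:, k]

def knn_precision_recall(real, fake, k=5):
    """k-NN precision/recall between embedded real and generated samples.

    precision: fraction of fake points inside at least one real k-NN ball
    recall:    fraction of real points inside at least one fake k-NN ball
    """
    r_real = knn_radii(real, k)
    r_fake = knn_radii(fake, k)
    d_rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    precision = float((d_rf <= r_real[:, None]).any(axis=0).mean())
    recall = float((d_rf <= r_fake[None, :]).any(axis=1).mean())
    return precision, recall
```

In practice the inputs would be encoder embeddings (e.g., from the DINOv2 encoder the paper uses) rather than raw pixels, and the pairwise-distance step would use an approximate-nearest-neighbor library at ImageNet scale.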