Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Position: All Current Generative Fidelity and Diversity Metrics are Flawed
Authors: Ossi Räisä, Boris van Breugel, Mihaela van der Schaar
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders practical use of synthetic data. Our aim is to convince the research community to spend more effort in developing metrics, instead of models. Additionally, through analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used. The results of our evaluation in Tables 3 and 4 lead to our position. All of the fidelity and diversity metrics fail a large number of sanity checks, in many cases failing to measure even the basic property that they are supposed to measure, which we argue means that all current generative fidelity and diversity metrics are flawed. |
| Researcher Affiliation | Academia | 1 University of Helsinki, 2 University of Cambridge. |
| Pseudocode | No | The paper describes the methods and sanity checks in detail, including mathematical formulations for metrics and criteria for success. However, it does not contain any formally labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Our implementation code is available.1 1https://github.com/vanderschaarlab/position-fidelity-diversity-metrics-flawed |
| Open Datasets | No | The paper's evaluation relies on artificially generated data distributions (e.g., Gaussian, hypercubes, hyperspheres) for its sanity checks, as described in Section 4 and Appendix B. It does not use established, publicly available benchmark datasets that would require specific access information or citations. |
| Dataset Splits | No | The paper generates synthetic data and real data for comparison within its sanity checks, often specifying the sizes of these generated datasets (e.g., 'We set the real and synthetic dataset sizes to 1000'). However, it does not involve traditional training, validation, or test splits as typically used for model training and evaluation with benchmark datasets. |
| Hardware Specification | Yes | Specifically, computing the metrics of Kim et al. (2023) for 1000 real and synthetic samples of 2-dimensional Gaussian distributions with the original implementation did not finish in 30 minutes on an M1 MacBook Air. |
| Software Dependencies | No | Appendix A lists several third-party source code repositories for the metrics being evaluated (e.g., 'I-Prec, I-Rec, Density, Coverage: https://github.com/clovaai/generative-evaluation-prdc'). However, the paper does not specify the version numbers of the ancillary software (such as Python, or specific libraries used for data generation or their own implementation) for their experiments. |
| Experiment Setup | Yes | In this check, the real distribution is a standard d-dimensional Gaussian, and the synthetic distribution is a similar Gaussian with mean µ·1_d, where µ ∈ ℝ and 1_d is a d-dimensional vector of all ones. We vary d ∈ {1, 8, 64}, with µ ∈ [−6, 6] for d = 1, µ ∈ [−3, 3] for d = 8, and µ ∈ [−1, 1] for d = 64. ... We set the real and synthetic dataset sizes to 1000. |
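The Gaussian mean-shift setup quoted above can be sketched as follows. This is a minimal reconstruction assuming NumPy; the function name `mean_shift_datasets` and the illustrative empirical-mean comparison at the end are not from the paper, which instead feeds these datasets into the fidelity and diversity metrics under evaluation.

```python
import numpy as np

def mean_shift_datasets(d, mu, n=1000, seed=0):
    """Generate real/synthetic datasets for the Gaussian mean-shift check.

    Real data: standard d-dimensional Gaussian N(0, I_d).
    Synthetic data: N(mu * 1_d, I_d), the same Gaussian shifted by mu
    along the all-ones direction, as described in the experiment setup.
    """
    rng = np.random.default_rng(seed)
    real = rng.standard_normal((n, d))
    synthetic = rng.standard_normal((n, d)) + mu * np.ones(d)
    return real, synthetic

# Sweep mu over the range the paper uses for d = 1 ([-6, 6]); a sound
# fidelity metric should degrade as |mu| grows. The empirical mean gap
# below is only an illustrative stand-in for a real metric.
for mu in np.linspace(-6.0, 6.0, 5):
    real, synth = mean_shift_datasets(d=1, mu=mu)
    gap = abs(real.mean() - synth.mean())
```

The sanity check then asks whether each metric's score responds sensibly to this controlled, known failure mode (a pure mean shift) as µ moves away from 0.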