CGI: Identifying Conditional Generative Models with Example Images

Authors: Zhi Zhou, Hao-Zhe Tan, Peng-Xiao Song, Lan-Zhe Guo

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate PMI approach and promote related research, we provide a benchmark comprising 65 models and 9100 identification tasks. Extensive experimental and human evaluation results demonstrate that PMI is effective. For instance, 92% of models are correctly identified with significantly better FID scores when four example images are provided."
Researcher Affiliation | Academia | "1 National Key Laboratory for Novel Software Technology, Nanjing University, China"
Pseudocode | No | The paper describes its algorithms and methods using mathematical equations and textual descriptions (e.g., in Section 4, "Our Approach"), but does not present any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "To promote relevant research, we open-sourced a benchmark based on stable diffusion models with 65 conditional generative models and 9100 model identification tasks." This explicitly mentions open-sourcing a benchmark, but not the source code for the methodology (PMI) itself.
Open Datasets | Yes | "To evaluate the effectiveness of PMI and promote the related research, we developed a benchmark comprising 65 conditional generative models and 9100 model identification tasks. Extensive experiment results demonstrate that PMI is effective. Moreover, human and GPT evaluation results confirm both the validity of our evaluation protocol and the superior performance of PMI. Our main contributions can be summarized as follows: (c) We develop a benchmark with 65 models and 9100 identification tasks to evaluate model identification approaches. Extensive experiments and human evaluation results demonstrate that our proposal can achieve satisfactory model identification performance." "In this paper, we study a novel problem setting called Conditional Generative Model Identification, whose objective is to describe the functionalities of conditional generative models and enable the model to be accurately and efficiently identified for future users. To this end, we present a systematic solution including three key components. The Automatic Specification Assignment and Requirement Generation respectively project the model functionality and user requirements into a unified matching space. The Task-Specific Matching further builds the task-specific specification in the matching space to precisely identify the most suitable model. To promote relevant research, we open-sourced a benchmark based on stable diffusion models with 65 conditional generative models and 9100 model identification tasks."
Dataset Splits | Yes | "For model identification task construction, we created 14 evaluation prompts {p_1^τ, ..., p_14^τ} for each model on the model hub to generate testing images with random seeds in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, forming M^τ = 14 × 65 × 10 = 9100 different identification tasks {(x_i^τ, t_i)}_{i=1}^{M^τ}, where each example image x_i^τ is generated by model f_{t_i} and its best matching model index is t_i."
Hardware Specification | Yes | "Our experiments are conducted on Linux servers with NVIDIA A800 GPUs."
Software Dependencies | No | The paper states: "We adopt the official code in [Wu et al., 2023] to implement the RKME method and the official code in [Radford et al., 2021] to implement the pre-trained vision-language model." While it mentions using existing code, it does not specify version numbers for any software dependencies used in the implementation.
Experiment Setup | Yes | "The hyperparameter γ for calculating RBF kernel and similarity score is tuned from {0.005, 0.006, 0.007, 0.008, 0.009, 0.01, 0.02, 0.03, 0.04, 0.05} and set to 0.02." "For all experiments without additional notes, we assume that the specification is generated with developer-provided prompts."
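The task construction quoted under "Dataset Splits" fully determines the benchmark size: 14 prompts per model, 65 models, and 10 seeds give 9100 (image, model-index) tasks. A minimal sketch of that enumeration is below; the paper does not provide code, so the field names and structure here are illustrative, not the authors' implementation.

```python
from itertools import product

NUM_MODELS = 65     # conditional generative models on the model hub
NUM_PROMPTS = 14    # evaluation prompts per model
SEEDS = range(10)   # random seeds 0..9

# Each task pairs an example image (identified by the generating model,
# prompt, and seed) with the index t_i of the model that produced it.
tasks = [
    {"model_idx": m, "prompt_idx": p, "seed": s}
    for m, p, s in product(range(NUM_MODELS), range(NUM_PROMPTS), SEEDS)
]

assert len(tasks) == 9100  # 65 * 14 * 10
```

This also makes the "Dataset Splits" arithmetic easy to verify: the 9100 figure is a full cross product, not a sampled subset.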
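The "Experiment Setup" row reports tuning the RBF-kernel bandwidth γ and fixing it at 0.02. As a point of reference, the standard RBF (Gaussian) kernel similarity k(x, y) = exp(-γ·‖x - y‖²) can be sketched as follows; this is a generic illustration of the kernel with the reported γ, not the paper's PMI code, and the function name is ours.

```python
import numpy as np

def rbf_similarity(x, y, gamma=0.02):
    """RBF (Gaussian) kernel similarity: exp(-gamma * ||x - y||^2).

    gamma=0.02 matches the tuned value reported in the paper;
    larger gamma makes the similarity decay faster with distance.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Identical feature vectors score 1.0; similarity decays smoothly
# toward 0 as the squared distance between vectors grows.
print(rbf_similarity([1.0, 2.0], [1.0, 2.0]))  # -> 1.0
```

Small γ values such as those in the tuning grid (0.005-0.05) keep the kernel discriminative when feature distances are large, which is typical for image embeddings.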