ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

Authors: Wonjun Lee, Doehyeon Lee, Eugene Choi, Sangyoon Yu, Ashkan Yousefpour, Haon Park, Bumsub Ham, Suhyun Kim

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The paper is experimental, as stated by the authors: "Our experiments demonstrate that the ELITE evaluator aligns better with human judgments than existing automated evaluation methods. Furthermore, through extensive experiments, we validate the diversity and superior quality of the ELITE benchmark, which is designed using the ELITE evaluator." Table 3 presents experimental results of the ELITE benchmark across various proprietary and open-source VLMs.
Researcher Affiliation Collaboration 1Yonsei University 2Korea Institute of Science and Technology 3AIM Intelligence 4Seoul National University 5Sookmyung Women's University 6Kyung Hee University. Correspondence to: Suhyun Kim <EMAIL>. The affiliations mix academic institutions (Yonsei University, Korea Institute of Science and Technology, Seoul National University, Sookmyung Women's University, Kyung Hee University) with a company (AIM Intelligence), indicating an academia-industry collaboration.
Pseudocode No The paper includes mathematical formulas for ELITE and Strong REJECT evaluators, and a flowchart for benchmark construction (Figure 3), but no clearly labeled pseudocode or algorithm blocks are present.
Open Source Code No The paper does not contain any explicit statements about releasing source code for the ELITE evaluator or benchmark, nor does it provide links to code repositories. It mentions using existing models/platforms and image generation models but not the authors' own implementation code.
Open Datasets No The paper introduces the ELITE benchmark, a curated dataset constructed by filtering existing benchmarks and generating new image-text pairs. While it builds on publicly available benchmarks, the paper provides no concrete access information (e.g., a link, DOI, or repository name) for the ELITE benchmark itself, and the benchmark is not explicitly stated to be publicly accessible via a concrete download path.
Dataset Splits No The paper introduces the ELITE benchmark as an evaluation dataset of 4,587 image-text pairs. The benchmark functions purely as a test set: the paper specifies no internal training, validation, or testing splits for it, nor does it describe the splits on which the evaluated models were trained.
Hardware Specification No The paper mentions evaluating various VLMs (GPT-4o, Gemini-2.0, open-source models) but does not provide any specific details about the hardware used to conduct these evaluations or other experiments (e.g., specific GPU or CPU models, memory, or cloud computing instance types).
Software Dependencies No The paper mentions several models and platforms used (e.g., GPT-4o, Gemini-2.0, Llama-3.2-11B-Vision, Pixtral-12B, Flux AI, Grok 2) but does not provide specific version numbers for any programming languages, libraries, or software environments used for the authors' experiments.
Experiment Setup No The paper states that for open-source models "their original hyperparameters are used" (Section 4.1) and describes the benchmark construction pipeline, including filtering thresholds. However, it does not provide detailed experimental setup parameters, such as learning rates, batch sizes, number of epochs, or optimizer settings, for running the evaluations or generating responses, details typically needed for reproducibility.