LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Authors: Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, Weijia Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In response, we introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities. LOKI encompasses video, image, 3D, text, and audio modalities, comprising 18K carefully curated questions across 26 subcategories with clear difficulty levels. The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks, allowing for a comprehensive analysis of LMMs. We evaluated 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities. |
| Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, 2 Shanghai AI Laboratory, 3 Sense Time Research, 4 The Chinese University of Hong Kong, 5 The Hong Kong University of Science and Technology, 6 SDS, SRIBD, The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | No | The paper describes its methodology and evaluation process through descriptive text, figures (overview, annotations, questions), and tables of results, but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper provides links to tools used for annotation (LabelU: https://github.com/opendatalab/labelU, LabelLLM: https://github.com/opendatalab/LabelLLM) and a project website (https://opendatalab.github.io/LOKI/) which may contain code or links to code. However, it does not contain an unambiguous statement that the authors are releasing the source code for the methodology described in this paper, nor does it provide a direct link to a repository containing their implementation code for LOKI benchmark generation or evaluation framework. |
| Open Datasets | Yes | More information about LOKI can be found at https://opendatalab.github.io/LOKI/. For the LOKI dataset, which is open-sourced, users must submit a download request to the authors to prevent misuse of the data. |
| Dataset Splits | Yes | LOKI encompasses video, image, 3D, text, and audio modalities, comprising 18K carefully curated questions across 26 subcategories with clear difficulty levels. LOKI classifies question difficulty based on human evaluation metrics. If all tested human users (more than three) answer correctly, the task is classified as easy; if more than 50% answer incorrectly, it is classified as hard; all other cases fall into the medium category. Table 4: Result decomposition across question difficulty levels — Easy (2470), Medium (1104), Hard (3938), Total (7512). |
| Hardware Specification | No | The paper describes the models evaluated (e.g., GPT-4o, Gemini-1.5-Pro) and the evaluation framework. However, it does not specify any concrete hardware details such as GPU models, CPU types, or memory used for running these experiments. |
| Software Dependencies | No | The paper mentions various models like GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, and refers to tools such as Label U and Label LLM. However, it does not provide specific version numbers for any ancillary software dependencies (e.g., Python, PyTorch, CUDA versions) used for their experimental setup. |
| Experiment Setup | Yes | Our evaluations are conducted in a zero-shot setting. In the following subsections, we first introduce our evaluation models and the evaluation protocols. For judgment, multiple-choice, and abnormal detail selection questions, we use the average accuracy rate as a metric. In addition to accuracy, we also calculate the Normalized Bias Index (NBI) based on recall rates to assess model bias. For open-ended questions regarding anomalous details, we use the GPT-4 model to assess the score of the responses. During inference, models are prompted with two random examples from the same domain as the questions, using different strategies. In CoT prompting, we manually craft thought chains with our human annotations to elicit reasoning steps from LMMs, while in FS prompting, we simply prepend examples with answers to the questions. |
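The difficulty-labeling rule quoted under "Dataset Splits" (all human testers correct → easy; more than 50% incorrect → hard; otherwise medium) can be sketched as a small function. This is an illustrative reconstruction, not the authors' code; the function name and signature are assumptions.

```python
def classify_difficulty(num_correct: int, num_testers: int) -> str:
    """Label a question's difficulty from human evaluation results.

    Follows the rule described in the paper:
      - every tester answers correctly        -> "easy"
      - more than 50% answer incorrectly      -> "hard"
      - all other cases                       -> "medium"

    The paper states that "more than three" human users are tested,
    so num_testers is expected to be at least 4.
    """
    if num_testers <= 0:
        raise ValueError("need at least one human tester")

    num_incorrect = num_testers - num_correct
    if num_incorrect == 0:
        return "easy"
    if num_incorrect / num_testers > 0.5:
        return "hard"
    return "medium"
```

For example, with four testers: 4/4 correct is easy, 1/4 correct (75% wrong) is hard, and 2/4 or 3/4 correct falls into medium, since exactly 50% incorrect does not exceed the hard threshold.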