LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Authors: Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, Weijia Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In response, we introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities. LOKI encompasses video, image, 3D, text, and audio modalities, comprising 18K carefully curated questions across 26 subcategories with clear difficulty levels. The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks, allowing for a comprehensive analysis of LMMs. We evaluated 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities. |
| Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, 2 Shanghai AI Laboratory, 3 Sense Time Research, 4 The Chinese University of Hong Kong, 5 The Hong Kong University of Science and Technology, 6 SDS, SRIBD, The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | No | The paper describes its methodology and evaluation process through descriptive text, figures (overview, annotations, questions), and tables of results, but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper provides links to tools used for annotation (LabelU: https://github.com/opendatalab/labelU, LabelLLM: https://github.com/opendatalab/LabelLLM) and a project website (https://opendatalab.github.io/LOKI/) which may contain code or links to code. However, it does not contain an unambiguous statement that the authors are releasing the source code for the methodology described in this paper, nor does it provide a direct link to a repository containing their implementation code for LOKI benchmark generation or evaluation framework. |
| Open Datasets | Yes | More information about LOKI can be found at https://opendatalab.github.io/LOKI/. For the LOKI dataset, which is open-sourced, users must submit a download request to the authors to prevent misuse of the data. |
| Dataset Splits | Yes | LOKI encompasses video, image, 3D, text, and audio modalities, comprising 18K carefully curated questions across 26 subcategories with clear difficulty levels. LOKI classifies question difficulty based on human evaluation metrics. If all tested human users (more than three) answer correctly, the task is classified as easy; if more than 50% answer incorrectly, it is classified as hard; all other cases fall into the medium category. Table 4: Result decomposition across question difficulty levels — Easy (2470), Medium (1104), Hard (3938), Total (7512). |
| Hardware Specification | No | The paper describes the models evaluated (e.g., GPT-4o, Gemini-1.5-Pro) and the evaluation framework. However, it does not specify any concrete hardware details such as GPU models, CPU types, or memory used for running these experiments. |
| Software Dependencies | No | The paper mentions various models like GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, and refers to tools such as Label U and Label LLM. However, it does not provide specific version numbers for any ancillary software dependencies (e.g., Python, PyTorch, CUDA versions) used for their experimental setup. |
| Experiment Setup | Yes | Our evaluations are conducted in a zero-shot setting. In the following subsections, we first introduce our evaluation models and the evaluation protocols. For judgment, multiple-choice, and abnormal detail selection questions, we use the average accuracy rate as a metric. In addition to accuracy, we also calculate the Normalized Bias Index (NBI) based on recall rates to assess model bias. For open-ended questions regarding anomalous details, we use the GPT-4 model to assess the score of the responses. During inference, models are prompted with two random examples from the same domain as the questions, using different strategies. In CoT prompting, we manually craft thought chains with our human annotations to elicit reasoning steps from LMMs, while in FS prompting, we simply prepend examples with answers to the questions. |
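The difficulty-labeling rule quoted under "Dataset Splits" (all human testers correct → easy; more than 50% incorrect → hard; otherwise medium) can be sketched as a small function. This is an illustrative reconstruction, not the authors' code; the function name and signature are assumptions.

```python
def classify_difficulty(num_correct: int, num_testers: int) -> str:
    """Label a question's difficulty from human evaluation results.

    Follows the rule described in the paper:
      - every tester answers correctly        -> "easy"
      - more than 50% answer incorrectly      -> "hard"
      - all other cases                       -> "medium"

    The paper states that "more than three" human users are tested,
    so num_testers is expected to be at least 4.
    """
    if num_testers <= 0:
        raise ValueError("need at least one human tester")

    num_incorrect = num_testers - num_correct
    if num_incorrect == 0:
        return "easy"
    if num_incorrect / num_testers > 0.5:
        return "hard"
    return "medium"
```

For example, with four testers: 4/4 correct is easy, 1/4 correct (75% wrong) is hard, and 2/4 or 3/4 correct falls into medium, since exactly 50% incorrect does not exceed the hard threshold.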