ILLUSION: Unveiling Truth with a Comprehensive Multi-Modal, Multi-Lingual Deepfake Dataset
Authors: Kartik Thakral, Rishabh Ranjan, Akanksha Singh, Akshat Jain, Mayank Vatsa, Richa Singh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmarked image, audio, video, and multi-modal detection models, revealing key challenges such as performance degradation in multilingual and multi-modal contexts, vulnerability to real-world distortions, and limited generalization to zero-day attacks. By bridging synthetic and real-world complexities, ILLUSION provides a challenging yet essential platform for advancing deepfake detection research. |
| Researcher Affiliation | Academia | Kartik Thakral 1, Rishabh Ranjan 1, Akanksha Singh1,2, Akshat Jain1, Mayank Vatsa1, and Richa Singh1 1IIT Jodhpur, India, 2IISER Bhopal, India |
| Pseudocode | No | The paper describes the methodology and experimental procedures in detail through prose and figures but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The dataset is publicly available at https://www.iab-rubric.org/illusion-database. (This statement refers to the dataset, not the code for the methodologies described in the paper.) |
| Open Datasets | Yes | The dataset is publicly available at https://www.iab-rubric.org/illusion-database. |
| Dataset Splits | Yes | The ILLUSION dataset is composed of four sets. Sets A and B are partitioned into training and testing subsets in a 3:1 ratio, and the training data is further split 9:1 into train and validation subsets. To mitigate the skew between the Real and Fake classes in Set A, an additional 144 videos (18 subjects per sub-group) are borrowed from the CelebV-Text dataset. In contrast, Sets C and D are exclusively test sets. For all videos in Set A, 10 frames are extracted from each fake video and all frames from each real video. For Set B, 24 frames are taken from each generative model's video, and every sixth frame from real videos. Further, for synthetic images generated by the four text-to-image models, the corresponding real images are repeated four times. This approach addresses the imbalance between the dataset's real and fake samples. |
| Hardware Specification | Yes | In Set A, a total of 13 generation methods are used to produce identity swaps across image, audio, video, and audio-video synchronized modalities. This process is facilitated by 16 Nvidia A100 GPUs, each with 80 GB of memory. Set B is generated using 11 open-source and one closed-source generative model, utilizing two Nvidia A40 GPUs (48 GB each) and three Nvidia DGX stations, each equipped with four V100 GPUs (32 GB each). Set C comprises samples generated by two proprietary models, produced on two Nvidia 3090 GPUs (24 GB each). The benchmarking experiments are conducted on two A40 GPUs (48 GB each) and six A30 GPUs (24 GB each) in a multi-GPU setup. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., the 'MMS model', 'Text-to-Speech systems', 'Audalign', the 'Huggingface diffusers library', the 'AudioCraft library') but does not provide specific version numbers for these components, which are necessary for reproducibility. |
| Experiment Setup | Yes | For all protocols, the models are trained for 30 epochs with early stopping, and the models with the best validation accuracy are selected. We use the Adam optimizer with an initial learning rate of 0.0001. A batch size of 256 is used for distributed training. |
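The Set A/B partitioning described in the Dataset Splits row (3:1 train/test, then 9:1 train/validation) can be sketched as simple proportional arithmetic. This is an illustrative helper, not code from the paper; the actual split may be subject-disjoint rather than a plain count-based partition.

```python
def illusion_style_split_sizes(n_videos: int) -> tuple[int, int, int]:
    """Sketch of the reported split ratios for ILLUSION Sets A and B.

    Assumption (not stated in the paper): splits are computed by
    rounding proportional counts; the real split may be subject-level.
    """
    # Step 1: partition into training and testing subsets in a 3:1 ratio.
    n_train_full = round(n_videos * 3 / 4)
    n_test = n_videos - n_train_full
    # Step 2: split the training portion 9:1 into train and validation.
    n_train = round(n_train_full * 9 / 10)
    n_val = n_train_full - n_train
    return n_train, n_val, n_test

# Example: 1000 videos -> 675 train, 75 validation, 250 test.
```

For example, 1000 videos yield 750 for training overall and 250 for testing, with the 750 further divided into 675 train and 75 validation samples.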
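The training protocol in the Experiment Setup row (at most 30 epochs, early stopping, selecting the model with the best validation accuracy) amounts to the following selection logic. This is a minimal framework-free sketch; the patience value is an assumption, since the paper does not state it.

```python
def select_best_epoch(val_accuracies: list[float],
                      patience: int = 5,
                      max_epochs: int = 30) -> tuple[int, float]:
    """Hedged sketch of the reported protocol: train for up to
    `max_epochs`, stop early once validation accuracy has not improved
    for `patience` consecutive epochs (patience is assumed, not given),
    and keep the epoch with the best validation accuracy.
    """
    best_acc, best_epoch, since_improved = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        if acc > best_acc:
            # New best checkpoint: record it and reset the counter.
            best_acc, best_epoch, since_improved = acc, epoch, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # early stopping
    return best_epoch, best_acc
```

In the actual experiments this would wrap a distributed training loop (Adam, learning rate 1e-4, batch size 256), with a checkpoint saved whenever validation accuracy improves.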