ILLUSION: Unveiling Truth with a Comprehensive Multi-Modal, Multi-Lingual Deepfake Dataset
Authors: Kartik Thakral, Rishabh Ranjan, Akanksha Singh, Akshat Jain, Mayank Vatsa, Richa Singh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmarked image, audio, video, and multi-modal detection models, revealing key challenges such as performance degradation in multilingual and multi-modal contexts, vulnerability to real-world distortions, and limited generalization to zero-day attacks. By bridging synthetic and real-world complexities, ILLUSION provides a challenging yet essential platform for advancing deepfake detection research. |
| Researcher Affiliation | Academia | Kartik Thakral 1, Rishabh Ranjan 1, Akanksha Singh1,2, Akshat Jain1, Mayank Vatsa1, and Richa Singh1 1IIT Jodhpur, India, 2IISER Bhopal, India |
| Pseudocode | No | The paper describes the methodology and experimental procedures in detail through prose and figures but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The dataset is publicly available at https://www.iab-rubric.org/illusion-database. (This statement refers to the dataset, not the code for the methodologies described in the paper.) |
| Open Datasets | Yes | The dataset is publicly available at https://www.iab-rubric.org/illusion-database. |
| Dataset Splits | Yes | The ILLUSION dataset is composed of four sets. Sets A and B are partitioned into training and testing subsets in a 3:1 ratio, and the training data is further split 9:1 into train and validation subsets. To mitigate the skew between the Real and Fake classes in Set A, an additional 144 videos (18 subjects per sub-group) are borrowed from the CelebV-Text dataset. In contrast, Sets C and D are exclusively test sets. For all videos in Set A, 10 frames are extracted from each fake video and all frames from each real video. For Set B, 24 frames are taken from each generative model's video, and every sixth frame from real videos. Further, for synthetic images generated by the four text-to-image models, the corresponding real images are repeated four times. This approach addresses the imbalance between the dataset's real and fake samples. |
| Hardware Specification | Yes | In Set A, a total of 13 generation methods are used to produce identity swaps across image, audio, video, and audio-video synchronized modalities. This process is facilitated by 16 Nvidia A100 GPUs, each with 80 GB of memory. Set B is generated using 11 open-source and one closed-source generative model, utilizing two Nvidia A40 GPUs (48 GB each) and three Nvidia DGX stations, each equipped with four V100 GPUs (32 GB each). Set C comprises samples generated by two proprietary models, produced on two Nvidia 3090 GPUs (24 GB each). The benchmarking experiments are conducted on two A40 GPUs (48 GB each) and six A30 GPUs (24 GB each) in a multi-GPU setup. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., the 'MMS model', 'Text-to-Speech systems', 'Audalign', the 'Huggingface diffusers library', the 'AudioCraft library') but does not provide specific version numbers for these components, which are necessary for reproducibility. |
| Experiment Setup | Yes | For all protocols, the models are trained for 30 epochs with early stopping, and the models with the best validation accuracy are selected. We use the Adam optimizer with an initial learning rate of 0.0001. A batch size of 256 is used for distributed training. |
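The Set A/B partitioning described in the Dataset Splits row (3:1 train/test, then 9:1 train/validation) can be sketched as simple proportional arithmetic. This is an illustrative helper, not code from the paper; the actual split may be subject-disjoint rather than a plain count-based partition.

```python
def illusion_style_split_sizes(n_videos: int) -> tuple[int, int, int]:
    """Sketch of the reported split ratios for ILLUSION Sets A and B.

    Assumption (not stated in the paper): splits are computed by
    rounding proportional counts; the real split may be subject-level.
    """
    # Step 1: partition into training and testing subsets in a 3:1 ratio.
    n_train_full = round(n_videos * 3 / 4)
    n_test = n_videos - n_train_full
    # Step 2: split the training portion 9:1 into train and validation.
    n_train = round(n_train_full * 9 / 10)
    n_val = n_train_full - n_train
    return n_train, n_val, n_test

# Example: 1000 videos -> 675 train, 75 validation, 250 test.
```

For example, 1000 videos yield 750 for training overall and 250 for testing, with the 750 further divided into 675 train and 75 validation samples.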
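The training protocol in the Experiment Setup row (at most 30 epochs, early stopping, selecting the model with the best validation accuracy) amounts to the following selection logic. This is a minimal framework-free sketch; the patience value is an assumption, since the paper does not state it.

```python
def select_best_epoch(val_accuracies: list[float],
                      patience: int = 5,
                      max_epochs: int = 30) -> tuple[int, float]:
    """Hedged sketch of the reported protocol: train for up to
    `max_epochs`, stop early once validation accuracy has not improved
    for `patience` consecutive epochs (patience is assumed, not given),
    and keep the epoch with the best validation accuracy.
    """
    best_acc, best_epoch, since_improved = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        if acc > best_acc:
            # New best checkpoint: record it and reset the counter.
            best_acc, best_epoch, since_improved = acc, epoch, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # early stopping
    return best_epoch, best_acc
```

In the actual experiments this would wrap a distributed training loop (Adam, learning rate 1e-4, batch size 256), with a checkpoint saved whenever validation accuracy improves.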