Multi-modal Deepfake Detection via Multi-task Audio-Visual Prompt Learning

Authors: Hui Miao, Yuanfang Guo, Zeming Liu, Yunhong Wang

AAAI 2025

Reproducibility assessment: each row below lists a variable, its result, and the supporting LLM response.
Research Type: Experimental. "Comprehensive experiments demonstrate the effectiveness and superior generalization ability of our method against the state-of-the-art methods." Relevant sections: Experiments, Experiment Settings, Datasets, Implementation Details, Comparisons with the Existing Methods, Intra-dataset Evaluation, Cross-manipulation Evaluation, Cross-dataset Evaluation, Ablation Study.
Researcher Affiliation: Academia. Hui Miao, Yuanfang Guo*, Zeming Liu, Yunhong Wang, School of Computer Science and Engineering, Beihang University, China.
Pseudocode: No. The paper describes its methods using mathematical formulations and textual descriptions but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets: Yes. "To validate the performance of our method, we evaluate our model on FakeAVCeleb (Khalid et al. 2021b)... we also evaluate the methods on a subset of KoDF (Kwon et al. 2021) to assess the cross-dataset generalization."
Dataset Splits: Yes. "Specifically, the training set consists of South Asian, East Asian, and American Caucasian, the validation set contains African, and the testing set contains European Caucasian. ... In addition, by following (Feng, Chen, and Owens 2023; Oorloff et al. 2024), we also evaluate the methods on a subset of KoDF (Kwon et al. 2021) to assess the cross-dataset generalization."
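The ethnicity-based split quoted above can be sketched as a simple partition. This is an illustrative assumption only: the group labels match the quote, but the `videos` input format and the `split_videos` helper are hypothetical, not the authors' data-loading code.

```python
# Hypothetical sketch of the FakeAVCeleb ethnicity-based split described in the paper.
SPLIT_GROUPS = {
    "train": {"South Asian", "East Asian", "American Caucasian"},
    "val": {"African"},
    "test": {"European Caucasian"},
}

def split_videos(videos):
    """Partition (video_id, ethnicity) pairs into train/val/test by ethnic group."""
    splits = {name: [] for name in SPLIT_GROUPS}
    for video_id, ethnicity in videos:
        for name, groups in SPLIT_GROUPS.items():
            if ethnicity in groups:
                splits[name].append(video_id)
    return splits
```

A disjoint split like this tests generalization across demographic groups rather than across random samples.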
Hardware Specification: Yes. "Note that our method is trained on a single RTX 3080ti GPU with 25G CPU memory on Ubuntu 20.04."
Software Dependencies: No. The paper mentions 'Ubuntu 20.04' as the operating system, but does not specify any programming languages, libraries, or other software dependencies with version numbers.
Experiment Setup: Yes. "The number of the visual prompt tokens Nvpt is set to 1. The weights of the CMFM loss are set to α = 2, β = 2, γ = 1. For the training process, we randomly sample segments from each video and utilize 15 training epochs with the Adam (Kingma and Ba 2014) optimizer. The initial learning rate is set to 0.0001, with a reduction by a factor of 10 occurring at the 12th epoch."
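The reported hyperparameters can be summarized in a minimal sketch. This is plain Python illustrating the learning-rate schedule and the weighted CMFM loss; the function names and the decomposition of the loss into three terms are assumptions for illustration, not the authors' implementation.

```python
# Reported settings: alpha=2, beta=2, gamma=1; lr 1e-4 reduced by 10x at epoch 12.
ALPHA, BETA, GAMMA = 2.0, 2.0, 1.0
INITIAL_LR = 1e-4
TOTAL_EPOCHS = 15
DECAY_EPOCH = 12  # the paper says the reduction occurs at the 12th epoch

def learning_rate(epoch):
    """Learning rate for a given 1-indexed epoch (step decay by a factor of 10)."""
    return INITIAL_LR / 10.0 if epoch >= DECAY_EPOCH else INITIAL_LR

def cmfm_loss(term1, term2, term3):
    """Weighted sum of three CMFM loss terms (the term split is an assumption)."""
    return ALPHA * term1 + BETA * term2 + GAMMA * term3
```

In a framework such as PyTorch, the same schedule would typically be expressed with the Adam optimizer plus a step learning-rate scheduler.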