Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models

Authors: Guosheng Zhang, Keyao Wang, Haixiao Yue, Ajian Liu, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang

AAAI 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
Researcher Affiliation | Collaboration | 1 Department of Computer Vision Technology (VIS), Baidu Inc.; 2 CBSR&MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA)
Pseudocode | Yes |
Algorithm 1: Spoof-aware Captioning and Filtering
Input: dataset D = {(I^i, Y^i)}_{i=1}^N, where I^i ∈ {I_R, I_F} and Y^i ∈ {Y_R, Y_F}; attack types F = {print, replay, mask, mannequin}; keywords K = {"paper": Y_print, "screen": Y_replay, ...}; general captioner C_G
Output: dataset D_cap = {(I^i, Y^i, T^i)}_{i=1}^N, where T^i ∈ {T_R, T_S}
1:  Captioning: T_F = C_G(I_F)
2:  Initialize empty dataset D_S
3:  for each sample (I_F^i, Y_F^i, T_F^i) do
4:      for each keyword k in K do
5:          if k in T_F^i and K[k] matches Y_F^i then
6:              D_S ← D_S ∪ {(I_F^i, Y_F^i, T_F^i)}
7:          end if
8:      end for
9:  end for
10: Fine-tune C_G with D_S to obtain C_S
11: Captioning: T_R = C_G(I_R) and T_S = C_S(I_F)
12: D_cap ← {(I_R, Y_R, T_R)} ∪ {(I_F, Y_F, T_S)}
13: return D_cap
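The keyword-matching filter at the core of Algorithm 1 (lines 3-9) can be sketched in Python. This is a minimal illustration under assumed names: `KEYWORDS`, `filter_spoof_captions`, and the sample tuples are hypothetical stand-ins, not the paper's implementation, and only two of the keyword-to-attack mappings from the listing are shown.

```python
# Sketch of Algorithm 1's filtering step: keep a fake-face sample only if
# its generated caption contains a keyword whose mapped attack type
# matches the sample's ground-truth spoof label.
KEYWORDS = {"paper": "print", "screen": "replay"}  # K = {keyword: attack label}

def filter_spoof_captions(samples):
    """samples: list of (image_id, spoof_label, caption) for fake faces.

    Returns the subset D_S whose captions are consistent with their labels.
    """
    kept = []
    for image_id, label, caption in samples:
        for keyword, mapped_label in KEYWORDS.items():
            if keyword in caption.lower() and mapped_label == label:
                kept.append((image_id, label, caption))
                break  # keep each sample at most once
    return kept

fakes = [
    ("img1", "print", "a person holding a paper photo of a face"),
    ("img2", "replay", "a face printed on a paper sheet"),   # label mismatch: dropped
    ("img3", "replay", "a face displayed on a phone screen"),
]
```

In the full algorithm, the surviving subset `D_S` is then used to fine-tune the general captioner `C_G` into the spoof-aware captioner `C_S`.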
Open Source Code | No | The text does not provide an explicit statement of code release or a link to a repository for the methodology described in this paper.
Open Datasets | Yes | We evaluate our method on two protocols. For Protocol 1, following established practices, we implement the leave-one-domain-out testing approach on several datasets: MSU-MFSD (M) (Wen, Han, and Jain 2015), CASIA-MFSD (C) (Zhang et al. 2012), Idiap Replay-Attack (I) (Chingovska, Anjos, and Marcel 2012), and OULU-NPU (O) (Boulkenafet et al. 2017). To assess the robustness of our method in more demanding conditions, we set up Protocol 2 as a One-to-Eleven testing protocol, employing only CelebA-Spoof (Zhang et al. 2020b) as the source domain and 11 datasets as target domains for cross-domain testing. This selection includes MSU-MFSD (Wen, Han, and Jain 2015), CASIA-MFSD (Zhang et al. 2012), Idiap Replay-Attack (Chingovska, Anjos, and Marcel 2012), OULU-NPU (Boulkenafet et al. 2017), SiW (Liu, Jourabloo, and Liu 2018), ROSE-Youtu (Li et al. 2018), HKBU-MARs-V1+ (Liu, Lan, and Yuen 2018), WMCA (George et al. 2019), SiW-Mv2 (Guo et al. 2022), CASIA-SURF 3DMask (Yu et al. 2020a), and HiFiMask (Liu et al. 2022a).
Dataset Splits | Yes | For Protocol 1, following established practices, we implement the leave-one-domain-out testing approach on several datasets: MSU-MFSD (M) (Wen, Han, and Jain 2015), CASIA-MFSD (C) (Zhang et al. 2012), Idiap Replay-Attack (I) (Chingovska, Anjos, and Marcel 2012), and OULU-NPU (O) (Boulkenafet et al. 2017). To assess the robustness of our method in more demanding conditions, we set up Protocol 2 as a One-to-Eleven testing protocol, employing only CelebA-Spoof (Zhang et al. 2020b) as the source domain and 11 datasets as target domains for cross-domain testing.
Hardware Specification | No | The text does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The text mentions pre-trained models (CLIP, OPT-2.7B) but does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9, CUDA 11.1) needed to replicate the experiment.
Experiment Setup | Yes | Implementation Details: We crop the face images and resize them to 224 × 224 × 3 with RGB channels. For the frozen image encoder, we utilize a pre-trained vision model: ViT-L/14 from CLIP (Radford et al. 2021). Following (Li et al. 2023), OPT-2.7B (Zhang et al. 2022) is adopted as the pre-trained large language model. We use the AdamW optimizer, with an initial learning rate set to 10^-5 and a weight decay parameter set to 10^-2. We configure our training process with a batch size of 32 and a maximum of 10 epochs for both Protocol 1 and Protocol 2. For Protocol 2, we meticulously reproduce the baseline methods, including FLIP (Srivatsan, Naseer, and Nandakumar 2023) and ViTAF (Huang et al. 2022), using the official code provided. Both ViT-B and ViT-L are pre-trained with CLIP (Radford et al. 2021). To ensure the integrity and reproducibility of our experiments, we report all results as the mean of three independent runs, each with a unique initialization seed.
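The hyperparameters quoted above can be collected into a single configuration sketch. The dict layout and key names below are my own convention (the paper provides no config file); the values are the ones stated in the implementation details.

```python
# Training configuration reported in the paper's implementation details,
# gathered as a plain dict for reference. Key names are illustrative.
TRAIN_CONFIG = {
    "image_size": (224, 224, 3),       # cropped face, RGB channels
    "image_encoder": "CLIP ViT-L/14",  # frozen, pre-trained
    "language_model": "OPT-2.7B",      # pre-trained LLM
    "optimizer": "AdamW",
    "learning_rate": 1e-5,
    "weight_decay": 1e-2,
    "batch_size": 32,
    "max_epochs": 10,                  # for both Protocol 1 and Protocol 2
    "num_runs": 3,                     # results averaged over 3 seeds
}
```

Such a dict makes it easy to see at a glance which knobs a re-implementation would need to match, and which (hardware, software versions) remain unreported.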