mmFAS: Multimodal Face Anti-Spoofing Using Multi-Level Alignment and Switch-Attention Fusion

Authors: Geng Chen, Wuyuan Xie, Di Lin, Ye Liu, Miaohui Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate the effectiveness of mmFAS in improving the accuracy of FAS systems, outperforming 10 representative methods. (Section 4, Experimental Validation and Analysis)
Researcher Affiliation | Academia | (1) College of CSSE, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University; (2) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; (3) College of Intelligence and Computing, Tianjin University; (4) School of Automation, Nanjing University of Posts and Telecommunications
Pseudocode | No | The paper describes the methodology using text and mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements about code release or links to a code repository; there are no phrases such as "We release our code..." or links to GitHub.
Open Datasets | Yes | In this section, we evaluate the performance of representative FAS models on four commonly used multimodal benchmark datasets. (i) MmFA (Zhang et al. 2020a) comprises a vast collection of 1000 subjects and 21000 video clips, incorporating 3 modalities: RGB, depth, and infrared. (ii) CeFA (Liu et al. 2021) is the largest database in our experiment, including 23346 videos from 1607 subjects, 4 attack types, and three modalities: RGB, depth, and infrared. (iii) WMCA (George et al. 2019) represents the wide multi-channel presentation attack (WMCA) dataset, consisting of 1941 short video recordings under varied conditions and covering 4 modalities (i.e., RGB, depth, infrared, and thermal). (iv) HQ-WMCA (Heusch et al. 2020) contains 2904 recordings from 51 participants with 5 modalities (i.e., RGB, depth, infrared, thermal, and short-wave infrared).
Dataset Splits | Yes | The ratio of image pairs used for training, validating, and testing is 6:1:13.
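The 6:1:13 ratio sums to 20 parts, so roughly 30% of image pairs train the model, 5% validate it, and 65% test it. A minimal sketch of how such counts could be derived is below; the helper name `split_counts` and the rounding behaviour (floor train/val, give the remainder to test) are assumptions, since the report does not specify them:

```python
def split_counts(n_pairs, ratio=(6, 1, 13)):
    """Return (train, val, test) counts for a dataset of n_pairs image pairs.

    Rounding is an assumption: train and val are floored, and the test
    split absorbs the remainder so the three parts always sum to n_pairs.
    """
    total = sum(ratio)
    train = n_pairs * ratio[0] // total
    val = n_pairs * ratio[1] // total
    test = n_pairs - train - val
    return train, val, test


# Example: applying the 6:1:13 ratio to a 21000-item dataset
print(split_counts(21000))  # (6300, 1050, 13650)
```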
Hardware Specification | Yes | mmFAS is trained for 30 epochs on a single NVIDIA A100 GPU with a batch size of 128.
Software Dependencies | No | Our method is implemented in PyTorch, where all images are resized to 224×224. (While PyTorch is mentioned, a specific version number is not provided, nor are other key software dependencies with their versions.)
Experiment Setup | Yes | We use a balance sampler to randomly sample data from a dataset while ensuring that the numbers of bonafide and spoofing images are roughly equal within the same batch. We use three identical, independent ViTs with a depth of 3 and a dimension of 270. The matching predictor Fmp in the class-level matching task consists of a 540-wide FC layer. The fusion modules use transformer blocks with 6 attention heads, a depth of 6, and a dimension of 45 per head; the feature dimension is 270. The fused features are passed through a classification head consisting of two 270-wide FC modules with ReLU activation. In the training stage, we train our mmFAS model using the Adam optimizer with a weight decay set to 1e-2, and use a cosine annealing scheduler with the number of warmup steps set to ten percent of the total steps and a maximum learning rate of 1e-4. mmFAS is trained for 30 epochs on a single NVIDIA A100 GPU with a batch size of 128. For convenience, α and β are set to 1 based on our experiments.
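The quoted schedule (linear warmup over the first ten percent of steps, then cosine annealing from a peak learning rate of 1e-4) can be sketched in plain Python. The function name `lr_at_step` and the final learning rate of 0 after annealing are assumptions; the report does not state a minimum learning rate:

```python
import math


def lr_at_step(step, total_steps, max_lr=1e-4, warmup_frac=0.1):
    """Learning rate at a given 0-indexed optimizer step.

    Sketch of the schedule described in the report: linear warmup for
    the first warmup_frac of steps, then cosine annealing. Decaying to
    exactly 0 is an assumption (no floor is given in the quote).
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine annealing from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

With 30 epochs and a batch size of 128, `total_steps` would be 30 times the number of batches per epoch; the peak of 1e-4 is reached at the end of warmup and decays smoothly afterwards.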