Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
Authors: Zeliang Zhang, Susan Liang, Daiki Shimada, Chenliang Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the Kinetics-Sounds dataset demonstrate that our proposed temporal and modality-based attacks achieve state-of-the-art performance in degrading model performance, while our adversarial training defense largely improves adversarial robustness as well as adversarial training efficiency. |
| Researcher Affiliation | Collaboration | Zeliang Zhang (1), Susan Liang (1), Daiki Shimada (1,2), Chenliang Xu (1); (1) University of Rochester, (2) Sony Group Corporation |
| Pseudocode | No | The paper describes methods using mathematical formulations (e.g., equations 1, 2, 3, 5, 6) and textual descriptions of procedures, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials for the described methodology. |
| Open Datasets | Yes | We use the Kinetics-Sounds dataset (Arandjelovic & Zisserman, 2017) for evaluation, which contains 15,516 10-second video clips in 27 human action categories. We also conduct experiments on MIT-MUSIC (Zhao et al., 2018b) for further verification, which is provided in the appendix. |
| Dataset Splits | Yes | The paper states: "For model training, we split the dataset into 7 : 2 : 1 for training, validation, and testing." Elsewhere it states: "We split the dataset into 7 : 1 : 2 as the train, validation and test set." |
| Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU types, or memory used for conducting the experiments. |
| Software Dependencies | No | The paper mentions various model architectures like VGG, AlexNet, and ResNet, but it does not specify any software names with version numbers (e.g., programming languages, libraries, or frameworks with their versions) that were used to implement the experiments. |
| Experiment Setup | Yes | For simplicity, we use the format {visual backbone}-{fusion layer}-{audio backbone} to represent the audio-visual models, where the initials indicate each backbone and layer. We set the model with VGG as the vision backbone, AlexNet as the audio backbone, and concatenation as the fusion layer as the surrogate model to generate adversarial examples by FGSM (Goodfellow et al., 2015) under the white-box setting, achieving an attack success rate of up to 78.3%. To align the attack setting, we use 10-step PGD adversarial training as the baseline. Ablation headings quoted from the paper: "On the number of iterations for the attack"; "On the sampling ratio for the adversarial training". |
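
The 7:2:1 train/validation/test split quoted in the Dataset Splits row can be sketched as a simple shuffled partition. This is an illustrative reconstruction, not the authors' code; the function name, seed, and use of Python's `random` module are all assumptions.

```python
import random

def split_7_2_1(indices, seed=0):
    """Shuffle a list of sample indices and partition it 70/20/10 into
    train, validation, and test sets (illustrative sketch of the
    7:2:1 split described in the paper; seed choice is arbitrary)."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n = len(idx)
    n_train = int(0.7 * n)
    n_val = int(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

For example, splitting 100 clip indices yields 70 training, 20 validation, and 10 test indices, with no overlap between the three sets.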
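
The Experiment Setup row references FGSM attacks and 10-step PGD adversarial training. A minimal sketch of both, using a logistic-regression "model" with an analytic input gradient as a stand-in for the paper's audio-visual networks (the model, NumPy implementation, and all parameter values are assumptions for illustration only):

```python
import numpy as np

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_grad(x, y, w, b):
    """Analytic gradient of the BCE loss w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w

def fgsm(x, y, w, b, eps):
    """Single-step FGSM: move eps along the sign of the input gradient."""
    return x + eps * np.sign(input_grad(x, y, w, b))

def pgd(x, y, w, b, eps, alpha, steps=10):
    """Multi-step PGD: iterated FGSM steps of size alpha, each followed by
    projection back onto the L-infinity ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(x_adv, y, w, b))
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project onto eps-ball
    return x_adv
```

In the paper's setting these perturbations would be applied to video frames and audio spectrograms via backpropagated gradients; here the gradient is computed in closed form so the sketch stays self-contained.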