Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
Authors: Zeliang Zhang, Susan Liang, Daiki Shimada, Chenliang Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the Kinetics-Sounds dataset demonstrate that our proposed temporal and modality-based attacks achieve state-of-the-art performance in degrading model performance, while our adversarial training defense largely improves adversarial robustness as well as adversarial training efficiency. |
| Researcher Affiliation | Collaboration | Zeliang Zhang (1), Susan Liang (1), Daiki Shimada (1,2), Chenliang Xu (1); (1) University of Rochester, (2) Sony Group Corporation |
| Pseudocode | No | The paper describes methods using mathematical formulations (e.g., equations 1, 2, 3, 5, 6) and textual descriptions of procedures, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials for the described methodology. |
| Open Datasets | Yes | We use the Kinetics-Sounds dataset (Arandjelovic & Zisserman, 2017) for evaluation, which contains 15,516 10-second video clips in 27 human action categories. We also conduct experiments on MIT-MUSIC (Zhao et al., 2018b) for further verification, which is provided in the appendix. |
| Dataset Splits | Yes | The paper states: "For model training, we split the dataset into 7 : 2 : 1 for training, validation, and testing." Elsewhere it states: "We split the dataset into 7 : 1 : 2 as the train, validation and test set." |
| Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU types, or memory used for conducting the experiments. |
| Software Dependencies | No | The paper mentions various model architectures like VGG, AlexNet, and ResNet, but it does not specify any software names with version numbers (e.g., programming languages, libraries, or frameworks with their versions) that were used to implement the experiments. |
| Experiment Setup | Yes | For simplicity, we use the format {visual backbone}-{fusion layer}-{audio backbone} to represent the audio-visual models, where the initials indicate each backbone and layer. We set the model with VGG as the vision backbone, AlexNet as the audio backbone, and concatenation as the fusion layer as the surrogate model to generate adversarial examples by FGSM (Goodfellow et al., 2015) under the white-box setting, achieving an attack success rate of up to 78.3%. To align the attack setting, we use 10-step PGD adversarial training as the baseline. Ablation headings quoted from the paper: "On the number of iterations for the attack"; "On the sampling ratio for the adversarial training". |
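
The 7:2:1 train/validation/test split quoted in the Dataset Splits row can be sketched as a simple shuffled partition. This is an illustrative reconstruction, not the authors' code; the function name, seed, and use of Python's `random` module are all assumptions.

```python
import random

def split_7_2_1(indices, seed=0):
    """Shuffle a list of sample indices and partition it 70/20/10 into
    train, validation, and test sets (illustrative sketch of the
    7:2:1 split described in the paper; seed choice is arbitrary)."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n = len(idx)
    n_train = int(0.7 * n)
    n_val = int(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

For example, splitting 100 clip indices yields 70 training, 20 validation, and 10 test indices, with no overlap between the three sets.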
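
The Experiment Setup row references FGSM attacks and 10-step PGD adversarial training. A minimal sketch of both, using a logistic-regression "model" with an analytic input gradient as a stand-in for the paper's audio-visual networks (the model, NumPy implementation, and all parameter values are assumptions for illustration only):

```python
import numpy as np

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_grad(x, y, w, b):
    """Analytic gradient of the BCE loss w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w

def fgsm(x, y, w, b, eps):
    """Single-step FGSM: move eps along the sign of the input gradient."""
    return x + eps * np.sign(input_grad(x, y, w, b))

def pgd(x, y, w, b, eps, alpha, steps=10):
    """Multi-step PGD: iterated FGSM steps of size alpha, each followed by
    projection back onto the L-infinity ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(x_adv, y, w, b))
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project onto eps-ball
    return x_adv
```

In the paper's setting these perturbations would be applied to video frames and audio spectrograms via backpropagated gradients; here the gradient is computed in closed form so the sketch stays self-contained.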