Counterfactual Debiasing for Physical Audiovisual Commonsense Reasoning

Authors: Daoming Zong, Chaoyue Ding, Kaitao Chen, Yinsheng Li, Shuaiyu Wang

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments validate the effectiveness and generalizability of CF-PACR, demonstrating considerable improvements over traditional PACR models using counterfactual inference. |
| Researcher Affiliation | Collaboration | SenseTime Research; School of Computer Science, Fudan University, Shanghai, China. |
| Pseudocode | No | The paper describes the CF-PACR framework conceptually and mathematically but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology is released, nor does it provide a link to a code repository. |
| Open Datasets | Yes | PACS (Yu et al. 2022) is a video-based audiovisual benchmark designed to evaluate a model's ability to reason about physical commonsense using audio and visual modalities. Yu, S.; Wu, P.; Liang, P. P.; Salakhutdinov, R.; and Morency, L.-P. 2022. PACS: A Dataset for Physical Audiovisual Commonsense Reasoning. arXiv preprint arXiv:2203.11130. |
| Dataset Splits | Yes | The training, validation, and test sets for PACS-QA consist of 11,044, 1,192, and 1,164 samples respectively. For PACS-Material, the training, validation, and test sets comprise 3,460, 444, and 445 samples respectively. |
| Hardware Specification | Yes | All variants were trained on four NVIDIA Tesla V100 GPUs with a batch size of 16, 30 epochs, a weight decay of 1e-4, and an initial learning rate of 1e-3. |
| Software Dependencies | No | The paper mentions several pre-trained models and frameworks (e.g., CLIP, AudioCLIP, MERLOT Reserve, ViT, AST, TDN, DeBERTa-V3) but does not provide version numbers for ancillary software such as Python, PyTorch, or TensorFlow that would be needed to replicate the experiment. |
| Experiment Setup | Yes | All variants were trained on four NVIDIA Tesla V100 GPUs with a batch size of 16, 30 epochs, a weight decay of 1e-4, and an initial learning rate of 1e-3. For the CF-PACR framework, the hyperparameters α, β, γ, and τ were tuned within [0, 1] at 0.1 intervals. |
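The hyperparameter sweep reported in the Experiment Setup row (α, β, γ, and τ each tuned within [0, 1] at 0.1 intervals) amounts to an exhaustive grid search. A minimal sketch follows; `evaluate` is a hypothetical stand-in for CF-PACR's validation metric and is not code from the paper:

```python
import itertools

def grid_values(step=0.1, lo=0.0, hi=1.0):
    """Grid points [0.0, 0.1, ..., 1.0], rounded to avoid float drift."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 10) for i in range(n + 1)]

def tune(evaluate):
    """Exhaustively search (alpha, beta, gamma, tau) over the grid.

    `evaluate` is assumed to return a validation score (higher is better);
    the full grid has 11**4 = 14,641 combinations.
    """
    best_score, best_params = float("-inf"), None
    for params in itertools.product(grid_values(), repeat=4):
        score = evaluate(*params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

In practice such a sweep would be run once per task (e.g., PACS-QA and PACS-Material separately), selecting the combination that maximizes validation accuracy.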