AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Authors: Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok, Arda Senocak, Joon Son Chung, Tae-Hyun Oh

Venue: ICLR 2025

Reproducibility variables, each with the extracted result and the supporting LLM response:

Research Type: Experimental
  LLM Response: "Our evaluation results on AVHBench reveal that current audio-visual LLMs are prone to both audio-driven and video-driven hallucinations."

Researcher Affiliation: Academia
  LLM Response: "Kim Sung-Bin1, Oh Hyun-Bin1, Lee Jung-Mok1, Arda Senocak2, Joon Son Chung2, Tae-Hyun Oh1,3,4. 1Dept. of Electrical Engineering and 3Grad. School of Artificial Intelligence, POSTECH; 2School of Electrical Engineering, KAIST; 4School of Computing, KAIST"

Pseudocode: No
  LLM Response: "The paper describes methods through narrative text and a pipeline diagram (Fig. 3) but does not include any explicitly labeled pseudocode or algorithm blocks."

Open Source Code: Yes
  LLM Response: "Dataset: https://github.com/kaist-ami/AVHBench"

Open Datasets: Yes
  LLM Response: "To address this, we repurpose existing datasets, namely VALOR (Chen et al., 2023b) and AudioCaps (Kim et al., 2019), leveraging their videos and annotations."

Dataset Splits: Yes
  LLM Response: "In total, our comprehensive test and validation sets comprise 1,106 real and 1,030 synthetic source videos. [...] This dataset contains 10,327 videos with 87,624 QnA pairs, collected from the training split of the VALOR (Chen et al., 2023b) and AudioCaps (Kim et al., 2019) datasets."

Hardware Specification: Yes
  LLM Response: "We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch."

Software Dependencies: No
  LLM Response: "The paper mentions using 'mixed precision' (fp16 for multiplication and fp32 for addition) but does not provide specific software dependencies or library versions such as Python, PyTorch, or CUDA versions."

Experiment Setup: Yes
  LLM Response: "We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch. [...] We set the rank and alpha value of LoRA to 16 and 32, respectively."
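The reported fine-tuning hyperparameters can be gathered into a single configuration sketch. The dictionary below only restates values quoted above (4x A6000, batch size 32 per device, lr 3e-5, weight decay 0.05, 1 epoch, LoRA rank 16 / alpha 32); the key names are my own, and the `r`/`lora_alpha` fields are an assumption about how the LoRA settings would map onto a typical PEFT-style implementation, not code from the paper.

```python
# Hyperparameters as reported in the AVHBench fine-tuning setup (key names are illustrative).
train_config = {
    "num_gpus": 4,                 # 4x NVIDIA A6000 (48 GB each), distributed training
    "per_device_batch_size": 32,
    "learning_rate": 3e-5,
    "weight_decay": 0.05,
    "num_epochs": 1,
    "lora": {"r": 16, "lora_alpha": 32},  # LoRA rank and alpha
}

# Under data-parallel training, the effective (global) batch size is the
# per-device batch size multiplied by the number of devices.
effective_batch_size = (
    train_config["num_gpus"] * train_config["per_device_batch_size"]
)
print(effective_batch_size)  # 128
```

Note that the paper reports only the per-device batch size; the global batch size of 128 is derived, and would shrink if gradient accumulation or fewer devices were used.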