AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Authors: Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation results on AVHBench reveal that current audio-visual LLMs are prone to both audio-driven and video-driven hallucinations. |
| Researcher Affiliation | Academia | Kim Sung-Bin¹, Oh Hyun-Bin¹, Lee Jung-Mok¹, Arda Senocak², Joon Son Chung², Tae-Hyun Oh¹'³'⁴; ¹Dept. of Electrical Engineering and ³Grad. School of Artificial Intelligence, POSTECH; ²School of Electrical Engineering, KAIST; ⁴School of Computing, KAIST |
| Pseudocode | No | The paper describes methods through narrative text and a pipeline diagram (Fig. 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Dataset: https://github.com/kaist-ami/AVHBench |
| Open Datasets | Yes | To address this, we repurpose existing datasets, namely VALOR (Chen et al., 2023b) and AudioCaps (Kim et al., 2019), leveraging their videos and annotations. |
| Dataset Splits | Yes | In total, our comprehensive test and validation sets comprise 1,106 real and 1,030 synthetic source videos. [...] This dataset contains 10,327 videos with 87,624 QnA pairs, collected from the training split of the VALOR (Chen et al., 2023b) and AudioCaps (Kim et al., 2019) datasets. |
| Hardware Specification | Yes | We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch. |
| Software Dependencies | No | The paper mentions using 'mixed precision' (fp16 for multiplication and fp32 for addition) but does not provide specific software dependencies or library versions such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch. [...] We set the rank and alpha value of LoRA to 16 and 32, respectively. |
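The experiment setup row reports LoRA with rank 16 and alpha 32. As a minimal sketch of what that parameterization means, the snippet below implements a single LoRA-adapted linear layer in NumPy; the layer sizes, initialization, and input here are illustrative assumptions, not the authors' actual model or code.

```python
import numpy as np

# LoRA hyperparameters as reported in the paper's setup.
rank, alpha = 16, 32
# Illustrative layer dimensions (assumption, not from the paper).
d_in, d_out = 64, 64

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus low-rank update, scaled by alpha / rank (= 2.0 here).
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapter contributes nothing before training,
# so the adapted layer initially matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

The alpha/rank ratio acts as a fixed scale on the low-rank update, so reporting both values (16 and 32) pins down the effective adapter strength used during fine-tuning.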