AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Authors: Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation results on AVHBench reveal that current audio-visual LLMs are prone to both audio-driven and video-driven hallucinations. |
| Researcher Affiliation | Academia | Kim Sung-Bin¹, Oh Hyun-Bin¹, Lee Jung-Mok¹, Arda Senocak², Joon Son Chung², Tae-Hyun Oh¹'³'⁴; ¹Dept. of Electrical Engineering and ³Grad. School of Artificial Intelligence, POSTECH; ²School of Electrical Engineering, KAIST; ⁴School of Computing, KAIST |
| Pseudocode | No | The paper describes methods through narrative text and a pipeline diagram (Fig. 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Dataset: https://github.com/kaist-ami/AVHBench |
| Open Datasets | Yes | To address this, we repurpose existing datasets, namely VALOR (Chen et al., 2023b) and AudioCaps (Kim et al., 2019), leveraging their videos and annotations. |
| Dataset Splits | Yes | In total, our comprehensive test and validation sets comprise 1,106 real and 1,030 synthetic source videos. [...] This dataset contains 10,327 videos with 87,624 QnA pairs, collected from the training split of the VALOR (Chen et al., 2023b) and AudioCaps (Kim et al., 2019) datasets. |
| Hardware Specification | Yes | We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch. |
| Software Dependencies | No | The paper mentions using 'mixed precision' (fp16 for multiplication and fp32 for addition) but does not provide specific software dependencies or library versions such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch. [...] We set the rank and alpha value of LoRA to 16 and 32, respectively. |
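The experiment setup row reports LoRA with rank 16 and alpha 32. As a minimal sketch of what that parameterization means, the snippet below implements a single LoRA-adapted linear layer in NumPy; the layer sizes, initialization, and input here are illustrative assumptions, not the authors' actual model or code.

```python
import numpy as np

# LoRA hyperparameters as reported in the paper's setup.
rank, alpha = 16, 32
# Illustrative layer dimensions (assumption, not from the paper).
d_in, d_out = 64, 64

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus low-rank update, scaled by alpha / rank (= 2.0 here).
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapter contributes nothing before training,
# so the adapted layer initially matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

The alpha/rank ratio acts as a fixed scale on the low-rank update, so reporting both values (16 and 32) pins down the effective adapter strength used during fine-tuning.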