AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

Authors: Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... Our experimental results demonstrate that AGAV-Rater achieves state-of-the-art performance on three quality assessment datasets: AGAVQA-MOS, text-to-audio (TTA), and text-to-music (TTM) (Deshmukh et al., 2024). Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience.
Researcher Affiliation | Academia | 1) Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai; 2) School of Communication & Electronic Engineering, East China Normal University, Shanghai. Correspondence to: Guangtao Zhai <EMAIL>, Xiongkuo Min <EMAIL>.
Pseudocode | No | The paper describes the model architecture and training process with figures (e.g., Figure 3), but it does not include a dedicated pseudocode block or algorithm section.
Open Source Code | Yes | The dataset and code are available at https://github.com/charlotte9524/AGAVRater.
Open Datasets | Yes | To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... The dataset and code are available at https://github.com/charlotte9524/AGAVRater. ... we create 50,952 instruction-response pairs related to the perceived quality from 3 large-scale real-world audio-caption datasets, including the audio-visual dataset VGGSound (Chen et al., 2020), the audio captioning dataset AudioCaps (Kim et al., 2019), and the music captioning dataset MusicCaps (Agostinelli et al., 2023).
Dataset Splits | Yes | All experiments for each method are retrained on the AGAVQA-MOS subset using 5-fold cross-validation.
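The 5-fold protocol above can be sketched as follows. The dataset size (3,382 AGAVs) is taken from the paper; the index-splitting function and its name are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of a 5-fold cross-validation split over 3,382 AGAVs.
# The splitting logic here is an assumption for illustration only.
import random

def five_fold_splits(num_samples, num_folds=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    fold_size = num_samples // num_folds
    splits = []
    for k in range(num_folds):
        start = k * fold_size
        # The last fold absorbs the remainder so every sample is tested once.
        end = (k + 1) * fold_size if k < num_folds - 1 else num_samples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        splits.append((train, test))
    return splits

splits = five_fold_splits(3382)
# Every AGAV appears in exactly one test fold, and train/test never overlap.
assert sorted(i for _, test in splits for i in test) == list(range(3382))
assert all(set(train).isdisjoint(test) for train, test in splits)
```

Each retrained method is then evaluated on its held-out fold, and the per-fold results are averaged.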
Hardware Specification | Yes | The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. ... Fine-tuning the AGAV-Rater model on the AGAVQA-MOS subset for 5 epochs using two 96GB H20 GPUs takes approximately 5 hours. ... In Tab. 9, we report the inference latency of AGAV-Rater on AGAVs. On a single RTX 4090 GPU, the model can score 6.36 three-second videos per second, or 3.01 twelve-second videos per second.
Software Dependencies | No | The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. The learning rate is set to 1e-5, and the batch size is set to 9.
Experiment Setup | Yes | The learning rate is set to 1e-5, and the batch size is set to 9. During pre-training, the number of training epochs is set to 1, and optimization is performed. For fine-tuning, the number of training epochs is set to 5 on the AGAVQA-MOS subset and 10 on the TTA and TTM datasets (Deshmukh et al., 2024).
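The reported hyperparameters can be collected into a minimal configuration sketch. The values (learning rate 1e-5, batch size 9, epoch counts 1/5/10) are quoted from the paper; the `TrainConfig` structure and stage names are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of the reported training configuration.
# Field names and the dataclass layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    learning_rate: float = 1e-5  # reported for all stages
    batch_size: int = 9          # reported for all stages
    epochs: int = 1

PRETRAIN = TrainConfig(epochs=1)           # pre-training stage
FINETUNE_AGAVQA = TrainConfig(epochs=5)    # AGAVQA-MOS subset
FINETUNE_TTA_TTM = TrainConfig(epochs=10)  # TTA / TTM datasets
```

This layout makes the stage-specific epoch counts explicit while sharing the common learning rate and batch size.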