AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment
Authors: Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... Our experimental results demonstrate that AGAV-Rater achieves state-of-the-art performance on three quality assessment datasets: AGAVQA-MOS, text-to-audio (TTA), and text-to-music (TTM) (Deshmukh et al., 2024). Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. |
| Researcher Affiliation | Academia | 1Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai 2School of Communication & Electronic Engineering, East China Normal University, Shanghai. Correspondence to: Guangtao Zhai <EMAIL>, Xiongkuo Min <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and training process with figures (e.g., Figure 3), but it does not include a dedicated pseudocode block or algorithm section. |
| Open Source Code | Yes | The dataset and code are available at https://github.com/charlotte9524/AGAVRater. |
| Open Datasets | Yes | To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... The dataset and code are available at https://github.com/charlotte9524/AGAVRater. ... we create 50,952 instruction-response pairs related to the perceived quality from 3 large-scale real-world audio-caption datasets, including the audio-visual dataset VGGSound (Chen et al., 2020), the audio captioning dataset AudioCaps (Kim et al., 2019), and the music captioning dataset MusicCaps (Agostinelli et al., 2023). |
| Dataset Splits | Yes | All experiments for each method are retrained on the AGAVQA-MOS subset using 5-fold cross-validation. |
| Hardware Specification | Yes | The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. ... Fine-tuning the AGAV-Rater model on the AGAVQA-MOS subset for 5 epochs using two 96GB H20 GPUs takes approximately 5 hours. ... In Tab. 9, we report the inference latency of AGAV-Rater on AGAVs. On a single RTX 4090 GPU, the model can predict scores for 6.36 videos of 3 seconds, or 3.01 videos of 12 seconds, per second. |
| Software Dependencies | No | The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. The learning rate is set to 1e-5, and the batch size is set to 9. |
| Experiment Setup | Yes | The learning rate is set to 1e-5, and the batch size is set to 9. During pre-training, the number of training epochs is set to 1, and optimization is performed. For fine-tuning, the number of training epochs is set to 5 on the AGAVQA-MOS subset and 10 on the TTA and TTM datasets (Deshmukh et al., 2024). |
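To make the reported setup concrete, the hyperparameters quoted above and the 5-fold cross-validation protocol can be sketched as follows. This is an illustrative reconstruction, not code from the authors' repository: the `CONFIG` dict and `five_fold_splits` helper are hypothetical names, and only the numeric values (learning rate 1e-5, batch size 9, epoch counts, 3,382 AGAVs) come from the paper.

```python
# Hypothetical sketch of the reported training configuration; values are
# taken from the paper's Experiment Setup, names are illustrative.
CONFIG = {
    "learning_rate": 1e-5,
    "batch_size": 9,
    "pretrain_epochs": 1,
    "finetune_epochs": {"AGAVQA-MOS": 5, "TTA": 10, "TTM": 10},
}


def five_fold_splits(n_items, n_folds=5):
    """Partition item indices into n_folds disjoint train/test splits,
    as used for the 5-fold cross-validation on the AGAVQA-MOS subset."""
    indices = list(range(n_items))
    fold_size = n_items // n_folds
    splits = []
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs the remainder when n_items % n_folds != 0.
        end = start + fold_size if k < n_folds - 1 else n_items
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        splits.append((train, test))
    return splits


# 3,382 AGAVs in AGAVQA-3k, per the paper.
splits = five_fold_splits(3382)
```

Note that the paper does not state whether folds are split per-video or per-VTA-method; the index-based partition above is the simplest reading and would need to be checked against the released code.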