Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
Authors: Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that AVicuna, fine-tuned on PU-VALOR, effectively handles temporal understanding in audio-visual videos, achieving outstanding performance in both coarse-grained QA tasks and fine-grained temporal understanding tasks, as shown in Figure 1. It achieves state-of-the-art results on open-ended video QA, audio-visual QA, and audio-visual event dense localization, surpassing most LLM-based video understanding models and setting a new benchmark on the Audio-Visual Event Dense Localization (AVEDL) task. We conduct ablation studies, as shown in Table 4, to assess the impact of different components, datasets, and modalities on AVicuna's performance. |
| Researcher Affiliation | Collaboration | Yunlong Tang1, Daiki Shimada2, Jing Bi1, Mingqian Feng1, Hang Hua1, Chenliang Xu1, * 1University of Rochester 2Sony Group Corporation EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods with formulas and pipeline diagrams (e.g., Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. It mentions "More details are provided in our technical appendices (Tang et al. 2024b)" but this does not confirm code availability. |
| Open Datasets | Yes | To tackle the challenge (1), we propose a practical yet straightforward pipeline that leverages the VALOR-32K (Chen et al. 2023b) dataset with high-quality audio-visual captions to construct the PU-VALOR (Pseudo-Untrimmed VALOR) dataset... We have also aggregated several audio datasets, including AudioSet (Gemmeke et al. 2017), AudioCaps (Kim et al. 2019), and Auto-ACD (Sun et al. 2023), to form a comprehensive audio-text dataset with 222K pairs, termed A5-222K... We use the InternVid (Wang et al. 2023d) dataset to enrich visual event alignment training... We evaluate temporal understanding using tasks across various domains: Video Question Answering (Video QA), Audio-visual Video Question Answering (AVQA), and Audio-Visual Event Dense Localization (AVEDL). For General Video QA, zero-shot evaluation is performed on the MSVD-QA (Chen and Dolan 2011), MSRVTT-QA (Xu et al. 2016), and ActivityNet-QA (Yu et al. 2019) datasets... AVQA tasks are assessed on the AVSD (Alamri et al. 2019) and MUSIC-AVQA (Li et al. 2022) datasets. The AVEDL task uses the UnAV-100 (Geng et al. 2023) dataset... |
| Dataset Splits | No | The paper mentions using several datasets for fine-tuning and evaluation, such as LCS-558K, A5-222K, InternVid, UnAV-100, MSVD-QA, MSRVTT-QA, ActivityNet-QA, AVSD, and MUSIC-AVQA. It states that "zero-shot evaluation is performed" for some tasks, implying standard test sets are used. However, no explicit percentages, sample counts, or detailed methodologies for splitting these datasets into training, validation, and testing sets are provided within the main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory configurations used for running the experiments. It states that "More details are provided in our technical appendices (Tang et al. 2024b)" but these are not in the main paper. |
| Software Dependencies | No | The paper mentions using specific models/frameworks such as "CLIP ViT-L/14 (Radford et al. 2021) as Vision Encoder", "CLAP (Elizalde et al. 2023) as Audio Encoder", "Vicuna-7B-v1.5 (Touvron et al. 2023) as our LLM", and fine-tuning "LoRA (Hu et al. 2022a) parameters". However, it does not specify any programming language versions (e.g., Python 3.8) or library versions (e.g., PyTorch 1.9, CUDA 11.1) needed to reproduce the experiments. |
| Experiment Setup | No | The paper describes a "four-stage fine-tuning process" and details the components of the AVicuna model and how they interact. It mentions "uniformly extract a minimum of 100 frames from each video" and discusses "Audio-Interleaving Rates (AIR)" and their impact in Figure 4. However, it lacks concrete hyperparameters such as the specific learning rate, batch size, number of epochs, type of optimizer used, or other detailed training configurations. It defers to "technical appendices (Tang et al. 2024b)" for more details. |
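The only concrete preprocessing detail reported in the Experiment Setup row is the uniform extraction of at least 100 frames per video. A minimal sketch of what such uniform sampling could look like is shown below; the function and parameter names are illustrative assumptions, not taken from the paper, and the paper's actual implementation is not available for comparison.

```python
# Hypothetical sketch of uniform frame sampling ("uniformly extract a
# minimum of 100 frames from each video"). Names are illustrative only.
def uniform_frame_indices(total_frames: int, num_samples: int = 100) -> list[int]:
    """Return num_samples frame indices spread evenly across the video."""
    if total_frames <= num_samples:
        # Short videos: keep every frame (the paper says *at least* 100,
        # so how sub-100-frame videos are handled is an assumption here).
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]
```

For example, a 1000-frame video would yield indices 0, 10, 20, ..., 990, giving exactly 100 evenly spaced frames.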