Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
Authors: Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that AVicuna, fine-tuned on PU-VALOR, effectively handles temporal understanding in audio-visual videos, achieving outstanding performance in both coarse-grained QA tasks and fine-grained temporal understanding tasks, as shown in Figure 1. It achieves state-of-the-art results on open-ended video QA, audio-visual QA, and audio-visual event dense localization, surpassing most LLM-based video understanding models and setting a new benchmark on the Audio-Visual Event Dense Localization (AVEDL) task. We conduct ablation studies, as shown in Table 4, to assess the impact of different components, datasets, and modalities on AVicuna's performance. |
| Researcher Affiliation | Collaboration | Yunlong Tang1, Daiki Shimada2, Jing Bi1, Mingqian Feng1, Hang Hua1, Chenliang Xu1, * 1University of Rochester 2Sony Group Corporation EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods with formulas and pipeline diagrams (e.g., Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. It mentions "More details are provided in our technical appendices (Tang et al. 2024b)" but this does not confirm code availability. |
| Open Datasets | Yes | To tackle the challenge (1), we propose a practical yet straightforward pipeline that leverages the VALOR-32K (Chen et al. 2023b) dataset with high-quality audio-visual captions to construct the PU-VALOR (Pseudo-Untrimmed VALOR) dataset... We have also aggregated several audio datasets, including AudioSet (Gemmeke et al. 2017), AudioCaps (Kim et al. 2019), and Auto-ACD (Sun et al. 2023), to form a comprehensive audio-text dataset with 222K pairs, termed A5-222K... We use the InternVid (Wang et al. 2023d) dataset to enrich visual event alignment training... We evaluate temporal understanding using tasks across various domains: Video Question Answering (Video QA), Audio-visual Video Question Answering (AVQA), and Audio-Visual Event Dense Localization (AVEDL). For General Video QA, zero-shot evaluation is performed on the MSVD-QA (Chen and Dolan 2011), MSRVTT-QA (Xu et al. 2016), and ActivityNet-QA (Yu et al. 2019) datasets... AVQA tasks are assessed on the AVSD (Alamri et al. 2019) and MUSIC-AVQA (Li et al. 2022) datasets. The AVEDL task uses the UnAV-100 (Geng et al. 2023) dataset... |
| Dataset Splits | No | The paper mentions using several datasets for fine-tuning and evaluation, such as LCS-558K, A5-222K, InternVid, UnAV-100, MSVD-QA, MSRVTT-QA, ActivityNet-QA, AVSD, and MUSIC-AVQA. It states that "zero-shot evaluation is performed" for some tasks, implying standard test sets are used. However, no explicit percentages, sample counts, or detailed methodologies for splitting these datasets into training, validation, and testing sets are provided within the main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory configurations used for running the experiments. It states that "More details are provided in our technical appendices (Tang et al. 2024b)" but these are not in the main paper. |
| Software Dependencies | No | The paper mentions using specific models/frameworks such as "CLIP ViT-L/14 (Radford et al. 2021) as Vision Encoder", "CLAP (Elizalde et al. 2023) as Audio Encoder", "Vicuna-7B-v1.5 (Touvron et al. 2023) as our LLM", and fine-tuning "LoRA (Hu et al. 2022a) parameters". However, it does not specify any programming language versions (e.g., Python 3.8) or library versions (e.g., PyTorch 1.9, CUDA 11.1) needed to reproduce the experiments. |
| Experiment Setup | No | The paper describes a "four-stage fine-tuning process" and details the components of the AVicuna model and how they interact. It mentions "uniformly extract a minimum of 100 frames from each video" and discusses "Audio-Interleaving Rates (AIR)" and their impact in Figure 4. However, it lacks concrete hyperparameters such as the specific learning rate, batch size, number of epochs, type of optimizer used, or other detailed training configurations. It defers to "technical appendices (Tang et al. 2024b)" for more details. |
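The only concrete preprocessing detail reported in the Experiment Setup row is the uniform extraction of at least 100 frames per video. A minimal sketch of what such uniform sampling could look like is shown below; the function and parameter names are illustrative assumptions, not taken from the paper, and the paper's actual implementation is not available for comparison.

```python
# Hypothetical sketch of uniform frame sampling ("uniformly extract a
# minimum of 100 frames from each video"). Names are illustrative only.
def uniform_frame_indices(total_frames: int, num_samples: int = 100) -> list[int]:
    """Return num_samples frame indices spread evenly across the video."""
    if total_frames <= num_samples:
        # Short videos: keep every frame (the paper says *at least* 100,
        # so how sub-100-frame videos are handled is an assumption here).
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]
```

For example, a 1000-frame video would yield indices 0, 10, 20, ..., 990, giving exactly 100 evenly spaced frames.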