Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du

IJCAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound. ... We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset... Table 1: Evaluation results for Video-to-Audio generation across three test sets... Table 3: We explore the effect of different cinematic language variations f during training...
Researcher Affiliation | Academia | Feizhen Huang, Yu Wu, Yutian Lin and Bo Du, School of Computer Science, Wuhan University
Pseudocode | No | The paper describes methods and models using mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a specific link to source code, nor does it contain an explicit statement about releasing its code in supplementary materials or otherwise. It mentions building upon 'the open-source Diff-Foley [Luo et al., 2024]', but this refers to a third-party tool, not the authors' own implementation code.
Open Datasets | Yes | We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset containing over 200,000 video clips across 309 distinct sound categories.
Dataset Splits | Yes | We follow the original VGGSound train/test split. ... To evaluate performance under partial visibility, we create two modified test sets by applying cinematic language variations to the VGGSound [Chen et al., 2020a] test set. (A hedged sketch of such a variation follows the table.)
Hardware Specification | Yes | The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a total batch size of 32.
Software Dependencies | No | The paper mentions using a pre-trained video encoder from CAVP [Luo et al., 2024] and building upon Diff-Foley [Luo et al., 2024], but it does not specify version numbers for these or any other software components (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | The input video clips are sampled at 4 frames per second (FPS)... For training, we only apply cinematic language variation f_cu on the VGGSound [Chen et al., 2020a] training set with k = 75%, where a1 = 0.4 and a2 = 0.6. The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a total batch size of 32. ... we use only the CFG [Ho and Salimans, 2022] configuration in Diff-Foley, keeping all other experimental settings unchanged, including the DPM-Solver [Lu et al., 2022] sampler with 25 inference steps and a CFG scale of 4.5. (A hedged configuration sketch follows the table.)
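
The rows above quote the cinematic language variation f_cu with k = 75%, a1 = 0.4, and a2 = 0.6, but this extract does not reproduce its definition. The following is a minimal sketch of one plausible reading, assuming f_cu is a crop-style transform in which k sets the retained side length and a1/a2 bound the crop window's relative offset; the function name and all parameter interpretations are ours, not the authors'.

    import numpy as np

    def apply_cut_variation(frames, k=0.75, a1=0.4, a2=0.6, rng=None):
        """Hypothetical crop-style variation simulating a partially
        visible sound source.

        frames: (T, H, W, C) array of video frames (e.g., 4 FPS clips).
        k:      assumed fraction of each side retained after cropping.
        a1, a2: assumed bounds on the crop window's relative offset;
                the extract does not define them, so this is a guess.
        """
        if rng is None:
            rng = np.random.default_rng()
        T, H, W, C = frames.shape
        ch, cw = int(H * k), int(W * k)
        # One offset per clip keeps the crop temporally consistent.
        top = int(rng.uniform(a1, a2) * (H - ch))
        left = int(rng.uniform(a1, a2) * (W - cw))
        return frames[:, top:top + ch, left:left + cw, :]

Under this reading, applying the same transform to the test split would produce the modified evaluation sets described in the Dataset Splits row.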
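
For quick reference, the quoted training and inference hyperparameters can be collected into a single configuration. This is a summary of the numbers above, not the authors' code; the dictionary names are illustrative.

    # Training settings quoted in the Experiment Setup row.
    train_cfg = dict(
        video_fps=4,           # input clips sampled at 4 FPS
        variation="f_cu",      # only f_cu applied during training
        k=0.75,                # with a1 = 0.4 and a2 = 0.6
        a1=0.4,
        a2=0.6,
        epochs=25,
        num_gpus=4,            # NVIDIA 4090
        optimizer="AdamW",
        learning_rate=5e-4,
        total_batch_size=32,
    )

    # Inference settings, unchanged from Diff-Foley apart from CFG use.
    sampler_cfg = dict(
        sampler="DPM-Solver",  # [Lu et al., 2022]
        inference_steps=25,
        cfg_scale=4.5,         # classifier-free guidance [Ho and Salimans, 2022]
    )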