VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning

Authors: Ji Soo Lee, Jongha Kim, Jeehye Na, Jinyoung Park, Hyunwoo J. Kim

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our VidChain on two benchmarks, ActivityNet Captions and YouCook2, for the challenging DVC task, and ActivityNet Captions for temporal video grounding (TVG). In sum, our contributions are three-fold: ...
Researcher Affiliation Academia Department of Computer Science and Engineering, Korea University
Pseudocode No The paper describes methods and processes textually and with mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to code repositories.
Open Datasets Yes We experiment on two different dense video captioning benchmarks, ActivityNet Captions (Krishna et al. 2017) and YouCook2 (Zhou, Xu, and Corso 2018).
Dataset Splits Yes We construct 10K and 1K samples for ActivityNet and YouCook2 respectively for each path using the pre-defined templates, where the templates are provided in the supplement. Note we refer to each of the two types of dataset as D_t→c and D_c→t, respectively. Then, we combine our obtained dataset with the DVC QA pairs and dialogues following VTimeLLM (Huang et al. 2024). Note that we adopt the full benchmark dataset unlike VTimeLLM, which only uses a selected subset for training. This results in D_CT of size 50K for ActivityNet and 6K for YouCook2. Overall, we use D_CT to finetune Video LLMs, enhancing their performance on fine-grained video understanding tasks, including DVC and its sub-tasks. Further details are in the supplement. ... Visualization is done on the ActivityNet validation set with VTimeLLM in the P_c→t path.
Hardware Specification No The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for conducting the experiments.
Software Dependencies No The paper does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks used in the implementation.
Experiment Setup No The paper mentions hyperparameters such as β and γ but does not provide their specific values or other critical experimental setup details, such as learning rate, batch size, optimizer configuration, or number of epochs.