VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
Authors: Ji Soo Lee, Jongha Kim, Jeehye Na, Jinyoung Park, Hyunwoo J. Kim
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our VidChain on two benchmarks, ActivityNet Captions and YouCook2, for the challenging DVC task, and ActivityNet Captions for temporal video grounding (TVG). In sum, our contributions are three-fold: ... |
| Researcher Affiliation | Academia | Department of Computer Science and Engineering, Korea University |
| Pseudocode | No | The paper describes methods and processes textually and with mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to code repositories. |
| Open Datasets | Yes | We experiment on two different dense video captioning benchmarks, ActivityNet Captions (Krishna et al. 2017) and YouCook2 (Zhou, Xu, and Corso 2018). |
| Dataset Splits | Yes | We construct 10K and 1K samples for ActivityNet and YouCook2 respectively for each path using the pre-defined templates, where the templates are provided in the supplement. Note we refer to each of the two types of dataset as D_t→c and D_c→t, respectively. Then, we combine our obtained dataset with the DVC QA pairs and dialogues following VTimeLLM (Huang et al. 2024). Note that we adopt the full benchmark dataset unlike VTimeLLM, which only uses a selected subset for training. This results in D_CT of size 50K for ActivityNet and 6K for YouCook2. Overall, we use D_CT to finetune Video-LLMs, enhancing their performance on fine-grained video understanding tasks, including DVC and its sub-tasks. Further details are in the supplement. ... Visualization is done on ActivityNet validation set with VTimeLLM in P_c→t path. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for conducting the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | No | The paper mentions hyperparameters like 'β' and 'γ' but does not provide their specific values or other critical experimental setup details such as learning rate, batch size, optimizer configuration, or number of epochs. |