A Video-grounded Dialogue Dataset and Metric for Event-driven Activities

Authors: Wiradee Imrattanatrai, Masaki Asada, Kimihiro Hasegawa, Zhi-Qi Cheng, Ken Fukuda, Teruko Mitamura

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.
Researcher Affiliation | Academia | ¹National Institute of Advanced Industrial Science and Technology (AIST); ²Language Technologies Institute, Carnegie Mellon University
Pseudocode | No | The paper describes its methods and processes but does not include any clearly labeled pseudocode or algorithm blocks. The steps for data collection and evaluation are described in paragraph form.
Open Source Code | Yes | Resources: https://github.com/aistairc/VDAct
Open Datasets | Yes | "This paper presents VDAct, a dataset for a Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task." ... Resources: https://github.com/aistairc/VDAct
Dataset Splits | Yes | Data Splits: "Our dataset comprises 3,000 dialogues created from 1,000 scenarios, with each scenario created by three pairs of annotators. To prevent dialogues based on identical scenarios from occurring across the training and validation/test sets, the dataset is split at the scenario level. We set the fractions of the train, test, and validation sets to 80% (2,400 dialogues), 15% (450), and 5% (150), respectively."
Hardware Specification | No | The paper mentions using 'state-of-the-art large-scale vision-language foundation models' and the proprietary models GPT-4o and Gemini-1.5-pro, and discusses 'fine-tuning with LoRA', but it does not specify any concrete hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using specific models like 'GPT-4o-mini' for LLM-based evaluation metrics and 'Video-LLaVA', 'Video-ChatGPT', and 'VideoLLaMA2' as baselines. However, it does not provide version numbers for any underlying software libraries (e.g., PyTorch, TensorFlow) or programming languages (e.g., Python).
Experiment Setup | No | The paper states 'For fine-tuning with LoRA and model inferences, we report the parameter settings on the supplementary material.' and mentions 'extracting a fixed number of frames at regular intervals', but does not include specific hyperparameters (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text.
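The scenario-level split described under Dataset Splits can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `scenario_id` field, the `scenario_level_split` function, and the random seed are all hypothetical, but the idea matches the paper's description (1,000 scenarios, three dialogues each, partitioned by scenario so no scenario leaks across splits).

```python
# Illustrative sketch of a scenario-level 80/15/5 split (hypothetical names,
# not the authors' implementation). Dialogues sharing a scenario always land
# in the same split, preventing train/eval leakage.
import random

def scenario_level_split(dialogues, train_frac=0.80, test_frac=0.15, seed=0):
    """dialogues: list of dicts, each with a 'scenario_id' key."""
    scenario_ids = sorted({d["scenario_id"] for d in dialogues})
    random.Random(seed).shuffle(scenario_ids)
    n = len(scenario_ids)
    n_train = round(n * train_frac)
    n_test = round(n * test_frac)
    train_ids = set(scenario_ids[:n_train])
    test_ids = set(scenario_ids[n_train:n_train + n_test])
    split = {"train": [], "test": [], "val": []}
    for d in dialogues:
        if d["scenario_id"] in train_ids:
            split["train"].append(d)
        elif d["scenario_id"] in test_ids:
            split["test"].append(d)
        else:
            split["val"].append(d)
    return split

# 1,000 scenarios x 3 dialogues each, as described in the paper
dialogues = [{"scenario_id": s, "dialogue_id": f"{s}-{k}"}
             for s in range(1000) for k in range(3)]
splits = scenario_level_split(dialogues)
print(len(splits["train"]), len(splits["test"]), len(splits["val"]))
# 2400 450 150
```

Because every scenario contributes exactly three dialogues, splitting 800/150/50 scenarios yields the reported 2,400/450/150 dialogue counts.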
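The Experiment Setup row quotes the paper's mention of 'extracting a fixed number of frames at regular intervals'. A minimal sketch of that sampling scheme, assuming frame indices as input (the function name and the midpoint-of-segment choice are my own illustration, not the paper's specified procedure):

```python
# Sketch of uniform frame sampling: pick num_samples frame indices spaced
# evenly across the video (hypothetical helper, not the authors' code).
def uniform_frame_indices(total_frames, num_samples):
    """Return num_samples frame indices spaced evenly across the video."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # take the midpoint of each of the num_samples equal-length segments
    return [int(step * (i + 0.5)) for i in range(num_samples)]

print(uniform_frame_indices(300, 8))
# [18, 56, 93, 131, 168, 206, 243, 281]
```

Sampling segment midpoints rather than endpoints avoids biasing the selection toward the very first and last frames of the clip.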