A Video-grounded Dialogue Dataset and Metric for Event-driven Activities
Authors: Wiradee Imrattanatrai, Masaki Asada, Kimihiro Hasegawa, Zhi-Qi Cheng, Ken Fukuda, Teruko Mitamura
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns. |
| Researcher Affiliation | Academia | ¹National Institute of Advanced Industrial Science and Technology (AIST); ²Language Technologies Institute, Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and processes, but does not include any clearly labeled pseudocode or algorithm blocks. The steps for data collection and evaluation are described in paragraph form. |
| Open Source Code | Yes | Resources https://github.com/aistairc/VDAct |
| Open Datasets | Yes | This paper presents VDAct, a dataset for a Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. ... Resources https://github.com/aistairc/VDAct |
| Dataset Splits | Yes | Data Splits Our dataset comprises 3,000 dialogues created from 1,000 scenarios, with each scenario created by three pairs of annotators. To prevent the occurrence of dialogues based on identical scenarios across training and validation/test sets, the dataset is split at the scenario level. We set the fraction of the train, test, and validation as 80% (2,400 dialogues), 15% (450), and 5% (150), respectively. |
| Hardware Specification | No | The paper mentions using 'state-of-the-art large-scale vision-language foundation models' and 'proprietary models GPT-4o and Gemini-1.5-pro' and discusses 'fine-tuning with LoRA', but it does not specify any concrete hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific models like 'GPT-4o-mini' for LLM-based evaluation metrics and 'Video-LLaVA', 'Video-ChatGPT', and 'VideoLLaMA2' as baselines. However, it does not provide specific version numbers for any underlying software libraries (e.g., PyTorch, TensorFlow) or programming languages (e.g., Python). |
| Experiment Setup | No | The paper states 'For fine-tuning with LoRA and model inferences, we report the parameter settings on the supplementary material.' and mentions 'extracting a fixed number of frames at regular intervals.' but does not include specific hyperparameters (e.g., learning rate, batch size, number of epochs) or detailed training configurations within the main text. |
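The scenario-level split described under Dataset Splits (80%/15%/5% of 1,000 scenarios, three dialogues per scenario, with no scenario shared across splits) can be sketched as follows. This is a hedged illustration, not the authors' released code; the pair-based data layout and the `seed` parameter are assumptions for the example.

```python
import random

def scenario_level_split(dialogues, seed=0):
    """Split (scenario_id, dialogue) pairs so that no scenario
    contributes dialogues to more than one split.

    Returns a dict with 'train' (80% of scenarios),
    'test' (15%), and 'val' (remaining 5%).
    """
    scenario_ids = sorted({sid for sid, _ in dialogues})
    rng = random.Random(seed)
    rng.shuffle(scenario_ids)

    n = len(scenario_ids)
    n_train = int(0.80 * n)
    n_test = int(0.15 * n)

    train_ids = set(scenario_ids[:n_train])
    test_ids = set(scenario_ids[n_train:n_train + n_test])

    splits = {"train": [], "test": [], "val": []}
    for sid, dialogue in dialogues:
        if sid in train_ids:
            splits["train"].append(dialogue)
        elif sid in test_ids:
            splits["test"].append(dialogue)
        else:
            splits["val"].append(dialogue)
    return splits
```

With 1,000 scenarios and three dialogues each, this yields the reported 2,400/450/150 dialogue counts, and the leakage guard is structural: membership is decided per scenario, never per dialogue.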
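The Experiment Setup row quotes the paper's 'extracting a fixed number of frames at regular intervals.' A common way to realize this is uniform index sampling over the video length; the sketch below is an assumption about how that sampling might look, since the paper defers the exact settings to supplementary material.

```python
def sample_frame_indices(num_frames, k):
    """Pick k frame indices spaced at regular intervals
    across a video of num_frames frames, always including
    the first and last frame when k > 1."""
    if k >= num_frames:
        return list(range(num_frames))
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]
```

For example, sampling 8 frames from a 100-frame clip selects indices roughly every 14 frames, from frame 0 through frame 99.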