A Video-grounded Dialogue Dataset and Metric for Event-driven Activities
Authors: Wiradee Imrattanatrai, Masaki Asada, Kimihiro Hasegawa, Zhi-Qi Cheng, Ken Fukuda, Teruko Mitamura
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns. |
| Researcher Affiliation | Academia | ¹National Institute of Advanced Industrial Science and Technology (AIST); ²Language Technologies Institute, Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and processes, but does not include any clearly labeled pseudocode or algorithm blocks. The steps for data collection and evaluation are described in paragraph form. |
| Open Source Code | Yes | Resources https://github.com/aistairc/VDAct |
| Open Datasets | Yes | This paper presents VDAct, a dataset for a Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. ... Resources https://github.com/aistairc/VDAct |
| Dataset Splits | Yes | Data Splits Our dataset comprises 3,000 dialogues created from 1,000 scenarios, with each scenario created by three pairs of annotators. To prevent the occurrence of dialogues based on identical scenarios across training and validation/test sets, the dataset is split at the scenario level. We set the fraction of the train, test, and validation as 80% (2,400 dialogues), 15% (450), and 5% (150), respectively. |
| Hardware Specification | No | The paper mentions using 'state-of-the-art large-scale vision-language foundation models' and 'proprietary models GPT-4o and Gemini-1.5-pro' and discusses 'fine-tuning with LoRA', but it does not specify any concrete hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific models like 'GPT-4o-mini' for LLM-based evaluation metrics and 'Video-LLaVA', 'Video-ChatGPT', and 'VideoLLaMA2' as baselines. However, it does not provide specific version numbers for any underlying software libraries (e.g., PyTorch, TensorFlow) or programming languages (e.g., Python). |
| Experiment Setup | No | The paper states 'For fine-tuning with LoRA and model inferences, we report the parameter settings on the supplementary material.' and mentions 'extracting a fixed number of frames at regular intervals.' but does not include specific hyperparameters (e.g., learning rate, batch size, number of epochs) or detailed training configurations within the main text. |
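The scenario-level split described under Dataset Splits (80%/15%/5% of 1,000 scenarios, three dialogues per scenario, with no scenario shared across splits) can be sketched as follows. This is a hedged illustration, not the authors' released code; the pair-based data layout and the `seed` parameter are assumptions for the example.

```python
import random

def scenario_level_split(dialogues, seed=0):
    """Split (scenario_id, dialogue) pairs so that no scenario
    contributes dialogues to more than one split.

    Returns a dict with 'train' (80% of scenarios),
    'test' (15%), and 'val' (remaining 5%).
    """
    scenario_ids = sorted({sid for sid, _ in dialogues})
    rng = random.Random(seed)
    rng.shuffle(scenario_ids)

    n = len(scenario_ids)
    n_train = int(0.80 * n)
    n_test = int(0.15 * n)

    train_ids = set(scenario_ids[:n_train])
    test_ids = set(scenario_ids[n_train:n_train + n_test])

    splits = {"train": [], "test": [], "val": []}
    for sid, dialogue in dialogues:
        if sid in train_ids:
            splits["train"].append(dialogue)
        elif sid in test_ids:
            splits["test"].append(dialogue)
        else:
            splits["val"].append(dialogue)
    return splits
```

With 1,000 scenarios and three dialogues each, this yields the reported 2,400/450/150 dialogue counts, and the leakage guard is structural: membership is decided per scenario, never per dialogue.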
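The Experiment Setup row quotes the paper's 'extracting a fixed number of frames at regular intervals.' A common way to realize this is uniform index sampling over the video length; the sketch below is an assumption about how that sampling might look, since the paper defers the exact settings to supplementary material.

```python
def sample_frame_indices(num_frames, k):
    """Pick k frame indices spaced at regular intervals
    across a video of num_frames frames, always including
    the first and last frame when k > 1."""
    if k >= num_frames:
        return list(range(num_frames))
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]
```

For example, sampling 8 frames from a 100-frame clip selects indices roughly every 14 frames, from frame 0 through frame 99.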