VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
Authors: Muhammet Furkan Ilaslan, Ali Köksal, Kevin Qinghong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset, showing superior performance in the zero-shot setting compared to several baselines. |
| Researcher Affiliation | Academia | 1Show Lab, National University of Singapore, Singapore; 2Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore |
| Pseudocode | No | The paper describes the methodology and prompts used, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Dataset: https://github.com/mfurkanilaslan/VG-TVP. This link is explicitly for the dataset, not for the source code of the methodology described in the paper. |
| Open Datasets | Yes | To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences... Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset. Dataset https://github.com/mfurkanilaslan/VG-TVP |
| Dataset Splits | Yes | We use a Win-Tie-Lose comparison on 50 seen and 15 unseen tasks, involving 28 human subjects for benchmarking. VG-TVP generated 2,504 videos for seen and 687 for unseen tasks, while baselines produced 2,701 and 681 vanilla textual videos, respectively. The Daily-PP consists of 5 domains (Breakfast, Dinner, Drink, Hobby&Crafts, and Home&Garage), 50 seen tasks, and 15 unseen tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'ChatGPT 3.5' and 'ChatGPT 4o', and also references the 'VLog model (Lin and Lei 2023)', 'BLIP2 (Li et al. 2023)', 'GRIT (Du, Rush, and Cardie 2021)', 'Whisper (Radford et al. 2023)', and 'ModelScope (Wang et al. 2023b)'. While these are specific models/tools, explicit version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch) are not provided. |
| Experiment Setup | No | The paper describes the overall VG-TVP approach and mentions leveraging the 'zero-shot reasoning ability of LLMs' and a 'step-by-step prompting template', but it does not provide specific numerical hyperparameters such as learning rates, batch sizes, or numbers of epochs for training any component. |