VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
Authors: Muhammet Furkan Ilaslan, Ali Köksal, Kevin Qinghong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset, showing superior performance in the zero-shot setting compared to several baselines. |
| Researcher Affiliation | Academia | 1Show Lab, National University of Singapore, Singapore; 2Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore |
| Pseudocode | No | The paper describes the methodology and prompts used, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Dataset: https://github.com/mfurkanilaslan/VG-TVP. This link is explicitly for the dataset, not for the source code of the methodology described in the paper. |
| Open Datasets | Yes | To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences... Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset. Dataset https://github.com/mfurkanilaslan/VG-TVP |
| Dataset Splits | Yes | We use a Win-Tie-Lose comparison on 50 seen and 15 unseen tasks, involving 28 human subjects for benchmarking. VG-TVP generated 2,504 videos for seen and 687 for unseen tasks, while baselines produced 2,701 and 681 vanilla textual videos, respectively. The Daily-PP consists of 5 domains (Breakfast, Dinner, Drink, Hobby&Crafts, and Home&Garage), 50 seen tasks, and 15 unseen tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'ChatGPT 3.5' and 'ChatGPT 4o', and also references the 'VLog model (Lin and Lei 2023)', 'BLIP2 (Li et al. 2023)', 'GRIT (Du, Rush, and Cardie 2021)', 'Whisper (Radford et al. 2023)', and 'ModelScope (Wang et al. 2023b)'. While these are specific models/tools, explicit version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch) are not provided. |
| Experiment Setup | No | The paper describes the overall VG-TVP approach and mentions leveraging the 'zero-shot reasoning ability of LLMs' and a 'step-by-step prompting template', but it does not provide specific numerical hyperparameters such as learning rates, batch sizes, or numbers of epochs for training any component. |