IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Authors: Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo

ICLR 2025

Reproducibility Variables (Result and LLM Response for each)
Research Type: Experimental — Furthermore, our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to possess multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.
Researcher Affiliation: Collaboration — Yatai Ji1,2, Shilong Zhang1, Jie Wu3, Peize Sun1, Weifeng Chen3, Xuefeng Xiao3, Sidi Yang2, Yujiu Yang2, Ping Luo1. 1The University of Hong Kong, 2Tsinghua University, 3ByteDance.
Pseudocode: No — The paper describes methods and processes in narrative text and figures but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No — The paper mentions testing 'open-source models' and proposes a new model (IDA-VLM) and benchmark (MM-ID), but it does not provide any explicit statement about releasing its own source code or a link to a code repository.
Open Datasets: Yes — The initial phase leverages annotations in datasets such as VCR (Zellers et al., 2019), Flickr30k (Plummer et al., 2017), and RefCOCO (Kazemzadeh et al., 2014) with our data configuration strategies... The subsequent phase utilizes MovieNet (Huang et al., 2020) to generate Q&A and caption instruction tuning data with GPT-4V (OpenAI, 2023b)... This work utilized the MovieNet dataset, which is publicly available under an open-source license for academic research.
Dataset Splits: Yes — The first-stage tuning data contains approximately 80,000 samples, while the second stage comprises around 60,000 samples... MM-ID comprises a collection of 585 diverse testing samples.
Hardware Specification: No — The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies: Yes — We adopt Qwen-VL-Chat (Bai et al., 2023) as our baseline model... For Q&A and caption tasks, we adopt GPT-4V to convert annotations from the MovieNet dataset... GPT-4 is used to score the results (version: gpt-4-1106-preview).
Experiment Setup: Yes — During IDA-VLM training, the learning rate is set to 1e-5 for the first stage and 5e-6 for the second. The model is trained for 5 epochs in both stages. The batch size with gradient accumulation is set to 128. The visual encoder is frozen, while the ID-Former and LLM are fine-tuned.
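The reported two-stage schedule can be captured in a small configuration sketch. This is illustrative only: the field names and the micro-batch/accumulation split are assumptions, since the paper states only the effective batch size of 128, not how it is divided across devices and accumulation steps.

```python
# Hyperparameters as reported for IDA-VLM training; the structure and
# field names here are illustrative, not taken from the authors' code.
from dataclasses import dataclass

@dataclass
class StageConfig:
    learning_rate: float
    epochs: int
    micro_batch_size: int   # assumed per-step batch (not stated in the paper)
    grad_accum_steps: int   # chosen so the effective batch size is 128

    @property
    def effective_batch_size(self) -> int:
        # Effective batch = per-step batch x gradient-accumulation steps
        return self.micro_batch_size * self.grad_accum_steps

# Stage 1 (~80k samples): lr 1e-5; Stage 2 (~60k samples): lr 5e-6.
stage1 = StageConfig(learning_rate=1e-5, epochs=5,
                     micro_batch_size=16, grad_accum_steps=8)
stage2 = StageConfig(learning_rate=5e-6, epochs=5,
                     micro_batch_size=16, grad_accum_steps=8)

# Per the paper, only ID-Former and the LLM are fine-tuned;
# the visual encoder stays frozen throughout both stages.
trainable_modules = {"id_former": True, "llm": True, "visual_encoder": False}
```

The frozen-encoder split matters for reproduction attempts: optimizer state should only be allocated for the ID-Former and LLM parameters, which also lowers the memory needed to match the reported effective batch size.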