IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Authors: Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Furthermore, our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to possess multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies. |
| Researcher Affiliation | Collaboration | Yatai Ji1,2, Shilong Zhang1, Jie Wu3, Peize Sun1, Weifeng Chen3, Xuefeng Xiao3, Sidi Yang2, Yujiu Yang2, Ping Luo1 1The University of Hong Kong, 2Tsinghua University, 3ByteDance |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions testing 'open-source models' and proposes a new model (IDA-VLM) and benchmark (MM-ID), but it does not provide any explicit statement about releasing its own source code or a link to a code repository. |
| Open Datasets | Yes | The initial phase leverages annotations in datasets such as VCR (Zellers et al., 2019), Flickr30k (Plummer et al., 2017), and RefCOCO (Kazemzadeh et al., 2014) with our data configuration strategies... The subsequent phase utilizes MovieNet (Huang et al., 2020) to generate Q&A and caption instruction tuning data with GPT-4V (OpenAI, 2023b)...This work utilized the MovieNet dataset, which is publicly available under an open-source license for academic research. |
| Dataset Splits | Yes | The first stage tuning data contains approximately 80,000 samples, while the second one comprises around 60,000 samples...MM-ID comprises a collection of 585 diverse testing samples |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | Yes | We adopt Qwen-VL-Chat (Bai et al., 2023) as our baseline model...For Q&A and caption tasks, we adopt GPT-4V to convert annotations from the MovieNet dataset...GPT-4 is used to score the results...version: gpt-4-1106-preview |
| Experiment Setup | Yes | During IDA-VLM training, we set learning rate as 1e-5 for the first stage and 5e-6 for the second stage. The model is trained for 5 epochs in both the first and second stages. The batch size with gradient accumulation is set to 128. The visual encoder is fixed, while ID-Former and LLM are fine-tuned. |
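The reported hyperparameters can be collected into a small configuration sketch. This is purely illustrative, assuming a two-stage fine-tuning schedule as quoted above; the names (`StageConfig`, `TRAINABLE`, `stage_lr`) are hypothetical, since the authors' training code is not released.

```python
from dataclasses import dataclass

# Hypothetical sketch of the two-stage IDA-VLM fine-tuning setup quoted in
# the paper: lr 1e-5 (stage 1) / 5e-6 (stage 2), 5 epochs per stage, and an
# effective batch size of 128 via gradient accumulation. All identifiers here
# are illustrative, not from the (unreleased) official implementation.

@dataclass(frozen=True)
class StageConfig:
    learning_rate: float
    epochs: int
    effective_batch_size: int  # batch size after gradient accumulation

STAGES = {
    1: StageConfig(learning_rate=1e-5, epochs=5, effective_batch_size=128),
    2: StageConfig(learning_rate=5e-6, epochs=5, effective_batch_size=128),
}

# Per the setup quote: visual encoder frozen; ID-Former and LLM fine-tuned.
TRAINABLE = {"visual_encoder": False, "id_former": True, "llm": True}

def stage_lr(stage: int) -> float:
    """Return the learning rate for the given training stage (1 or 2)."""
    return STAGES[stage].learning_rate
```

A trainer built on this sketch would read `TRAINABLE` to decide which parameter groups receive gradients and `stage_lr` to set the optimizer's learning rate at the start of each stage.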