IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Authors: Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo

ICLR 2025

Reproducibility Variables (Result and LLM Response for each)
Research Type: Experimental — Furthermore, our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to possess multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.
Researcher Affiliation: Collaboration — Yatai Ji1,2, Shilong Zhang1, Jie Wu3, Peize Sun1, Weifeng Chen3, Xuefeng Xiao3, Sidi Yang2, Yujiu Yang2, Ping Luo1. 1The University of Hong Kong, 2Tsinghua University, 3ByteDance.
Pseudocode: No — The paper describes methods and processes in narrative text and figures but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No — The paper mentions testing 'open-source models' and proposes a new model (IDA-VLM) and benchmark (MM-ID), but it does not provide any explicit statement about releasing its own source code or a link to a code repository.
Open Datasets: Yes — The initial phase leverages annotations in datasets such as VCR (Zellers et al., 2019), Flickr30k (Plummer et al., 2017), and RefCOCO (Kazemzadeh et al., 2014) with our data configuration strategies... The subsequent phase utilizes MovieNet (Huang et al., 2020) to generate Q&A and caption instruction tuning data with GPT-4V (OpenAI, 2023b)... This work utilized the MovieNet dataset, which is publicly available under an open-source license for academic research.
Dataset Splits: Yes — The first-stage tuning data contains approximately 80,000 samples, while the second stage comprises around 60,000 samples... MM-ID comprises a collection of 585 diverse testing samples.
Hardware Specification: No — The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies: Yes — We adopt Qwen-VL-Chat (Bai et al., 2023) as our baseline model... For Q&A and caption tasks, we adopt GPT-4V to convert annotations from the MovieNet dataset... GPT-4 is used to score the results (version: gpt-4-1106-preview).
Experiment Setup: Yes — During IDA-VLM training, the learning rate is set to 1e-5 for the first stage and 5e-6 for the second. The model is trained for 5 epochs in both stages. The batch size with gradient accumulation is set to 128. The visual encoder is frozen, while the ID-Former and LLM are fine-tuned.
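The reported two-stage schedule can be captured in a small configuration sketch. This is illustrative only: the field names and the micro-batch/accumulation split are assumptions, since the paper states only the effective batch size of 128, not how it is divided across devices and accumulation steps.

```python
# Hyperparameters as reported for IDA-VLM training; the structure and
# field names here are illustrative, not taken from the authors' code.
from dataclasses import dataclass

@dataclass
class StageConfig:
    learning_rate: float
    epochs: int
    micro_batch_size: int   # assumed per-step batch (not stated in the paper)
    grad_accum_steps: int   # chosen so the effective batch size is 128

    @property
    def effective_batch_size(self) -> int:
        # Effective batch = per-step batch x gradient-accumulation steps
        return self.micro_batch_size * self.grad_accum_steps

# Stage 1 (~80k samples): lr 1e-5; Stage 2 (~60k samples): lr 5e-6.
stage1 = StageConfig(learning_rate=1e-5, epochs=5,
                     micro_batch_size=16, grad_accum_steps=8)
stage2 = StageConfig(learning_rate=5e-6, epochs=5,
                     micro_batch_size=16, grad_accum_steps=8)

# Per the paper, only ID-Former and the LLM are fine-tuned;
# the visual encoder stays frozen throughout both stages.
trainable_modules = {"id_former": True, "llm": True, "visual_encoder": False}
```

The frozen-encoder split matters for reproduction attempts: optimizer state should only be allocated for the ID-Former and LLM parameters, which also lowers the memory needed to match the reported effective batch size.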