Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models

Authors: Jisheng Dang, Ligen Chen, Jingze Wu, Ronghao Lin, Bimei Wang, Yun Wang, Liting Wang, Nannan Zhu, Teng Wang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on benchmark datasets demonstrate our approach's state-of-the-art performance across multiple video understanding tasks. These results establish diffusion models as a powerful tool for enhancing multimodal video models in complex, dynamic scenarios." Experiments demonstrate that Diff-LMM achieves state-of-the-art performance across multiple long video understanding benchmarks. Ablation analysis further confirms that visual representations from pre-trained diffusion models, such as DiT, offer a positive effect in fine-grained tasks within long video scenarios.
Researcher Affiliation | Academia | Sun Yat-sen University, Guangdong, China; Lanzhou University, Gansu, China; National University of Singapore, Singapore; Jinan University, Guangdong, China; City University of Hong Kong, China; Northwest Normal University, Gansu, China; University of Hong Kong, China
Pseudocode | No | The paper describes the methodology in text and with a system overview diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about the release of source code, nor does it include any links to a code repository.
Open Datasets | Yes | "We conduct experiments on the LVU dataset [Wu and Krahenbuhl, 2021], which contains approximately 30K video clips from 3K movies, each with a duration ranging from 1 to 3 minutes. [...] To further compare with existing multimodal methods, we also extend the evaluation to the MSVD-QA [Xu et al., 2017], a standard open-ended video question-answering dataset, consisting of short videos lasting 10-15 seconds."
Dataset Splits | No | The paper refers to using benchmark datasets such as LVU [Wu and Krahenbuhl, 2021] and MSVD-QA [Xu et al., 2017] for evaluation, implying standard splits are used. However, it does not explicitly provide specific percentages, sample counts, or detailed methodologies for the training, validation, and test splits in the main text.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the Vicuna-7B model, the AdamW optimizer, and the DiT diffusion model. However, it does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | This study utilizes the Vicuna-7B model as the LLM. It is trained over 20 epochs with a learning rate of 1×10⁻⁴ and a batch size of 64. The AdamW optimizer is applied with hyperparameters β1 = 0.9 and β2 = 0.999, and a weight decay of 0.05. Input images are resized to 224×224 pixels. As for the diffusion model, we follow REPA [Yu et al., 2024] in using the eighth-layer output of the diffusion model (DiT-XL/2, 256×256) as the supervisory signal y. The input images for the diffusion model are resized to 256×256 pixels. We assign a class label of 1000 (no class) and initialize the timestep to 0 (no noise) before inputting the images into the pre-trained, frozen DiT. On the LVU dataset [Wu and Krahenbuhl, 2021], we set λ to 1, while on the MSVD-QA dataset [Xu et al., 2017], λ is set to 0.0005. During the decoding phase, a beam search width of 5 is employed. The frame length is 100, while the memory bank length is set to 20. For long-form video understanding (LVU) tasks, the prompt format is "What is the <task> of the movie?", where the task is one of: relationship, speaking style, scene, director, genre, writer, and release year. For evaluation, we choose the widely used top-1 accuracy as the primary metric.
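The setup above implies a combined training objective: the usual language-modeling loss plus a λ-weighted term aligning the video model's visual features with the frozen DiT teacher's eighth-layer features (λ = 1 on LVU, λ = 0.0005 on MSVD-QA). The paper's exact alignment loss is not reproduced here; the sketch below assumes a REPA-style negative mean cosine similarity, and the function names (`cosine_alignment_loss`, `total_loss`) are illustrative, not from the paper.

```python
import math

def cosine_alignment_loss(student_feats, teacher_feats):
    """Negative mean cosine similarity between the LMM's projected visual
    features and the frozen DiT teacher's layer-8 features (one vector per
    patch/frame). Assumed REPA-style form, not the paper's exact loss."""
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        total += dot / (norm_s * norm_t)
    return -total / len(student_feats)

def total_loss(lm_loss, student_feats, teacher_feats, lam):
    """Combined objective: LM loss plus lambda-weighted alignment term,
    with lambda = 1 on LVU and lambda = 0.0005 on MSVD-QA per the setup."""
    return lm_loss + lam * cosine_alignment_loss(student_feats, teacher_feats)
```

For perfectly aligned unit features the alignment term contributes -λ, e.g. `total_loss(2.0, [[1.0, 0.0]], [[1.0, 0.0]], 1.0)` evaluates to 1.0.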