Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models
Authors: Jisheng Dang, Ligen Chen, Jingze Wu, Ronghao Lin, Bimei Wang, Yun Wang, Liting Wang, Nannan Zhu, Teng Wang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark datasets demonstrate our approach's state-of-the-art performance across multiple video understanding tasks. These results establish diffusion models as a powerful tool for enhancing multimodal video models in complex, dynamic scenarios. Experiments demonstrate that Diff-LMM achieves state-of-the-art performance across multiple long video understanding benchmarks. Ablation analysis further confirms that visual representations from pre-trained diffusion models, such as DiT, offer a positive effect in fine-grained tasks within long video scenarios. |
| Researcher Affiliation | Academia | 1 Sun Yat-sen University, Guangdong, China 2 Lanzhou University, Gansu, China 3 National University of Singapore, Singapore 4 Jinan University, Guangdong, China 5 City University of Hong Kong, China 6 Northwest Normal University, Gansu, China 7 University of Hong Kong, China |
| Pseudocode | No | The paper describes the methodology in text and with a system overview diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code, nor does it include any links to a code repository. |
| Open Datasets | Yes | We conduct experiments on the LVU dataset [Wu and Krahenbuhl, 2021], which contains approximately 30K video clips from 3K movies, each with a duration ranging from 1 to 3 minutes. [...] To further compare with existing multimodal methods, we also extend the evaluation to the MSVD-QA [Xu et al., 2017], a standard open-ended video question-answering dataset, consisting of short videos lasting 10-15 seconds. |
| Dataset Splits | No | The paper refers to using benchmark datasets like LVU [Wu and Krahenbuhl, 2021] and MSVD-QA [Xu et al., 2017] for evaluation, implying standard splits are used. However, it does not explicitly provide specific percentages, sample counts, or detailed methodologies for training, validation, and test splits within the main text. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the Vicuna-7B model, AdamW optimizer, and DiT diffusion model. However, it does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | This study utilizes the Vicuna-7B model as the LLM. It is trained over 20 epochs with a learning rate of 1×10⁻⁴ and a batch size of 64. The AdamW optimizer is applied with β1 = 0.9 and β2 = 0.999 for the hyperparameters, and a weight decay of 0.05. Input images are resized to 224×224 pixels. As for the diffusion model, we follow REPA [Yu et al., 2024] in using the eighth-layer output of the diffusion model (DiT-XL/2, 256×256) as the supervisory signal y. The input images of the diffusion model are resized to 256×256 pixels. We assign a class label of 1000 (no class) and initialize the timestep to 0 (no noise) before inputting the images into the pre-trained, frozen DiT. On the LVU dataset [Wu and Krahenbuhl, 2021], we set λ to 1, while on the MSVD-QA dataset [Xu et al., 2017], λ is set to 0.0005. During the decoding phase, a beam search width of 5 is employed. The frame length is 100, while the memory bank length is set to 20. For Language-Video Understanding (LVU) tasks, the prompt format is "What is the \<task\> of the movie?", where \<task\> is one of relationship, speaking style, scene, director, genre, writer, and release year. For evaluation, we choose the widely used top-1 accuracy as the primary metric. |
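The frozen-teacher recipe described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `DummyDiT`, its `features` helper, the tensor shapes, and the MSE alignment term are assumptions standing in for the real DiT-XL/2 backbone and the paper's loss; the class label 1000, timestep 0, eighth-layer tap, AdamW hyperparameters, and λ values are taken from the paper.

```python
import torch
from torch import nn

NULL_CLASS = 1000   # "no class" label from the paper (unconditional)
LAYER_TAP = 8       # eighth-layer output used as the supervisory signal

# Illustrative stand-in for the frozen DiT-XL/2 teacher; the real backbone
# comes from the DiT codebase, and the shapes here are arbitrary.
class DummyDiT(nn.Module):
    def __init__(self, depth=12, dim=64):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def features(self, x, t, y, layer):
        # t (timestep) and y (class label) would condition a real DiT;
        # this stand-in ignores them and just taps the requested block.
        for i, blk in enumerate(self.blocks, start=1):
            x = torch.relu(blk(x))
            if i == layer:
                return x
        return x

teacher = DummyDiT().eval()
for p in teacher.parameters():          # teacher stays frozen
    p.requires_grad_(False)

x = torch.randn(2, 16, 64)              # stand-in for 256x256 image tokens
t = torch.zeros(2, dtype=torch.long)    # timestep 0 -> no noise added
y = torch.full((2,), NULL_CLASS)        # class 1000 -> unconditional
with torch.no_grad():
    target = teacher.features(x, t, y, LAYER_TAP)  # supervisory signal

# Student-side optimizer with the paper's reported hyperparameters.
student = nn.Linear(64, 64)
optimizer = torch.optim.AdamW(
    student.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.05
)

# The alignment term is weighted by lambda (1 on LVU, 0.0005 on MSVD-QA)
# before being added to the language-modeling loss; MSE is an assumed proxy.
lam = 1.0
align_loss = lam * nn.functional.mse_loss(student(x), target)
```

Only the frozen-teacher mechanics are shown; the Vicuna-7B language-modeling objective and the memory bank are omitted for brevity.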