Unhackable Temporal Reward for Scalable Video MLLMs
Authors: En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments reveal that UTR not only counters temporal hacking but also significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development. Project page: https://Ahnsun.github.io/UTR/. |
| Researcher Affiliation | Collaboration | 1 Huazhong University of Science and Technology; 2 Beijing University of Posts and Telecommunications; 3 StepFun; 4 Johns Hopkins University; 5 University of Chinese Academy of Sciences |
| Pseudocode | No | The paper describes the methodology using text and diagrams (Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://Ahnsun.github.io/UTR/. |
| Open Datasets | Yes | Datasets. We primarily construct UTR-Data using several existing open-source video datasets, namely HowTo100M (Miech et al., 2019), MeViS (Ding et al., 2023), and LaMOT (Li et al., 2024e). |
| Dataset Splits | Yes | Using the standard MLLM evaluation framework and the LMMs-Eval tool (Zhang et al., 2024a), we assessed major image and video understanding tasks. Results are shown in Tables 1 and 2. For video understanding, we focused on three general benchmarks: MVBench (Li et al., 2024c), TempCompass (Liu et al., 2024c), and Video-MME (Fu et al., 2024), as well as four video QA benchmarks: MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2016), TGIF-QA (Jang et al., 2017), and ActivityNet-QA (Caba Heilbron et al., 2015). (See the evaluation sketch after the table.) |
| Hardware Specification | Yes | Machine: 64 × NVIDIA Tesla A800 80 GB GPUs |
| Software Dependencies | No | The paper mentions specific models and frameworks used (e.g., LLaVA-NeXT-Video, SigLIP-L, Qwen2) but does not provide specific version numbers for general software libraries or programming languages (e.g., Python, PyTorch version). |
| Experiment Setup | Yes | Table 9: Training hyperparameters of Video-UTR. A hyperparameter placed in the middle column indicates that it is used in both stages. (A hypothetical sketch of this two-stage layout follows the table.) |
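
The Dataset Splits row cites the LMMs-Eval harness for the reported benchmarks. As a rough illustration of how such an evaluation could be re-run, the Python sketch below shells out to the harness's command-line interface; the checkpoint path, model adapter, task identifiers, and process count are assumptions for illustration, not values confirmed by the paper (no code is released per the Open Source Code row).

```python
# Hypothetical re-run of the cited video benchmarks via the LMMs-Eval
# harness (https://github.com/EvolvingLMMs-Lab/lmms-eval).
import subprocess

CHECKPOINT = "path/to/video-utr-checkpoint"      # hypothetical; weights not released
TASKS = ["mvbench", "tempcompass", "videomme"]   # assumed LMMs-Eval task names

cmd = [
    "accelerate", "launch", "--num_processes=8",  # assumed 8-GPU launch
    "-m", "lmms_eval",
    "--model", "llava",                           # assumed model adapter
    "--model_args", f"pretrained={CHECKPOINT}",
    "--tasks", ",".join(TASKS),
    "--batch_size", "1",
    "--log_samples",
    "--output_path", "./logs/",
]
subprocess.run(cmd, check=True)
```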
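
The Experiment Setup row quotes the Table 9 caption, under which a hyperparameter "placed in the middle" applies to both training stages. A minimal sketch of that convention is shown below; the table's actual values are not reproduced here, so every number and name is hypothetical.

```python
# Illustration of the two-stage hyperparameter layout described by the
# Table 9 caption: "shared" entries apply to both stages, while each
# stage also carries its own overrides. All values are hypothetical.
shared = {"optimizer": "AdamW", "warmup_ratio": 0.03}  # used in both stages
stage1 = {**shared, "learning_rate": 1e-3, "trainable": "projector"}
stage2 = {**shared, "learning_rate": 2e-5, "trainable": "full_model"}

for name, cfg in {"stage1": stage1, "stage2": stage2}.items():
    print(name, cfg)
```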