TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding
Authors: Zhanyu Wang, Chen Tang, Haoyu He, Kuan Feng, Chao Wang, Bingni Zhang, Xiaolei Xu, Shen Wang, Luping Zhou
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both variants achieve state-of-the-art or competitive results on a wide range of image and video benchmarks while markedly improving token efficiency. Code is publicly available at: https://github.com/wang-zhanyu/TempFlex. [...] 4 Experiments |
| Researcher Affiliation | Collaboration | Zhanyu Wang EMAIL ByteDance [...] Luping Zhou EMAIL The University of Sydney |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology for Temporal Fiber Fusion is described using mathematical equations and textual explanations in Section 3.2, but not in a structured pseudocode format. |
| Open Source Code | Yes | Code is publicly available at: https://github.com/wang-zhanyu/TempFlex. [...] with all code, models, and data to be released publicly. |
| Open Datasets | Yes | To support large-scale video-language pretraining, we curate TempFlex-2M, a high-quality synthetic video-text corpus generated in a single stage via GPT-4o with direct visual prompting. [...] Our resulting dataset, TempFlex-2M, comprises 210K curated and de-duplicated videos sourced from FineVideo (Farré et al., 2024), OpenVid-1M (Nan et al., 2024), Vatex (Wang et al., 2019), and Vript (Yang et al., 2024). [...] with all code, models, and data to be released publicly. |
| Dataset Splits | No | No explicit training/validation/test splits are provided for the combined training corpus or the authors' own TempFlex-2M dataset. The paper details the composition of training data for each stage and evaluates on existing benchmarks, but does not specify how its own data is partitioned into train/validation/test sets. |
| Hardware Specification | Yes | All models are trained on 128 NVIDIA H100 GPUs with 80GB VRAM in four distinct stages: |
| Software Dependencies | No | No specific version numbers for software dependencies like Python, PyTorch, or CUDA are provided. The paper mentions training tools like DeepSpeed ZeRO-2 and FlashAttention, and the evaluation framework VLMEvalKit, but without versioning information. |
| Experiment Setup | Yes | Stage 1: Visual-Language Alignment and Encoder Adaptation. We jointly fine-tune the visual encoder and the MLP projector, setting the learning rates to 5e-5 and 1e-3, respectively. [...] Stage 4: Video Instruction Fine-Tuning. All model parameters are updated in this stage. The visual encoder is trained with a learning rate of 2e-6, and the rest of the model with 1e-5. [...] Stage 4 is trained for 2 epochs with a batch size of 256, while all other stages use a batch size of 512 and are trained for 1 epoch. We employ a cosine learning rate scheduler with a warmup ratio of 0.03. |
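The schedule quoted above (cosine decay with a warmup ratio of 0.03) can be sketched in plain Python. This is an illustrative reconstruction, not the paper's code: the linear-warmup shape, the decay-to-zero floor, and the 1000-step horizon in the example are assumptions; only the warmup ratio and the stage learning rates (e.g. 2e-6 for the Stage 4 visual encoder) come from the report.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    """Cosine learning-rate schedule with linear warmup.

    Sketch of the schedule described in the report (warmup ratio 0.03,
    cosine decay). Warmup shape and zero final LR are assumptions.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr.
        return base_lr * (step + 1) / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: Stage 4 visual-encoder LR (2e-6) over a hypothetical 1000 steps.
peak_lr = max(lr_at_step(s, 1000, 2e-6) for s in range(1000))
```

Frameworks such as DeepSpeed or Hugging Face `transformers` provide equivalent built-in schedulers; the point here is only to make the quoted hyperparameters concrete.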