TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding

Authors: Zhanyu Wang, Chen Tang, Haoyu He, Kuan Feng, Chao Wang, Bingni Zhang, Xiaolei Xu, Shen Wang, Luping Zhou

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both variants achieve state-of-the-art or competitive results on a wide range of image and video benchmarks while markedly improving token efficiency. Code is publicly available at: https://github.com/wang-zhanyu/TempFlex. [...] 4 Experiments
Researcher Affiliation | Collaboration | Zhanyu Wang EMAIL ByteDance [...] Luping Zhou EMAIL The University of Sydney
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology for Temporal Fiber Fusion is described in Section 3.2 using mathematical equations and textual explanations, but not in a structured pseudocode format.
Open Source Code | Yes | Code is publicly available at: https://github.com/wang-zhanyu/TempFlex. [...] with all code, models, and data to be released publicly.
Open Datasets | Yes | To support large-scale video-language pretraining, we curate TempFlex-2M, a high-quality synthetic video-text corpus generated in a single stage via GPT-4o with direct visual prompting. [...] Our resulting dataset, TempFlex-2M, comprises 210K curated and de-duplicated videos sourced from FineVideo (Farré et al., 2024), OpenVid-1M (Nan et al., 2024), VATEX (Wang et al., 2019), and Vript (Yang et al., 2024). [...] with all code, models, and data to be released publicly.
Dataset Splits | No | No explicit training/validation/test splits are provided for the combined training corpus or for the authors' own TempFlex-2M dataset. The paper details the composition of the training data for each stage and evaluates on existing benchmarks, but it does not specify how its own data is partitioned, which limits reproducibility in that respect.
Hardware Specification | Yes | All models are trained on 128 NVIDIA H100 GPUs with 80GB VRAM in four distinct stages:
Software Dependencies | No | No version numbers are given for software dependencies such as Python, PyTorch, or CUDA. The paper mentions training tools like DeepSpeed ZeRO-2 and FlashAttention, and the evaluation framework VLMEvalKit, but without versioning information.
Experiment Setup | Yes | Stage 1: Visual-Language Alignment and Encoder Adaptation. We jointly fine-tune the visual encoder and the MLP projector, setting the learning rates to 5e-5 and 1e-3, respectively. [...] Stage 4: Video Instruction Fine-Tuning. All model parameters are updated in this stage. The visual encoder is trained with a learning rate of 2e-6, and the rest of the model with 1e-5. [...] Stage 4 is trained for 2 epochs with a batch size of 256, while all other stages use a batch size of 512 and are trained for 1 epoch. We employ a cosine learning rate scheduler with a warmup ratio of 0.03.
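The quoted setup pins down the schedule precisely enough to sketch: per-stage base learning rates, cosine decay, and a linear warmup covering 3% of total steps. The sketch below is a minimal plain-Python illustration of that schedule; the function name, the decay-to-zero floor, and the linear warmup shape are assumptions on my part, not details taken from the paper.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    """Cosine learning-rate schedule with linear warmup.

    Illustrative sketch: the paper states a cosine scheduler with
    warmup ratio 0.03; the linear warmup and decay-to-zero floor
    are assumed here, not specified in the quoted text.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Per-stage base learning rates quoted in the paper:
stage_lrs = {
    "stage1_visual_encoder": 5e-5,
    "stage1_mlp_projector": 1e-3,
    "stage4_visual_encoder": 2e-6,
    "stage4_rest_of_model": 1e-5,
}
```

In practice each parameter group (encoder vs. projector / rest of model) would get its own `base_lr` from `stage_lrs` while sharing the same warmup-plus-cosine shape.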