TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding
Authors: Zhanyu Wang, Chen Tang, Haoyu He, Kuan Feng, Chao Wang, Bingni Zhang, Xiaolei Xu, Shen Wang, Luping Zhou
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both variants achieve state-of-the-art or competitive results on a wide range of image and video benchmarks while markedly improving token efficiency. Code is publicly available at: https://github.com/wang-zhanyu/TempFlex. [...] 4 Experiments |
| Researcher Affiliation | Collaboration | Zhanyu Wang EMAIL ByteDance [...] Luping Zhou EMAIL The University of Sydney |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology for Temporal Fiber Fusion is described using mathematical equations and textual explanations in Section 3.2, but not in a structured pseudocode format. |
| Open Source Code | Yes | Code is publicly available at: https://github.com/wang-zhanyu/TempFlex. [...] with all code, models, and data to be released publicly. |
| Open Datasets | Yes | To support large-scale video-language pretraining, we curate TempFlex-2M, a high-quality synthetic video-text corpus generated in a single stage via GPT-4o with direct visual prompting. [...] Our resulting dataset, TempFlex-2M, comprises 210K curated and de-duplicated videos sourced from FineVideo (Farré et al., 2024), OpenVid-1M (Nan et al., 2024), Vatex (Wang et al., 2019), and Vript (Yang et al., 2024). [...] with all code, models, and data to be released publicly. |
| Dataset Splits | No | No explicit training/validation/test splits are provided for the combined training corpus or the authors' own TempFlex-2M dataset. The paper details the composition of training data for each stage and evaluates on existing benchmarks, but does not specify how its own data is partitioned into train/validation/test sets. |
| Hardware Specification | Yes | All models are trained on 128 NVIDIA H100 GPUs with 80GB VRAM in four distinct stages: |
| Software Dependencies | No | No specific version numbers for software dependencies like Python, PyTorch, or CUDA are provided. The paper mentions training tools like DeepSpeed ZeRO-2 and FlashAttention, and the evaluation framework VLMEvalKit, but without versioning information. |
| Experiment Setup | Yes | Stage 1: Visual-Language Alignment and Encoder Adaptation. We jointly fine-tune the visual encoder and the MLP projector, setting the learning rates to 5e-5 and 1e-3, respectively. [...] Stage 4: Video Instruction Fine-Tuning. All model parameters are updated in this stage. The visual encoder is trained with a learning rate of 2e-6, and the rest of the model with 1e-5. [...] Stage 4 is trained for 2 epochs with a batch size of 256, while all other stages use a batch size of 512 and are trained for 1 epoch. We employ a cosine learning rate scheduler with a warmup ratio of 0.03. |
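The schedule quoted above (cosine decay with a warmup ratio of 0.03) can be sketched in plain Python. This is an illustrative reconstruction, not the paper's code: the linear-warmup shape, the decay-to-zero floor, and the 1000-step horizon in the example are assumptions; only the warmup ratio and the stage learning rates (e.g. 2e-6 for the Stage 4 visual encoder) come from the report.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    """Cosine learning-rate schedule with linear warmup.

    Sketch of the schedule described in the report (warmup ratio 0.03,
    cosine decay). Warmup shape and zero final LR are assumptions.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr.
        return base_lr * (step + 1) / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: Stage 4 visual-encoder LR (2e-6) over a hypothetical 1000 steps.
peak_lr = max(lr_at_step(s, 1000, 2e-6) for s in range(1000))
```

Frameworks such as DeepSpeed or Hugging Face `transformers` provide equivalent built-in schedulers; the point here is only to make the quoted hyperparameters concrete.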