TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

Authors: Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding

ICLR 2025

Reproducibility assessment (Variable | Result | LLM Response):
Research Type | Experimental | Extensive experiments validate the superiority of our TempMe. Compared to previous parameter-efficient text-video retrieval methods, TempMe achieves superior performance with just 0.50M trainable parameters. It significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8× speedup and a 4.4% R-Sum improvement. Extensive experiments are conducted on four benchmark datasets: MSRVTT Xu et al. (2016), ActivityNet Krishna et al. (2017), DiDeMo Anne Hendricks et al. (2017), and LSMDC Rohrbach et al. (2015). Experimental results consistently demonstrate that our TempMe offers a leading balance between speed and accuracy, outperforming other efficient fine-tuning methods. Figure 1d shows the performance comparison of CLIP-ViT-B/16 on MSRVTT.
Researcher Affiliation | Collaboration | 1 School of Software, Tsinghua University; 2 BNRist, Tsinghua University; 3 Hangzhou Zhuoxi Institute of Brain and Intelligence; 4 JD.com; 5 GRG Banking Equipment Co., Ltd.; 6 South China University of Technology
Pseudocode | No | The paper describes the methodology using prose, figures (Figure 2, Figure 3), and mathematical equations (Equations 1-4), but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/LunarShen/TempMe.
Open Datasets | Yes | Extensive experiments are conducted on four benchmark datasets: MSRVTT Xu et al. (2016), ActivityNet Krishna et al. (2017), DiDeMo Anne Hendricks et al. (2017), and LSMDC Rohrbach et al. (2015).
Dataset Splits | Yes | Following the data split in Gabeur et al. (2020); Miech et al. (2019), we train models on 9,000 train+val videos with the corresponding captions and test on the 1K-A test set with 1,000 video-text pairs. (2) ActivityNet Krishna et al. (2017) [...] We evaluate models on the val1 split, comprising 10,009 videos for training and 4,917 for testing. (3) DiDeMo Anne Hendricks et al. (2017) [...] There are 8,395 videos in the train set and 1,004 videos in the test set. (4) LSMDC Rohrbach et al. (2015) [...] 101,079 videos are used for training. 7,408 and 1,000 videos are used for validation and testing, respectively.
Hardware Specification | Yes | All throughputs are measured on an A100 GPU. [...] For all experiments, memory usage during training and inference is measured on 4 A100 GPUs, each processing a batch size of 32.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and a 'cosine learning rate schedule' but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | We employ the AdamW optimizer Loshchilov & Hutter (2016) with a batch size of 128. The initial learning rate is set to 6e-4 with a cosine learning rate schedule Goyal et al. (2017) for 5 epochs. The dimension of LoRA is set to 8 in all experiments. The ImgMe Block employs r = 2 for CLIP-ViT-B/32 and r = 10 for CLIP-ViT-B/16. The ClipMe Block employs RC = 70%, RI = 90% for ViT-B/32 and RC = 60%, RI = 80% for CLIP-ViT-B/16.
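The reported optimization recipe (AdamW at an initial learning rate of 6e-4, decayed by a cosine schedule over 5 epochs) can be sketched as below. This is a minimal illustration, not the authors' code: the steps-per-epoch value is a hypothetical placeholder, and the plain-Python cosine function is an assumption, since the paper does not state the framework or total step count.

```python
import math

BASE_LR = 6e-4        # initial learning rate reported in the paper
EPOCHS = 5            # training length reported in the paper
STEPS_PER_EPOCH = 100  # hypothetical: not stated in the paper

def cosine_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Cosine annealing from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

total = EPOCHS * STEPS_PER_EPOCH
print(cosine_lr(0, total))           # 6e-4 at the start
print(cosine_lr(total // 2, total))  # 3e-4 at the midpoint
print(cosine_lr(total, total))       # ~0.0 at the end of training
```

In a PyTorch training loop, the same decay is typically obtained with `torch.optim.AdamW` plus `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total)`, stepping the scheduler once per optimizer update.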