Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Authors: Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, Jian Yang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios.
Researcher Affiliation | Academia | 1 PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology; 2 Nanjing University; 3 Guangxi Normal University. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2 and Figure 3), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code: https://github.com/NJU-PCALab/STTrack
Open Datasets | Yes | The proposed STTrack achieves state-of-the-art performance on five popular multimodal tracking benchmarks, including RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022.
Dataset Splits | Yes | VisEvent is the largest RGB-E dataset, encompassing 500 training video sequences and 320 testing video sequences.
Hardware Specification | Yes | The training was conducted on four NVIDIA Tesla A6000 GPUs over 15 epochs... The tracking speed, tested on an NVIDIA 4090 GPU, is approximately 35.5 FPS.
Software Dependencies | No | AdamW (Loshchilov and Hutter 2018) was employed as the optimizer, with an initial learning rate of 1e-5 for the ViT backbone and 1e-4 for other parameters. This mentions an optimizer but does not specify software versions for programming languages, libraries, or frameworks.
Experiment Setup | Yes | The training was conducted on four NVIDIA Tesla A6000 GPUs over 15 epochs, with each epoch consisting of 60,000 sample pairs and a batch size of 32. AdamW (Loshchilov and Hutter 2018) was employed as the optimizer, with an initial learning rate of 1e-5 for the ViT backbone and 1e-4 for other parameters. After 10 epochs, the learning rate was reduced by a factor of 10.
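The reported schedule (two parameter groups at 1e-5 and 1e-4, both dropped by a factor of 10 after epoch 10 of 15) can be sketched as a small helper. This is a minimal illustration of the schedule as stated in the report, not the authors' code; the constant and function names are invented for clarity.

```python
# Hedged sketch of STTrack's reported training schedule: AdamW with an
# initial LR of 1e-5 for the ViT backbone and 1e-4 for other parameters,
# both reduced by 10x after epoch 10 of a 15-epoch run. Names here are
# illustrative assumptions, not taken from the released repository.

BACKBONE_LR = 1e-5   # initial learning rate for the ViT backbone
OTHER_LR = 1e-4      # initial learning rate for all other parameters
DECAY_EPOCH = 10     # after this epoch the LR drops by a factor of 10
TOTAL_EPOCHS = 15

def lr_at_epoch(base_lr: float, epoch: int) -> float:
    """Learning rate in effect during a given 1-indexed training epoch."""
    return base_lr * (0.1 if epoch > DECAY_EPOCH else 1.0)

# Per-epoch schedule for both parameter groups.
schedule = [
    (epoch, lr_at_epoch(BACKBONE_LR, epoch), lr_at_epoch(OTHER_LR, epoch))
    for epoch in range(1, TOTAL_EPOCHS + 1)
]
```

In a PyTorch implementation the two rates would typically be expressed as AdamW parameter groups, with the decay applied by a step scheduler at epoch 10.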