Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking
Authors: Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, Jian Yang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios. |
| Researcher Affiliation | Academia | ¹PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology; ²Nanjing University; ³Guangxi Normal University. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2 and Figure 3), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code https://github.com/NJU-PCALab/STTrack |
| Open Datasets | Yes | The proposed STTrack achieves state-of-the-art performance on five popular multimodal tracking benchmarks, including RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022. |
| Dataset Splits | Yes | VisEvent is the largest RGB-E dataset, encompassing 500 training video sequences and 320 testing video sequences. |
| Hardware Specification | Yes | The training was conducted on four NVIDIA Tesla A6000 GPUs over 15 epochs... The tracking speed, tested on an NVIDIA 4090 GPU, is approximately 35.5 FPS. |
| Software Dependencies | No | AdamW (Loshchilov and Hutter 2018) was employed as the optimizer, with an initial learning rate of 1e-5 for the ViT backbone and 1e-4 for other parameters. This mentions an optimizer but does not specify software versions for programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | The training was conducted on four NVIDIA Tesla A6000 GPUs over 15 epochs, with each epoch consisting of 60,000 sample pairs and a batch size of 32. AdamW (Loshchilov and Hutter 2018) was employed as the optimizer, with an initial learning rate of 1e-5 for the ViT backbone and 1e-4 for other parameters. After 10 epochs, the learning rate was reduced by a factor of 10. |
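The learning-rate schedule reported above (1e-5 for the ViT backbone, 1e-4 elsewhere, both cut by a factor of 10 after epoch 10 of 15) can be summarized as a small step function. This is an illustrative sketch, not the authors' released training code; the function name and defaults are assumptions for clarity.

```python
def lr_at_epoch(epoch, base_lr, decay_epoch=10, decay_factor=10.0, total_epochs=15):
    """Learning rate in effect during `epoch` (1-indexed), per the paper's
    reported schedule: constant `base_lr`, divided by `decay_factor` after
    `decay_epoch`. Function name and signature are illustrative."""
    if not 1 <= epoch <= total_epochs:
        raise ValueError(f"epoch must be in [1, {total_epochs}]")
    return base_lr / decay_factor if epoch > decay_epoch else base_lr

# Per-epoch (backbone_lr, other_lr) pairs for the full 15-epoch run.
schedule = [(lr_at_epoch(e, 1e-5), lr_at_epoch(e, 1e-4)) for e in range(1, 16)]
```

In a PyTorch training loop this would typically be expressed as two AdamW parameter groups with a step decay at epoch 10; the sketch above only captures the schedule itself.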