Robust Tracking via Mamba-based Context-aware Token Learning

Authors: Jinxia Xie, Bineng Zhong, Qihua Liang, Ning Li, Zhiyi Mo, Shuxiang Song

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments show our method is effective and achieves competitive performance on multiple benchmarks at a real-time speed.
Researcher Affiliation Academia 1 Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; 2 Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China; 3 Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Wuzhou 543002, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the proposed method in text and uses diagrams (Figure 2, Figure 3) to illustrate the architecture and flow, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code Yes Code: https://github.com/GXNU-ZhongLab/TemTrack
Open Datasets Yes Following the mainstream trackers, we use four datasets for training, including COCO (Lin et al. 2014), LaSOT (Fan et al. 2019), TrackingNet (Muller et al. 2018), and GOT-10k (Huang, Zhao, and Huang 2021). We also evaluate our tracker on two additional benchmarks: UAV123 and TNL2K.
Dataset Splits Yes LaSOT (Fan et al. 2019) is a high-quality benchmark for long-term single object tracking. It consists of 1120 sequences for training and 280 sequences for testing. ... GOT-10k is a large, high-diversity benchmark for generic object tracking, which introduces a one-shot protocol for evaluation, i.e., the training and test classes have zero overlap. ... Following (Xie et al. 2024) and (Shi et al. 2024), we sample n video clips for each GPU, each containing m search images (all sharing the same template). So each GPU holds n × m image pairs, i.e., the per-GPU batch size is n × m. We keep the per-GPU batch size equal to 32; for four GPUs, the total batch size is 128. Here m is the size of the sliding window, i.e., the length of the temporal context. In TemTrack, n and m are 4 and 8, respectively.
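The batch arithmetic above can be checked with a minimal sketch (an assumed layout, not the authors' code): each GPU samples n video clips, each contributing m search images that share one template, so one GPU holds n × m template-search pairs.

```python
def batch_composition(n_clips: int, m_frames: int, num_gpus: int):
    """Return (per-GPU batch size, total batch size across GPUs)."""
    per_gpu = n_clips * m_frames   # template-search image pairs on one GPU
    total = per_gpu * num_gpus     # pairs processed per optimization step
    return per_gpu, total

# TemTrack's reported settings: n = 4 clips, m = 8 frames, 4 GPUs.
per_gpu, total = batch_composition(4, 8, 4)
print(per_gpu, total)  # 32 128
```

This matches the reported per-GPU batch size of 32 and total batch size of 128.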
Hardware Specification Yes The training is on 4 NVIDIA A10 GPUs and the speed evaluation is on a single NVIDIA V100 GPU.
Software Dependencies Yes Our tracker is implemented in Python 3.8 using PyTorch 1.13.1.
Experiment Setup Yes We train TemTrack with the AdamW optimizer (Loshchilov and Hutter 2019). The learning rate of the backbone is 4 × 10^-5, the learning rate of the other parameters is 4 × 10^-4, and the weight decay is 10^-4. ... We keep the per-GPU batch size equal to 32; for four GPUs, the total batch size is 128. ... We train TemTrack for 150 epochs with 60k image pairs per epoch, and decrease the learning rate by a factor of 10 after the 120th epoch. For the GOT-10k benchmark, we train the model for only 40 epochs, with the learning rate decaying at 80% of the epochs. ... We use focal loss (Lin et al. 2017) for classification and combine GIoU loss (Rezatofighi et al. 2019) and L1 loss for regression. The total loss L is calculated as eq. (6), where λ_giou = 2 and λ_L1 = 5.
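The reported loss weighting and step learning-rate schedule can be sketched as follows. This is an illustrative reconstruction from the numbers above, not the authors' code; the individual loss terms (focal, GIoU, L1) are assumed to be computed elsewhere.

```python
def total_loss(l_focal: float, l_giou: float, l_l1: float,
               lam_giou: float = 2.0, lam_l1: float = 5.0) -> float:
    # Eq. (6): classification focal loss plus weighted regression losses,
    # with the reported weights λ_giou = 2 and λ_L1 = 5.
    return l_focal + lam_giou * l_giou + lam_l1 * l_l1

def learning_rate(epoch: int, base_lr: float = 4e-4,
                  decay_epoch: int = 120, factor: float = 10.0) -> float:
    # Step schedule: hold the base rate, then divide by 10 after epoch 120
    # (of 150). The backbone uses the same schedule with base_lr = 4e-5.
    return base_lr if epoch < decay_epoch else base_lr / factor
```

For the GOT-10k-only model (40 epochs), `decay_epoch` would be 32, i.e., 80% of the epochs.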