Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking

Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, Shuxiang Song

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on nine benchmark datasets demonstrate that SSTrack surpasses SOTA self-supervised tracking methods, achieving an improvement of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT-10k, LaSOT, and TrackingNet datasets, respectively." Code: https://github.com/GXNU-ZhongLab/SSTrack
Researcher Affiliation | Academia | (1) Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; (2) Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China
Pseudocode | Yes | Algorithm 1: The SSTrack training process.
Input: initial frame and bounding box (I_r, B_r); search frames I_s^(t=2:n). Output: B_s^t in subsequent frames.
// Forward tracking
for t = 2 to n do
    B_s^t = E(I_s^t, (I_r, B_r))
    crop I_s^t based on B_s^t to yield a new reference frame I_sr^t
end for
// Backward tracking
expand I_r to get multiple new views I_r^(1:m)
for t = 1 to m do
    B_r^t = E(I_r^t, {(I_sr, B_sr)}^n)
end for
// Tracking and contrastive losses
compute the loss using Eq. 6 and update parameters
return B_s^t
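The forward/backward control flow of Algorithm 1 can be sketched as a minimal, runnable Python loop. The tracker `E`, `crop`, `expand_views`, and `loss_fn` here are hypothetical stand-ins for the paper's actual components; only the structure mirrors the pseudocode.

```python
def sstrack_step(E, I_r, B_r, searches, expand_views, crop, loss_fn):
    """One self-supervised step: forward track, backward re-track, compute loss.

    Stand-in sketch of Algorithm 1; E, crop, expand_views, and loss_fn
    are placeholders for the paper's tracker, cropping, view expansion,
    and tracking + contrastive loss (Eq. 6).
    """
    # Forward tracking: predict a box in each search frame, then crop
    # around it to obtain a new reference frame for the backward pass.
    refs, boxes_s = [], []
    for I_s in searches:
        B_s = E(I_s, (I_r, B_r))
        boxes_s.append(B_s)
        refs.append((crop(I_s, B_s), B_s))

    # Backward tracking: re-localize the target in expanded views of the
    # initial frame, using the forward-tracked crops as references.
    boxes_r = [E(I_v, refs) for I_v in expand_views(I_r)]

    # The loss compares the backward predictions with the known initial box.
    return loss_fn(boxes_r, B_r), boxes_s
```

The key self-supervised idea this preserves is cycle consistency: the only ground-truth box ever used is the one in the initial frame, which the backward pass must recover.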
Open Source Code | Yes | Code: https://github.com/GXNU-ZhongLab/SSTrack
Open Datasets | Yes | "The training data includes LaSOT (Fan et al. 2019), GOT-10k (Huang, Zhao, and Huang 2021), TrackingNet (Müller et al. 2018), and COCO (Lin et al. 2014)."
Dataset Splits | Yes | "LaSOT is a classic long-term tracking benchmark, comprising 1120 training sequences and 280 test sequences. As shown in Tab. 2, compared to the self-supervised method TADS, our method improves the success, normalized precision, and precision scores by 20.4%, 22.2%, and 25.9%, respectively."
Hardware Specification | Yes | "The model is trained on a server with two 80 GB Tesla A100 GPUs with a batch size of 8."
Software Dependencies | No | The paper mentions specific models and optimizers such as "ViT-Base (Dosovitskiy et al. 2021)", "DropMAE (Wu et al. 2023)", and "AdamW (Loshchilov and Hutter 2019)", but does not provide version numbers for underlying software libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "AdamW (Loshchilov and Hutter 2019) is used to optimize the model parameters end-to-end, with an initial learning rate of 2.5 × 10^-5 for the backbone, 2.5 × 10^-4 for the rest, and a weight decay of 10^-4. Training runs for 150 epochs, with 10k image pairs randomly sampled per epoch. The learning rate drops by a factor of 10 after 120 epochs. The model is trained on a server with two 80 GB Tesla A100 GPUs with a batch size of 8."
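As a concrete reading of the reported schedule, a small helper (hypothetical name `lr_at`) reproduces the step decay: base rates of 2.5 × 10^-5 (backbone) and 2.5 × 10^-4 (rest), dropped by a factor of 10 after epoch 120 of 150. Epoch indexing is assumed to be 0-based; this is a sketch of the stated hyperparameters, not the paper's training code.

```python
# Reported hyperparameters from the experiment setup.
BACKBONE_LR = 2.5e-5   # initial learning rate for the ViT backbone
HEAD_LR = 2.5e-4       # initial learning rate for the remaining layers
WEIGHT_DECAY = 1e-4
TOTAL_EPOCHS = 150
DECAY_EPOCH = 120      # lr drops by 10x after this many epochs

def lr_at(epoch, base_lr):
    """Learning rate at a given (0-indexed) epoch under the step schedule."""
    return base_lr * (0.1 if epoch >= DECAY_EPOCH else 1.0)
```

Plugged into a framework, the two base rates would typically become two optimizer parameter groups (backbone vs. rest) sharing the same decay step.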