Exploring Efficient and Effective Sequence Learning for Visual Object Tracking

Authors: Dongdong Li, Zhinan Gao, Yangliu Kuai, Rui Chen

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate that FastSeqTrack runs at over 100 fps and shows superior performance against state-of-the-art trackers.
Researcher Affiliation | Academia | Dongdong Li, Zhinan Gao, Yangliu Kuai, Rui Chen, National University of Defense Technology
Pseudocode | Yes | Algorithm 1: Decoder With Early Exits
Open Source Code | Yes | Codes and models are available at https://github.com/vision4drones/FastSeqTrack.
Open Datasets | Yes | Our training data includes the training splits of COCO [Lin et al., 2014], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], TrackingNet [Muller et al., 2018] and VastTrack [Peng et al., 2024].
Dataset Splits | Yes | Our training data includes the training splits of COCO [Lin et al., 2014], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], TrackingNet [Muller et al., 2018] and VastTrack [Peng et al., 2024]. The test set comprises 3,500 videos from 2,115 classes. On the large-scale TNL2K benchmark, FastSeqTrack obtains the third-best performance with a 56.86% AUC score, as reported in Tab. 2. TrackingNet [Muller et al., 2018] is a relatively smaller dataset covering diverse object categories and scenes, with 511 videos in its test set. The GOT-10k [Huang et al., 2019] test set contains 180 videos covering various common tracking challenges.
Hardware Specification | Yes | The speed is measured on an Intel Xeon Gold 6354 CPU @ 3.00 GHz with 64 GB RAM and a single 4090 GPU with 24 GB memory. The training of FastSeqTrack is conducted on two Intel Xeon Gold 6354 CPUs @ 3.00 GHz with 64 GB RAM and eight 4090 GPUs with 24 GB memory each.
Software Dependencies | Yes | All the models are implemented with Python 3.8 and PyTorch 1.11.0.
Experiment Setup | Yes | The input resolution of the template image and the search image is 256×256. The patch size is set to 16×16. The decoder consists of 2 transformer blocks. Each GPU holds 16 image pairs, resulting in a total batch size of 128. The regularization parameters λce, λiou ∈ ℝ in Eq. 1 are set to 1 and 5, respectively. The model is trained for a total of 500 epochs with 60k image pairs per epoch. The learning rate decreases by a factor of 10 after 400 epochs. The online template update interval is set to 1 by default, while the threshold τ in Algorithm 1 is set to 1.6. The vocabulary size is set to 4000.
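The reported setup can be collected into a single configuration sketch. This is a hypothetical layout for illustration only: the field names below are assumptions and are not taken from the FastSeqTrack codebase, but the values mirror the numbers quoted above.

```python
# Hypothetical configuration mirroring the reported FastSeqTrack setup.
# Field names are illustrative assumptions, not the repository's actual keys.
FAST_SEQ_TRACK_CONFIG = {
    "template_size": (256, 256),    # template image resolution
    "search_size": (256, 256),      # search image resolution
    "patch_size": 16,               # 16x16 patches
    "decoder_blocks": 2,            # transformer blocks in the decoder
    "pairs_per_gpu": 16,            # image pairs held by each GPU
    "num_gpus": 8,                  # eight 4090 GPUs used for training
    "lambda_ce": 1.0,               # cross-entropy loss weight (Eq. 1)
    "lambda_iou": 5.0,              # IoU loss weight (Eq. 1)
    "epochs": 500,
    "pairs_per_epoch": 60_000,
    "lr_drop_epoch": 400,           # learning rate divided by 10 afterwards
    "template_update_interval": 1,  # online template update interval
    "early_exit_threshold": 1.6,    # tau in Algorithm 1
    "vocab_size": 4000,             # coordinate vocabulary size
}

# The total batch size of 128 follows from GPUs x pairs per GPU.
total_batch_size = (
    FAST_SEQ_TRACK_CONFIG["num_gpus"] * FAST_SEQ_TRACK_CONFIG["pairs_per_gpu"]
)
```

Collecting the hyperparameters this way makes the consistency check explicit: 8 GPUs × 16 pairs per GPU reproduces the stated total batch size of 128.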
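The pseudocode row above refers to Algorithm 1, "Decoder With Early Exits", with exit threshold τ = 1.6. The sketch below shows the generic early-exit pattern that name suggests: run decoder blocks in sequence and stop as soon as a confidence score clears the threshold. The exit criterion and block interface here are assumptions for illustration; the paper's actual Algorithm 1 may differ in detail.

```python
# Generic early-exit decoding loop, a sketch of the pattern behind
# "Decoder With Early Exits". The confidence function and block signature
# are illustrative assumptions, not the paper's exact Algorithm 1.
from typing import Callable, List, Sequence, Tuple


def decode_with_early_exit(
    hidden: Sequence[float],
    blocks: List[Callable],  # decoder blocks applied in order
    confidence: Callable,    # maps a hidden state to a scalar confidence
    tau: float = 1.6,        # exit threshold (the paper sets tau = 1.6)
) -> Tuple[Sequence[float], int]:
    """Apply decoder blocks, exiting early once confidence exceeds tau.

    Returns the final hidden state and the number of blocks executed.
    """
    for depth, block in enumerate(blocks, start=1):
        hidden = block(hidden)
        if confidence(hidden) > tau:
            return hidden, depth  # early exit: skip the remaining blocks
    return hidden, len(blocks)
```

With only 2 decoder blocks, as configured above, the loop can save at most one block per token, but for a sequence-learning tracker that decodes many tokens per frame the saved blocks accumulate toward the reported >100 fps.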