Exploring Efficient and Effective Sequence Learning for Visual Object Tracking
Authors: Dongdong Li, Zhinan Gao, Yangliu Kuai, Rui Chen
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate that FastSeqTrack runs over 100 fps and showcases superior performance against state-of-the-art trackers. |
| Researcher Affiliation | Academia | Dongdong Li, Zhinan Gao, Yangliu Kuai, Rui Chen — National University of Defense Technology |
| Pseudocode | Yes | Algorithm 1 Decoder With Early Exits |
| Open Source Code | Yes | Codes and models are available at https://github.com/vision4drones/FastSeqTrack. |
| Open Datasets | Yes | Our training data includes the training splits of COCO [Lin et al., 2014], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], TrackingNet [Muller et al., 2018] and VastTrack [Peng et al., 2024]. |
| Dataset Splits | Yes | Our training data includes the training splits of COCO [Lin et al., 2014], LaSOT [Fan et al., 2019], GOT-10k [Huang et al., 2019], TrackingNet [Muller et al., 2018] and VastTrack [Peng et al., 2024]. The test set comprises 3500 videos from 2115 classes. On the large-scale TNL2K benchmark, FastSeqTrack obtains the third best performance with 56.86% AUC score as reported in Tab. 2. TrackingNet [Muller et al., 2018] is a relatively smaller dataset covering diverse object categories and scenes with 511 videos in the test set. The GOT-10k [Huang et al., 2019] test set contains 180 videos covering various common tracking challenges. |
| Hardware Specification | Yes | The speed is measured on an Intel Xeon Gold 6354 CPU @ 3.00GHz with 64 GB RAM and a single RTX 4090 GPU with 24 GB memory. The training of FastSeqTrack is conducted on 2 Intel Xeon Gold 6354 CPUs @ 3.00GHz with 64 GB RAM and 8 RTX 4090 GPUs with 24 GB memory each. |
| Software Dependencies | Yes | All the models are implemented with Python 3.8 and PyTorch 1.11.0. |
| Experiment Setup | Yes | The input resolution of the template image and search image is 256×256. The patch size is set to 16×16. The decoder consists of 2 transformer blocks. Each GPU holds 16 image pairs, resulting in a total batch size of 128. The regularization parameters λ_ce, λ_iou ∈ ℝ in Eq. 1 are set to 1 and 5 respectively. The model is trained for a total of 500 epochs with 60k image pairs per epoch. The learning rate decreases by a factor of 10 after 400 epochs. The online template update interval is set to 1 by default, while the threshold τ in Algorithm 1 is set to 1.6. The vocabulary size is set to 4000. |
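The reported weights λ_ce = 1 and λ_iou = 5 suggest a standard two-term training objective. A plausible reconstruction of Eq. 1, assuming (as is common for sequence-to-sequence trackers) a cross-entropy term over predicted box tokens plus an IoU-based box regression term, would be:

```latex
% Assumed form of Eq. 1: token cross-entropy plus IoU loss,
% weighted by the regularization parameters reported in the paper.
\mathcal{L} \;=\; \lambda_{ce}\,\mathcal{L}_{ce} \;+\; \lambda_{iou}\,\mathcal{L}_{iou},
\qquad \lambda_{ce} = 1,\quad \lambda_{iou} = 5
```

The exact decomposition of the loss terms is an assumption here; only the weight values are taken from the paper.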