Efficient Motion Prompt Learning for Robust Visual Tracking
Authors: Jie Zhao, Xin Chen, Yongsheng Yuan, Michael Felsberg, Dong Wang, Huchuan Lu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. |
| Researcher Affiliation | Academia | 1Dalian University of Technology, 2City University of Hong Kong, 3Linköping University. |
| Pseudocode | No | The paper describes methods using textual descriptions and figures, such as Figure 2 for the pipeline, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/zj5559/Motion-Prompt-Tracking. |
| Open Datasets | Yes | We select the training splits of LaSOT (Fan et al., 2019), GOT10K (Huang et al., 2019), and TrackingNet (Muller et al., 2018) as the training data. ... We compare our methods with baselines and other SOTA trackers on the following seven tracking benchmarks. VOT: VOT2018 (Kristan et al., 2018), VOT2020 (Kristan et al., 2020), and VOT2022 (Kristan et al., 2022). LaSOT and LaSOText: LaSOT (Fan et al., 2019) ... LaSOText (Fan et al., 2021) ... TNL2K (Wang et al., 2021). TrackingNet (Muller et al., 2018). |
| Dataset Splits | Yes | We select the training splits of LaSOT (Fan et al., 2019), GOT10K (Huang et al., 2019), and TrackingNet (Muller et al., 2018) as the training data. For the motion input, we adopt DiMP-18 (Bhat et al., 2019) to generate real tracking predictions for each of the training sequences, and employ reverse sampling, sparse sampling, and CutMix (Yun et al., 2019) for data augmentation. ... Following the VOT protocol, 1k sequences are removed. |
| Hardware Specification | Yes | Models are trained on 2 NVIDIA A100 GPUs, and tested on a single NVIDIA RTX2080Ti GPU. |
| Software Dependencies | No | Our methods are implemented in Python with PyTorch. |
| Experiment Setup | Yes | The length of the historical trajectory T is set to 30 based on experimental results. ... The lightweight fusion decoder is implemented as a two-layer Transformer network. The weight head Head_W and motion head Head_M are implemented by a two-layer MLP, where the hidden size is 256. ... The model is trained for 60 epochs with 60k image pairs per epoch. We set the batch size to 128, and the learning rate is decreased by a factor of 10 after 40 epochs. The initial learning rate and other training settings are set the same as the corresponding baseline trackers. ... L_M = λ_IoU · L_IoU + λ_ℓ1 · L_1 (Eq. 4), where λ_IoU = 2 and λ_ℓ1 = 5 in our experiments. ... L = L_Tr + λ_M (L_M + L_W) (Eq. 6), where λ_M = 1 in our experiments. ... For each layer, the number of attention heads is 8, and the hidden size of the MLP is set to 1024 and 256 for OSTrack / ARTrack and SeqTrack, respectively. ... the best performance of our model is attained when the probability of CutMix is set to 0.5. ... The sparseness of 5 is an optimized choice, which is also our default setting. |
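The loss terms quoted in the experiment setup (Eq. 4 and Eq. 6) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the box representation, the use of 1 − IoU as the IoU loss, and the mean-ℓ1 form are assumptions; only the weights λ_IoU = 2, λ_ℓ1 = 5, and λ_M = 1 come from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def motion_loss(pred, gt, lam_iou=2.0, lam_l1=5.0):
    """Eq. (4): L_M = lambda_IoU * L_IoU + lambda_l1 * L_1.
    Assumes L_IoU = 1 - IoU and L_1 = mean absolute coordinate error."""
    l_iou = 1.0 - box_iou(pred, gt)
    l_l1 = sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)
    return lam_iou * l_iou + lam_l1 * l_l1

def total_loss(l_tr, l_m, l_w, lam_m=1.0):
    """Eq. (6): L = L_Tr + lambda_M * (L_M + L_W),
    combining the baseline tracking loss L_Tr with the motion
    and weight-head losses."""
    return l_tr + lam_m * (l_m + l_w)
```

With a perfect prediction, `motion_loss` is zero and the total loss reduces to the baseline tracking loss plus the weight-head term, matching how λ_M = 1 simply adds the motion branch on top of the unchanged baseline objective.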