Cross-modulated Attention Transformer for RGBT Tracking

Authors: Yun Xiao, Jiacong Zhao, Andong Lu, Chenglong Li, Bing Yin, Yin Lin, Cong Liu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.
Researcher Affiliation | Collaboration | 1 School of Artificial Intelligence, Anhui University, Hefei, China ... 3 iFLYTEK CO., LTD., Hefei, China
Pseudocode | No | The paper describes the method using mathematical formulations and descriptive text, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/opacity-black/CAFormer
Open Datasets | Yes | Experiments on five public RGBT tracking benchmarks... Our experiments are conducted on five public datasets: GTOT (Li et al. 2016), RGBT210 (Li et al. 2017), RGBT234 (Li et al. 2019a), LasHeR (Li et al. 2021), and VTUAV (Pengyu et al. 2022).
Dataset Splits | Yes | We train our model for 10 epochs on the training set of LasHeR (Li et al. 2021)... For GTOT (Li et al. 2016), RGBT210 (Li et al. 2017), and RGBT234 (Li et al. 2019a), we directly evaluate our model without any further fine-tuning. For the VTUAV (Pengyu et al. 2022) dataset, we adopt the VTUAV training set for our training process, and adjust the number of training epochs to 5.
Hardware Specification | Yes | For the training process, CAFormer is trained on 2 NVIDIA 2080 Ti GPUs... Additionally, we complete the speed test on a device with an NVIDIA RTX 3080 Ti GPU.
Software Dependencies | No | The paper mentions the use of 'AdamW (Loshchilov and Hutter 2017)' as the optimization algorithm, but does not specify versions for other key software components like programming languages or libraries.
Experiment Setup | Yes | In our method, the proposed CAFormer block is integrated into the last 3 layers of the backbone, and the CTE strategy is adopted at layers 3, 6, and 9. The search regions are resized to 256×256, while the templates are resized to 128×128. For the training process, CAFormer is trained on 2 NVIDIA 2080 Ti GPUs with a global batch size of 32. We set the learning rates of the backbone network and other parameters to 5e-6 and 5e-5, respectively. The optimization algorithm employed is AdamW (Loshchilov and Hutter 2017) with a weight decay of 1e-4. We train our model for 10 epochs on the training set of LasHeR... For the VTUAV (Pengyu et al. 2022) dataset, we adopt the VTUAV training set for our training process, and adjust the number of training epochs to 5. Following previous work (Hui et al. 2023), all experiments in this paper are loaded with pre-trained weights from the public SOT method (Ye et al. 2022).
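The optimizer portion of the reported setup (AdamW, backbone learning rate 5e-6, 5e-5 for all other parameters, weight decay 1e-4) can be sketched as follows. This is a minimal illustration assuming PyTorch; the module names ("backbone", "head") are placeholders, not the authors' actual code.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the ViT backbone and prediction head.
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "head": nn.Linear(8, 4),
})

# Two parameter groups, as described in the paper's setup:
# backbone at 5e-6, everything else at 5e-5.
param_groups = [
    {"params": model["backbone"].parameters(), "lr": 5e-6},
    {"params": model["head"].parameters(), "lr": 5e-5},
]

# AdamW with weight decay 1e-4, applied to both groups.
optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)
```

Splitting the parameters into groups is the standard way in PyTorch to give a pre-trained backbone a smaller learning rate than newly initialized heads.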