Template-based Uncertainty Multimodal Fusion Network for RGBT Tracking

Authors: Zhaodong Ding, Chenglong Li, Shengqing Miao, Jin Tang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "Extensive experiments suggest that our method outperforms existing approaches on four RGBT tracking benchmarks." (Abstract) "Extensive experiments demonstrate that our method outperforms existing RGBT tracking methods on four popular RGBT tracking datasets, including GTOT [Li et al., 2016], RGBT210 [Li et al., 2017], RGBT234 [Li et al., 2019a] and LasHeR [Li et al., 2021]." (Section 1 - Contributions) See also Section 4 (Experiment) and Section 4.4 (Ablation Study).
Researcher Affiliation: Academia — 1 National Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, Hefei, 230601, China; 2 Anhui Provincial Key Laboratory of Security Artificial Intelligence, Hefei, 230601, China; 3 Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Hefei, 230601, China; 4 School of Artificial Intelligence, Anhui University, Hefei, 230601, China; 5 School of Computer Science and Technology, Anhui University, Hefei, 230601, China; 6 School of Electronic and Information Engineering, Anhui University, Hefei, 230601, China. zhaodongding EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode: No — The paper describes its methodology using text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes — https://github.com/dongdong2061/IJCAI25-TUMFNet (Conclusion section)
Open Datasets: Yes — "Extensive experiments demonstrate that our method outperforms existing RGBT tracking methods on four popular RGBT tracking datasets, including GTOT [Li et al., 2016], RGBT210 [Li et al., 2017], RGBT234 [Li et al., 2019a] and LasHeR [Li et al., 2021]." (Section 1 - Contributions) "GTOT [Li et al., 2016] dataset is the earliest publicly available RGBT tracking dataset, containing a total of 50 sequences and approximately 15,000 frames. RGBT210 [Li et al., 2017] dataset consists of 210 pairs of RGBT video sequences, totaling 209.4K frames, and includes annotations across 12 attributes. RGBT234 [Li et al., 2019a] dataset extends RGBT210 dataset, offering more precise annotations. It contains a total of 234 pairs of RGBT video sequences, amounting to approximately 233.4K frames. LasHeR [Li et al., 2021] is a large RGBT tracking dataset, consisting of 1,224 aligned video sequences with approximately 1,469.6K frames. It includes 245 test sequences and 979 training sequences, covering 19 real-world challenge attributes." (Section 4.2)
Dataset Splits: Yes — "We train the overall tracking network end-to-end using the LasHeR training set to evaluate GTOT, RGBT210, RGBT234, and LasHeR test set." (Section 4.1) "LasHeR [Li et al., 2021] is a large RGBT tracking dataset, consisting of 1,224 aligned video sequences with approximately 1,469.6K frames. It includes 245 test sequences and 979 training sequences, covering 19 real-world challenge attributes." (Section 4.2)
Hardware Specification: Yes — "Our model is implemented using PyTorch and experiments are conducted on one RTX 4090 GPU." (Section 4.1)
Software Dependencies: No — "Our model is implemented using PyTorch and experiments are conducted on one RTX 4090 GPU. We take OSTrack [Ye et al., 2022] as the base tracker, which employs ViT as the backbone network for feature extraction." (Section 4.1) The paper mentions PyTorch, OSTrack, and ViT, but does not provide specific version numbers for any of these software components.
Experiment Setup: Yes — "The input search region and template sizes of the model are 256×256 and 128×128, respectively. The learning rate for the backbone network is set to 1×10^-5, while the learning rate for the other parameters is set to 1×10^-4. The model is trained for a total of 15 epochs. Additionally, we use the AdamW optimizer with a weight decay of 1×10^-4. Note that our UMFM is inserted into the 10th, 11th, and 12th blocks of the ViT. λ3 is set to 0.01 and N is set to 20." (Section 4.1)
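The training configuration quoted above can be collected into a minimal sketch for anyone attempting a reproduction: a config dict with the reported hyperparameters, plus the per-parameter-group learning-rate rule (smaller LR for the backbone, larger LR for everything else). The parameter-name prefix `backbone.` is an assumption for illustration, not confirmed by the paper.

```python
# Hedged sketch of the reported setup (Section 4.1): search/template sizes,
# 15 epochs, AdamW-style weight decay 1e-4, backbone LR 1e-5, other LR 1e-4.
# The "backbone." name prefix below is hypothetical, chosen for illustration.

TRAIN_CONFIG = {
    "search_size": (256, 256),    # input search region
    "template_size": (128, 128),  # input template
    "epochs": 15,
    "weight_decay": 1e-4,         # AdamW weight decay
    "lr_backbone": 1e-5,          # learning rate for the ViT backbone
    "lr_other": 1e-4,             # learning rate for all other parameters
}

def lr_for(param_name: str, cfg: dict = TRAIN_CONFIG) -> float:
    """Pick the learning rate for a parameter by name: backbone parameters
    get the smaller rate, all remaining parameters the larger one."""
    if param_name.startswith("backbone."):
        return cfg["lr_backbone"]
    return cfg["lr_other"]
```

In a PyTorch training script this rule would typically feed two optimizer parameter groups (one per learning rate) passed to `torch.optim.AdamW` together with the shared weight decay.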