S2-Track: A Simple yet Strong Approach for End-to-End 3D Multi-Object Tracking

Authors: Tao Tang, Lijun Zhou, Pengkun Hao, Zihang He, Kalok Ho, Shuo Gu, Zhihui Hao, Haiyang Sun, Kun Zhan, Peng Jia, Xianpeng Lang, Xiaodan Liang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the nuScenes benchmark demonstrate the effectiveness of our S2-Track framework. It achieves state-of-the-art performance with an impressive 66.3% AMOTA on the test split, surpassing the previous best end-to-end solution by a significant margin of 8.9% AMOTA. These results highlight our simple yet non-trivial improvements and showcase the potential of our framework in advancing the field of autonomous driving perception.
Researcher Affiliation | Collaboration | *Equal contribution. Work done during an internship at Li Auto Inc. 1Shenzhen Campus of Sun Yat-sen University, 2Li Auto Inc. Correspondence to: Xiaodan Liang <EMAIL>.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. Methodologies are described in text and supported by architectural diagrams.
Open Source Code | No | "We will include these implementations in the revision."
Open Datasets | Yes | We conduct experiments on the popular nuScenes benchmark (Caesar et al., 2020), which is a large-scale autonomous-driving dataset for 3D detection and tracking, consisting of 700, 150, and 150 scenes for training, validation, and testing, respectively.
Dataset Splits | Yes | We conduct experiments on the popular nuScenes benchmark (Caesar et al., 2020), which is a large-scale autonomous-driving dataset for 3D detection and tracking, consisting of 700, 150, and 150 scenes for training, validation, and testing, respectively.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100-80GB GPUs.
Software Dependencies | No | The paper mentions the AdamW optimizer but does not specify any software versions for libraries or programming languages (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We adopt the AdamW optimizer (Loshchilov & Hutter, 2017) for network training, with the initial learning rate set to 0.01 and the cosine weight decay set to 0.001. By default, the thresholds βlower and βupper are set to 0.3 and 0.7, and the weight coefficients λ are all set to 1.0. We pre-train the image backbone with a single-frame detection task for 12 epochs (small-resolution setting) or 24 epochs (full-resolution setting), and further train the end-to-end tracker on consecutive frames (set to 3 frames) for another 12 epochs (small resolution) or 24 epochs (full resolution).
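The reported training recipe (AdamW, initial learning rate 0.01, weight decay 0.001, cosine annealing over 12 epochs in the small-resolution setting) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the `torch.nn.Linear` model, the batch shape, and the placement of `scheduler.step()` at epoch granularity are assumptions, since the paper does not release an implementation.

```python
import torch

# Placeholder network standing in for the S2-Track tracker (assumption).
model = torch.nn.Linear(8, 8)

# Optimizer settings as stated in the paper's experiment setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=1e-3)

# Cosine learning-rate annealing over the 12-epoch small-resolution schedule
# (24 epochs for the full-resolution setting).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=12)

for epoch in range(12):
    # A real run would iterate over 3-frame clips here; one dummy step suffices
    # to show the optimizer/scheduler wiring.
    loss = model(torch.randn(4, 8)).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # anneal once per epoch
```

By the end of the schedule the learning rate has annealed from 0.01 toward zero, matching the usual cosine-decay behavior.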