Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving
Authors: Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wagner, Christoph Stiller
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the nuScenes (Caesar et al., 2020) dataset showcase the effectiveness of the DMAD structure in mitigating negative transfer. Our approach achieves significant performance gains in perception and prediction, which benefits the planning module and outperforms state-of-the-art (SOTA) E2E AD models. We conduct experiments on the nuScenes (Caesar et al., 2020) dataset to validate the effectiveness of our method. We present results in three parts. The first part focuses on perception (detection, tracking, and mapping). In the second part, we evaluate motion prediction and planning. Lastly, we provide an extensive ablation study and SHAP values (Lundberg & Lee, 2017) visualization. |
| Researcher Affiliation | Academia | Yinzhe Shen¹, Ömer Şahin Taş¹·², Kaiwen Wang¹, Royden Wagner¹, Christoph Stiller¹·² — ¹Karlsruhe Institute of Technology (KIT), ²FZI Research Center for Information Technology |
| Pseudocode | No | The paper describes the architecture and processes (e.g., 'Interactive Semantic Decoder', 'Neural-Bayes Motion Decoder') in detail using prose and diagrams, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code is available. |
| Open Datasets | Yes | Experiments on the nuScenes (Caesar et al., 2020) dataset showcase the effectiveness of the DMAD structure in mitigating negative transfer. Our approach achieves significant performance gains in perception and prediction, which benefits the planning module and outperforms state-of-the-art (SOTA) E2E AD models. |
| Dataset Splits | No | The paper mentions using the nuScenes dataset and a 'two-stage training scheme' with 'queue length' specifications. While nuScenes is a well-known dataset with predefined splits, the paper does not explicitly state which training, validation, or test splits were used for its experiments, nor does it cite the predefined splits in the context of their usage. |
| Hardware Specification | Yes | Compared to UniAD (Hu et al., 2023), our decoders add 13.1M parameters and increase inference latency by 0.02 seconds on an NVIDIA RTX 6000 Ada. |
| Software Dependencies | No | The paper does not explicitly list any specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) used for implementation or experimentation. |
| Experiment Setup | Yes | Two-stage training. We follow the two-stage training scheme of our baseline. In the first stage, we train object detection, tracking, and mapping. In the second stage, we train all modules together. Notably, because our tracking relies on reference points provided by unimodal prediction, we incorporate unimodal prediction training in the first stage. Multimodal prediction is trained only in the second stage, which is consistent with the baseline. Queue length. Since AD is a time-dependent task, the model typically processes a sequence of consecutive frames as a training sample. The number of input frames, i.e., the queue length q, defines the temporal horizon the model can capture, impacting the performance of related tasks. UniAD employs different queue lengths across its two training stages: 5 in the first stage and 3 in the second. The multi-head self-attention module is configured with 8 heads, an embedding dimension of 256, and a dropout rate of 0.1. The FFN consists of two linear layers with an intermediate ReLU activation, which expands the dimension from 256 to an inner-layer dimension of 512 before projecting it back to 256. |
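The attention/FFN configuration quoted above (8 heads, embedding dimension 256, dropout 0.1, FFN 256 → 512 → 256 with ReLU) can be sketched in PyTorch as follows. This is a minimal illustration of those hyperparameters only; the class name `AttnFFNBlock` and the pre/post-norm layout are assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class AttnFFNBlock(nn.Module):
    """Illustrative block matching the reported hyperparameters:
    8-head self-attention over 256-dim embeddings with dropout 0.1,
    followed by an FFN expanding 256 -> 512 -> 256 with ReLU."""

    def __init__(self, d_model=256, n_heads=8, d_ffn=512, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),  # 256 -> 512
            nn.ReLU(),
            nn.Linear(d_ffn, d_model),  # 512 -> 256
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connections with post-norm (layout assumed).
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x

# Example: a batch of 2 samples with 10 queries of dimension 256.
queries = torch.randn(2, 10, 256)
out = AttnFFNBlock()(queries)
print(out.shape)  # torch.Size([2, 10, 256])
```

The block preserves the 256-dim query shape, so it could be stacked inside a decoder; how the paper actually wires it into its semantic and motion decoders is not specified in the quoted text.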