Enhancing End-to-End Autonomous Driving with Latent World Model
Authors: Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, Tieniu Tan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | LAW achieves state-of-the-art performance across multiple benchmarks, including the real-world open-loop benchmark nuScenes, NAVSIM, and the simulator-based closed-loop benchmark CARLA. The code is released at https://github.com/BraveGroup/LAW. Section headings: 5 EXPERIMENTS; 5.1 BENCHMARKS; 5.2 IMPLEMENTATION DETAILS; 5.3 COMPARISON WITH STATE-OF-THE-ART METHODS; 5.4 ABLATION STUDY |
| Researcher Affiliation | Academia | Yingyan Li (1,2,3,4), Lue Fan (1,2,3), Jiawei He (1,2,3), Yuqi Wang (1,2,3), Yuntao Chen (1,2,3), Zhaoxiang Zhang (1,2,3,4, corresponding), Tieniu Tan (1,2,3). Affiliations: 1 Institute of Automation, Chinese Academy of Sciences (CASIA); 2 New Laboratory of Pattern Recognition (NLPR); 3 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); 4 School of Future Technology, University of Chinese Academy of Sciences (UCAS). Email: EMAIL |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | LAW achieves state-of-the-art performance across multiple benchmarks, including the real-world open-loop benchmark nuScenes, NAVSIM, and the simulator-based closed-loop benchmark CARLA. The code is released at https://github.com/BraveGroup/LAW. |
| Open Datasets | Yes | Experiments show that our latent world model enhances performance in both perception-free and perception-based frameworks. Furthermore, we achieve state-of-the-art performance on multiple benchmarks, including the real-world open-loop datasets nuScenes (Caesar et al., 2020) and NAVSIM (Dauner et al., 2024) (based on nuPlan (Caesar et al., 2021)), as well as the simulator-based closed-loop CARLA benchmark (Dosovitskiy et al., 2017). |
| Dataset Splits | No | For the closed-loop benchmark, the training dataset is collected from the CARLA (Dosovitskiy et al., 2017) simulator (version 0.9.10.1) using the teacher model Roach (Zhang et al., 2021) following (Wu et al., 2022; Jia et al., 2023b), resulting in 189K frames. We use the widely-used Town05 Long benchmark (Jia et al., 2023b; Shao et al., 2022; Hu et al., 2022a) to assess the closed-loop driving performance. The paper mentions using well-known benchmarks (nuScenes, NAVSIM, CARLA) and collecting a training dataset for CARLA, but it does not explicitly specify the training, validation, and test splits (e.g., percentages or exact counts) used for the experiments, nor does it cite specific standard splits for its experimental setup beyond mentioning the benchmarks. |
| Hardware Specification | Yes | The model is trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01, batch size 8, and 12 epochs across 8 A6000 GPUs. |
| Software Dependencies | No | The paper mentions various models (e.g., Swin-Transformer-Tiny, ResNet-34 backbone) and optimizers (e.g., the AdamW optimizer), but does not specify software dependencies with version numbers such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | nuScenes Benchmark: We implement both perception-free and perception-based frameworks. In the perception-free framework, Swin-Transformer-Tiny (Swin-T) (Liu et al., 2021) is used as the backbone. Input images are resized to 800×320. We adopt a cosine annealing learning rate schedule (Loshchilov & Hutter, 2016), starting at 5e-5. The model is trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01, batch size 8, and 12 epochs across 8 A6000 GPUs. NAVSIM Benchmark: The perception-free framework is implemented on NAVSIM. Specifically, we employ a ResNet-34 backbone, training for 20 epochs in line with Prakash et al. (2021) to ensure a fair comparison. Input images are resized to 640×320. The Adam optimizer is used with a learning rate of 1e-4 and a batch size of 32. CARLA Benchmark: We follow Wu et al. (2022) to implement a perception-free framework on CARLA. Specifically, we use ResNet-34 as the backbone and employ the TCP head (Wu et al., 2022) as in Jia et al. (2023b). Input images are resized to 900×256. The Adam optimizer is used with a learning rate of 1e-4 and weight decay of 1e-7. The model is trained for 60 epochs with a batch size of 128. After 30 epochs, the learning rate is halved. |
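The quoted nuScenes setup (cosine annealing from 5e-5 over 12 epochs) can be sketched as a minimal schedule function. This is an illustrative reconstruction of the cited Loshchilov & Hutter (2016) schedule, not code from the paper's released repository; the function name and the per-epoch granularity are assumptions.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=12, lr_max=5e-5, lr_min=0.0):
    """Cosine-annealed learning rate for a given epoch.

    Matches the nuScenes configuration quoted above: the rate starts at
    lr_max (5e-5) and decays to lr_min following half a cosine period
    over total_epochs (12) epochs.
    """
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Per-epoch schedule: 5e-5 at epoch 0, 2.5e-5 at the midpoint (epoch 6).
schedule = [cosine_annealing_lr(e) for e in range(12)]
```

In a PyTorch setup this would typically be handled by `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around an `AdamW` optimizer with `weight_decay=0.01`, stepped once per epoch.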