Enhancing End-to-End Autonomous Driving with Latent World Model

Authors: Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, Tieniu Tan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | LAW achieves state-of-the-art performance across multiple benchmarks, including the real-world open-loop benchmark nuScenes, NAVSIM, and the simulator-based closed-loop benchmark CARLA. The code is released at https://github.com/BraveGroup/LAW. Supporting section headings: 5 EXPERIMENTS; 5.1 BENCHMARKS; 5.2 IMPLEMENTATION DETAILS; 5.3 COMPARISON WITH STATE-OF-THE-ART METHODS; 5.4 ABLATION STUDY.
Researcher Affiliation | Academia | Yingyan Li (1,2,3,4), Lue Fan (1,2,3), Jiawei He (1,2,3), Yuqi Wang (1,2,3), Yuntao Chen (1,2,3), Zhaoxiang Zhang (1,2,3,4, corresponding), Tieniu Tan (1,2,3). Affiliations: 1 Institute of Automation, Chinese Academy of Sciences (CASIA); 2 New Laboratory of Pattern Recognition (NLPR); 3 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); 4 School of Future Technology, University of Chinese Academy of Sciences (UCAS).
Pseudocode | No | The paper describes the methodology using mathematical formulations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | LAW achieves state-of-the-art performance across multiple benchmarks, including the real-world open-loop benchmark nuScenes, NAVSIM, and the simulator-based closed-loop benchmark CARLA. The code is released at https://github.com/BraveGroup/LAW.
Open Datasets | Yes | Experiments show that our latent world model enhances performance in both perception-free and perception-based frameworks. Furthermore, we achieve state-of-the-art performance on multiple benchmarks, including the real-world open-loop datasets nuScenes (Caesar et al., 2020) and NAVSIM (Dauner et al., 2024) (based on nuPlan (Caesar et al., 2021)), as well as the simulator-based closed-loop CARLA benchmark (Dosovitskiy et al., 2017).
Dataset Splits | No | For the closed-loop benchmark, the training dataset is collected from the CARLA (Dosovitskiy et al., 2017) simulator (version 0.9.10.1) using the teacher model Roach (Zhang et al., 2021) following (Wu et al., 2022; Jia et al., 2023b), resulting in 189K frames. We use the widely-used Town05 Long benchmark (Jia et al., 2023b; Shao et al., 2022; Hu et al., 2022a) to assess the closed-loop driving performance. The paper uses well-known benchmarks (nuScenes, NAVSIM, CARLA) and describes collecting a 189K-frame CARLA training set, but it does not explicitly specify training/validation/test splits (e.g., percentages or exact counts), nor does it cite standard splits for its experimental setup beyond naming the benchmarks.
Hardware Specification | Yes | The model is trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01, batch size 8, and 12 epochs across 8 A6000 GPUs.
Software Dependencies | No | The paper mentions various models (e.g., Swin-Transformer-Tiny, a ResNet-34 backbone) and optimizers (e.g., AdamW), but does not specify software dependencies with version numbers, such as Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | nuScenes benchmark: We implement both perception-free and perception-based frameworks. In the perception-free framework, Swin-Transformer-Tiny (Swin-T) (Liu et al., 2021) is used as the backbone. Input images are resized to 800×320. We adopt a cosine annealing learning rate schedule (Loshchilov & Hutter, 2016), starting at 5e-5. The model is trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01, batch size 8, and 12 epochs across 8 A6000 GPUs. NAVSIM benchmark: The perception-free framework is implemented on NAVSIM. Specifically, we employ a ResNet-34 backbone, training for 20 epochs in line with Prakash et al. (2021) to ensure a fair comparison. Input images are resized to 640×320. The Adam optimizer is used with a learning rate of 1e-4 and a batch size of 32. CARLA benchmark: We follow Wu et al. (2022) to implement a perception-free framework on CARLA. Specifically, we use ResNet-34 as the backbone and employ the TCP head (Wu et al., 2022) as in Jia et al. (2023b). Input images are resized to 900×256. The Adam optimizer is used with a learning rate of 1e-4 and a weight decay of 1e-7. The model is trained for 60 epochs with a batch size of 128. After 30 epochs, the learning rate is halved.
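The two learning-rate schedules quoted above can be sketched in plain Python as a reference for reimplementation. This is a minimal sketch, not the authors' code: the 5e-5 base rate and 12-epoch horizon (nuScenes) and the halve-after-30-epochs rule (CARLA, base rate 1e-4) come from the quoted setup, while the minimum rate `eta_min=0.0` and per-epoch (rather than per-step) annealing are assumptions the report does not state.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, base_lr=5e-5, eta_min=0.0):
    """Cosine annealing schedule (Loshchilov & Hutter, 2016).

    Decays from base_lr at epoch 0 to eta_min at total_epochs.
    base_lr=5e-5 matches the quoted nuScenes setup; eta_min=0 and
    per-epoch stepping are assumptions.
    """
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

def carla_step_lr(epoch, base_lr=1e-4):
    """CARLA schedule from the quoted setup: constant base_lr,
    halved after 30 of the 60 training epochs."""
    return base_lr if epoch < 30 else base_lr / 2

# 12-epoch nuScenes schedule, one value per epoch boundary
nuscenes_schedule = [cosine_annealing_lr(e, 12) for e in range(13)]
```

The same shapes are available off the shelf in common frameworks (e.g., a cosine-annealing and a step scheduler), so this sketch mainly pins down the constants reported in the paper.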