SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

Authors: Jiawei Zhang, Xuan Yang, Taiqi Wang, Yu Yao, Aleksandr Petiushko, Bo Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on two benchmark datasets: BDD-X (Kim et al., 2018) and DriveLM (Sima et al., 2023), both featuring low-level control signals and high-level action descriptions. Our experimental results demonstrate significant improvements in both low-level control accuracy and high-level action prediction. First, for low-level prediction on the BDD-X dataset, it reduces the Root Mean Square Error (RMSE) for speed and course predictions by an additional 5.8% and 14.1% over the state-of-the-art (SOTA) baselines, respectively. Furthermore, on the DriveLM dataset, it decreases the Average Displacement Error (ADE) for motion prediction by 44.4%.
Researcher Affiliation | Collaboration | 1. University of Chicago; 2. University of Illinois Urbana-Champaign; 3. Nuro; 4. Virtue AI.
Pseudocode | Yes | The pseudocode for implementing this loss in practice is provided in Appendix B. In addition, to balance the loss among various float numbers, we standardize their representation by using consistent digit lengths in text form.
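The excerpt does not specify the exact digit convention used to standardize float representations; a minimal sketch of one such scheme (the helper name `format_signal` and the chosen widths are hypothetical) is:

```python
def format_signal(x: float, width: int = 7, decimals: int = 2) -> str:
    """Render a control signal as fixed-width text: explicit sign,
    zero-padded integer part, and a fixed number of decimals, so every
    float occupies the same number of characters (and thus comparable
    token counts) in the prompt."""
    return f"{x:+0{width}.{decimals}f}"

format_signal(3.5)    # -> '+003.50'
format_signal(-12.7)  # -> '-012.70'
```

With a uniform width, each digit position contributes comparably to the text-form loss across samples, which is the balancing effect the paper describes.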
Open Source Code | Yes | The code is available at https://github.com/AI-secure/SafeAuto.
Open Datasets | Yes | We evaluate our approach on two benchmark datasets: BDD-X (Kim et al., 2018) and DriveLM (Sima et al., 2023), both featuring low-level control signals and high-level action questions. (a) BDD-X: In this work, we adopt the processed version from RAGDriver (Yuan et al., 2024), where the task involves using an input video along with control signals from the past seven frames as context for a conversation that focuses on three types of questions. (b) DriveLM: The DriveLM dataset is built upon the nuScenes dataset (Caesar et al., 2020).
Dataset Splits | Yes | This processed dataset contains 16,390 training video QA conversations and 2,123 test conversations. (b) DriveLM: The DriveLM dataset is built upon the nuScenes dataset (Caesar et al., 2020). In this work, we primarily focus on tasks that involve using six multi-view images from the current frame, and control signals including trajectory positions from the past three seconds as input context. The conversation concentrates on: (i) planning for possible high-level safe actions, (ii) high-level behavior involving predicting speed and steering actions, which serve as multiple-choice questions, and (iii) low-level motion, predicting 2D trajectories for the next three seconds, similar to UniAD (Hu et al., 2023). We filter instances to include only those with a prediction horizon of at least 3 seconds, resulting in a final dataset of 3,447 training conversations and 685 test conversations.
Hardware Specification | Yes | All experiments are conducted on eight NVIDIA A6000 GPUs.
Software Dependencies | No | The paper mentions using "Video-LLaVA" with "Vicuna 1.5 7B", "YOLOv8", and "sentence-t5-xl" as models/tools, but it does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Model. We use the pretrained Video-LLaVA (Lin et al., 2023) with Vicuna 1.5 7B (Zheng et al., 2023) as the base LLM for fine-tuning. We fine-tune the model for 2 epochs with a batch size of 128 on the BDD-X dataset and for 4 epochs with a batch size of 64 on the DriveLM dataset, using a learning rate of 5 × 10⁻². Experimental Details. (a) PDCE loss: During the fine-tuning of the MLLM, we initialize σ in D(µ, σ) at a small value of 0.01 and geometrically increase it after each optimization step until it reaches the predefined value of σ = 0.35. (b) Post-safety verification via MLN: we fine-tune YOLOv8 (Jocher et al., 2023) using the LISA dataset (Jensen et al., 2016) as the object detector for both traffic lights and signs. (c) Multimodal RAG: we consistently employ four-layer multilayer perceptrons (MLPs) as projectors to obtain aligned embeddings for each modality and to generate the final unified embedding, and we use sentence-t5-xl (Ni et al., 2022) as our text encoder. The weighting factors wv and wc are both set to 0.4, while the weight for the predicate embedding wp is set to 0.2. We consistently set the learning rate to 0.001 and the temperature parameter τ to 0.5 for training. On the BDD-X dataset, the projectors are trained for 100 epochs with a batch size of 2,048; for the DriveLM dataset, the projectors are also trained for 100 epochs but with a batch size of 512. Finally, we retrieve the top K = 2 examples on the BDD-X dataset and the top K = 1 example on the DriveLM dataset, for both MLLM fine-tuning and inference.
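The PDCE setup grows σ geometrically from 0.01 to a cap of 0.35 over optimization steps, but the growth factor and warm-up length are not stated in this excerpt. A minimal sketch of such a schedule (the `num_steps` horizon is a hypothetical assumption; the factor is derived so the cap is reached exactly at that step):

```python
def sigma_schedule(sigma_init: float = 0.01,
                   sigma_final: float = 0.35,
                   num_steps: int = 1000):
    """Yield sigma for steps 0..num_steps: geometric growth from
    sigma_init to sigma_final, clamped at sigma_final thereafter."""
    # Per-step multiplicative factor so that sigma_init * ratio**num_steps == sigma_final.
    ratio = (sigma_final / sigma_init) ** (1.0 / num_steps)
    for step in range(num_steps + 1):
        yield min(sigma_init * ratio ** step, sigma_final)
```

At each optimization step the next value would be read from this schedule and plugged into D(µ, σ) before computing the loss.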