World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving
Authors: Mingliang Zhai, Cheng Li, Zengyuan Guo, Ningrui Yang, Xiameng Qin, Sanyuan Zhao, Junyu Han, Ji Tao, Yuwei Wu, Yunde Jia
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the effectiveness of our proposed method. To evaluate the model's utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, where the questions necessitate multi-step reasoning leveraging world knowledge for resolution. |
| Researcher Affiliation | Collaboration | 1 Beijing Institute of Technology; 2 Shenzhen MSU-BIT University; 3 Chongqing Changan Automobile Co., Ltd. |
| Pseudocode | No | The paper describes methods using mathematical formulations (e.g., equations for interactor process) and textual descriptions, but no explicit pseudocode blocks or algorithms are presented. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | To achieve multi-modal alignment, we collected and refined large-scale multi-perspective image-text pairs, including 1.7M grounding data, 200K object-level caption data (objects, risks, weather, etc.), 4 open-source datasets, and our object-level risk assessment dataset, totaling 4M samples. Then we format all the data into a unified format. Regarding the grounding data, we use a pre-trained Grounding-DINO (Liu et al. 2023b) model, specifically trained on traffic scenes, to extract all significant objects from single-view images, such as vehicles, pedestrians, traffic signs, and traffic lights. Object-level Risks Assessment (ORA): To evaluate the model's performance in perception-limited regions, we propose an object-level risk assessment dataset based on NuScenes (Caesar et al. 2019). |
| Dataset Splits | Yes | Train/test splits per dataset: NuScenes-QA 376k / 83k; NuScenes-MQA 1204k / 255k; OmniDrive-NuScenes 486k / 90k; NuInstruct 72k / 15k; Risk Assessment 166k / 35k; Total 2304k / 478k. |
| Hardware Specification | Yes | We use 32 Tesla A100 80G GPUs to train for 3 days. For this stage, we use 8 Tesla A100 80G GPUs, and the training is conducted over a period of 8 hours. |
| Software Dependencies | No | The paper mentions models like EVA-02-L and LLaMA3-8B and an optimizer (AdamW), but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | When selecting top-k tokens, we set k = 90 for image features and k = 300 for BEV features. During the single-view and multi-view alignment pretraining stage, we adopt the same strategies as LLaVA-NeXT (Liu et al. 2024a), including optimizer, learning rate, and batch size, training for 2 epochs. For the task-specific instruction tuning stage, we use the AdamW (Loshchilov and Hutter 2017) optimizer, setting the learning rate to 1×10⁻⁵ and a batch size of 8. To promote training stability and convergence, we implement a cosine annealing learning rate schedule with a warm-up period. |
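The experiment-setup row above names two reproducible ingredients: top-k token selection (k = 90 for image features, k = 300 for BEV features) and a cosine-annealing learning-rate schedule with warm-up around a base rate of 1×10⁻⁵. A minimal sketch of both follows; the warm-up length and the token scoring are assumptions, since the paper does not specify them.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-5, warmup_steps=500):
    """Cosine-annealing LR with linear warm-up (warmup_steps is an
    assumed value; the paper only states that a warm-up is used)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def topk_tokens(tokens, scores, k):
    """Keep the k highest-scoring feature tokens, preserving their
    original order (k = 90 for image features, k = 300 for BEV
    features per the paper; the scoring itself is hypothetical)."""
    keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]
```

The schedule ramps linearly to the base rate, then decays to zero following a half cosine, which matches the stated goal of stabilizing early training while still converging.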