World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving

Authors: Mingliang Zhai, Cheng Li, Zengyuan Guo, Ningrui Yang, Xiameng Qin, Sanyuan Zhao, Junyu Han, Ji Tao, Yuwei Wu, Yunde Jia

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments validate the effectiveness of our proposed method. To evaluate the model's utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, where the questions necessitate multi-step reasoning leveraging world knowledge for resolution."
Researcher Affiliation | Collaboration | 1. Beijing Institute of Technology; 2. Shenzhen MSU-BIT University; 3. Chongqing Changan Automobile Co., Ltd.
Pseudocode | No | The paper describes its methods with mathematical formulations (e.g., equations for the interactor process) and textual descriptions, but presents no explicit pseudocode blocks or algorithms.
Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no link to a code repository for the described methodology.
Open Datasets | Yes | "To achieve multi-modal alignment, we collected and refined a large-scale multi-perspective image-text pair dataset, including 1.7M grounding data, 200K object-level caption data (objects, risks, weather, etc.), 4 open-source datasets, and our object-level risk assessment dataset, 4M samples in total. Then we format all the data into a unified format. Regarding the grounding data, we use a pre-trained Grounding-DINO (Liu et al. 2023b) model, specifically trained on traffic scenes, to extract all significant objects from single-view images, such as vehicles, pedestrians, traffic signs, and traffic lights." Object-level Risk Assessment (ORA): "To evaluate the model's performance in perception-limited regions, we propose an object-level risk assessment dataset based on nuScenes (Caesar et al. 2019)."
Dataset Splits | Yes |

| Dataset | Train | Test |
| --- | --- | --- |
| nuScenes-QA | 376k | 83k |
| nuScenes-MQA | 1204k | 255k |
| OmniDrive-nuScenes | 486k | 90k |
| NuInstruct | 72k | 15k |
| Risk Assessment | 166k | 35k |
| Total | 2304k | 478k |
Hardware Specification | Yes | "We use 32 Tesla A100 80G GPUs to train for 3 days. For this stage, we use 8 Tesla A100 80G GPUs, and the training is conducted over a period of 8 hours."
Software Dependencies | No | The paper mentions models such as EVA-02-L and LLaMA3-8B and an optimizer (AdamW), but does not give version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "When selecting top-k tokens, we set k = 90 for image features and k = 300 for BEV features. During the single-view and multi-view alignment pretraining stage, we adopt the same strategies as LLaVA-NeXT (Liu et al. 2024a), including optimizer, learning rate, and batch size, training for 2 epochs. For the task-specific instruction tuning stage, we use the AdamW (Loshchilov and Hutter 2017) optimizer, setting the learning rate to 1e-5 and the batch size to 8. To promote training stability and convergence, we implement a cosine annealing learning rate schedule with a warm-up period."
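The instruction-tuning schedule described above (AdamW at lr 1e-5, cosine annealing with warm-up) can be sketched as a plain learning-rate function. Note the paper does not report the warm-up length or total step count; `WARMUP_STEPS` and `TOTAL_STEPS` below are illustrative assumptions, not values from the paper.

```python
import math

# Assumed values: only BASE_LR (1e-5) comes from the paper; the warm-up
# length and total step count are hypothetical placeholders.
BASE_LR = 1e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 10_000

def lr_at_step(step: int) -> float:
    """Linear warm-up to BASE_LR, then cosine annealing toward zero."""
    if step < WARMUP_STEPS:
        # Linear ramp: reaches BASE_LR exactly at the end of warm-up.
        return BASE_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the post-warm-up schedule completed, in [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    # Cosine decay from BASE_LR down to 0.
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

In practice the same shape is usually obtained by wrapping the optimizer in a framework scheduler (e.g., a cosine scheduler with warm-up), but the closed-form version above makes the reported hyperparameters easy to check.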