WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving
Authors: Yiheng Li, Cunxin Fan, Chongjian Ge, Seth Z. Zhao, Chenran Li, Chenfeng Xu, Huaxiu Yao, Masayoshi Tomizuka, Bolei Zhou, Chen Tang, Mingyu Ding, Wei Zhan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative evaluations are performed on the WOMD-Reasoning dataset as well as the outputs of Motion-LLaVA, supporting the data quality and wide applicability of WOMD-Reasoning in interaction prediction, traffic rule compliance planning, etc. |
| Researcher Affiliation | Academia | 1UC Berkeley 2UCLA 3UNC-Chapel Hill 4UT Austin. Correspondence to: Mingyu Ding <EMAIL>, Chen Tang <EMAIL>. |
| Pseudocode | No | The paper describes methods like an "automated data curation pipeline" and a "Chain-of-Thought (CoT) approach" but does not contain a dedicated section or figure labeled "Pseudocode" or "Algorithm" with structured steps. |
| Open Source Code | Yes | The code & prompts to build it are available at https://github.com/yhli123/WOMD-Reasoning. |
| Open Datasets | Yes | Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&A dataset built on WOMD, focusing on describing and reasoning about traffic rule-induced interactions in driving scenarios. WOMD-Reasoning is also by far the largest multi-modal Q&A dataset, with 3 million Q&As on real-world driving scenarios... The dataset and its vision-modality extension are available at https://waymo.com/open/download/. |
| Dataset Splits | Yes | We build two subsets of WOMD-Reasoning with the same setting: the training set is built on the training set of WOMD, while the validation set is built on the interactive validation set of WOMD. In total, we translate 63k scenarios into language. [Additionally, Table 2 reports: Training: 2,430k Q&As across 52k scenes; Validation: 510k Q&As across 11k scenes.] |
| Hardware Specification | Yes | Our multi-modal fine-tuning takes 2 GPU days (1 day on 2x NVIDIA A6000 GPUs) to train 1 epoch on the entire training set of WOMD-Reasoning. |
| Software Dependencies | Yes | Specifically, we take LLaVA-v1.5-7b (Liu et al., 2024) as the pre-trained VLM. The t5-v1_1-xxl model is used as the language encoder. |
| Experiment Setup | Yes | During training, we unfreeze all components, including the motion vector encoder, and train on all Q&A pairs in WOMD-Reasoning simultaneously to avoid potential catastrophic forgetting. In the experiment using language from Motion-LLaVA outputs to assist vehicle trajectory prediction, we use a batch size of 128 and a learning rate of 1e-4 for Multipath++. |
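As a minimal sketch of the reported experiment setup, the snippet below collects the hyperparameters quoted above (batch size 128, learning rate 1e-4, and the 2,430k training Q&As from Table 2) into a config and derives the implied optimizer steps per epoch. The `TrainConfig` class and its field names are hypothetical conveniences, not part of the paper's released code.

```python
import math
from dataclasses import dataclass

# Illustrative only: the class and field names are assumptions; the values
# are the hyperparameters and split sizes reported in the table above.
@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 128          # reported batch size for Multipath++
    learning_rate: float = 1e-4    # reported learning rate
    train_qas: int = 2_430_000     # Q&A pairs in the training split (Table 2)

    def steps_per_epoch(self) -> int:
        # Optimizer steps needed to see every training Q&A pair once.
        return math.ceil(self.train_qas / self.batch_size)

cfg = TrainConfig()
print(cfg.steps_per_epoch())  # → 18985
```

This gives a quick sanity check when reproducing the single-epoch fine-tuning budget (2 GPU days on 2x A6000) quoted in the hardware row.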