WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving
Authors: Yiheng Li, Cunxin Fan, Chongjian Ge, Seth Z. Zhao, Chenran Li, Chenfeng Xu, Huaxiu Yao, Masayoshi Tomizuka, Bolei Zhou, Chen Tang, Mingyu Ding, Wei Zhan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative evaluations are performed on the WOMD-Reasoning dataset as well as the outputs of Motion-LLaVA, supporting the data quality and wide applicability of WOMD-Reasoning in interaction prediction, traffic rule compliance planning, etc. |
| Researcher Affiliation | Academia | 1UC Berkeley 2UCLA 3UNC-Chapel Hill 4UT Austin. Correspondence to: Mingyu Ding <EMAIL>, Chen Tang <EMAIL>. |
| Pseudocode | No | The paper describes methods like an "automated data curation pipeline" and a "Chain-of-Thought (CoT) approach" but does not contain a dedicated section or figure labeled "Pseudocode" or "Algorithm" with structured steps. |
| Open Source Code | Yes | The code & prompts to build it are available at https://github.com/yhli123/WOMD-Reasoning. |
| Open Datasets | Yes | Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&A dataset built on WOMD, focusing on describing and reasoning about traffic rule-induced interactions in driving scenarios. WOMD-Reasoning is also by far the largest multi-modal Q&A dataset, with 3 million Q&As on real-world driving scenarios... The dataset and its vision-modality extension are available at https://waymo.com/open/download/. |
| Dataset Splits | Yes | We build two subsets of WOMD-Reasoning with the same setting: the training set is built on the training set of WOMD, while the validation set is built on the interactive validation set of WOMD. In total, we translate 63k scenarios into language. [Additionally, Table 2 reports: Training: 2,430k Q&As across 52k scenes; Validation: 510k Q&As across 11k scenes.] |
| Hardware Specification | Yes | Our multi-modal fine-tuning takes 2 GPU days (1 day on 2x NVIDIA A6000 GPUs) to train 1 epoch on the entire training set of WOMD-Reasoning. |
| Software Dependencies | Yes | Specifically, we take LLaVA-v1.5-7b (Liu et al., 2024) as the pre-trained VLM. The t5-v1_1-xxl model is used as the language encoder. |
| Experiment Setup | Yes | During training, we unfreeze all components, including the motion vector encoder, and train on all Q&A pairs in WOMD-Reasoning simultaneously to avoid potential catastrophic forgetting. In the experiment using language from Motion-LLaVA outputs to assist vehicle trajectory prediction, we use a batch size of 128 and a learning rate of 1e-4 for Multipath++. |
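As a minimal sketch of the reported experiment setup, the snippet below collects the hyperparameters quoted above (batch size 128, learning rate 1e-4, and the 2,430k training Q&As from Table 2) into a config and derives the implied optimizer steps per epoch. The `TrainConfig` class and its field names are hypothetical conveniences, not part of the paper's released code.

```python
import math
from dataclasses import dataclass

# Illustrative only: the class and field names are assumptions; the values
# are the hyperparameters and split sizes reported in the table above.
@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 128          # reported batch size for Multipath++
    learning_rate: float = 1e-4    # reported learning rate
    train_qas: int = 2_430_000     # Q&A pairs in the training split (Table 2)

    def steps_per_epoch(self) -> int:
        # Optimizer steps needed to see every training Q&A pair once.
        return math.ceil(self.train_qas / self.batch_size)

cfg = TrainConfig()
print(cfg.steps_per_epoch())  # → 18985
```

This gives a quick sanity check when reproducing the single-epoch fine-tuning budget (2 GPU days on 2x A6000) quoted in the hardware row.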