DriveGPT: Scaling Autoregressive Behavior Models for Driving
Authors: Xin Huang, Eric M Wolff, Paul Vernaza, Tung Phan-Minh, Hongge Chen, David S Hayden, Mark Edmonds, Brian Pierce, Xinxin Chen, Pratik Elias Jacob, Xiaobai Chen, Chingiz Tairbekov, Pratik Agarwal, Tianshi Gao, Yuning Chai, Siddhartha Srinivasa
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DriveGPT across different scales in a planning task, through both quantitative metrics and qualitative examples, including closed-loop driving in complex real-world scenarios. In a separate prediction task, DriveGPT outperforms state-of-the-art baselines and exhibits improved performance by pretraining on a large-scale dataset, further validating the benefits of data scaling. We present a comprehensive study of scaling up data sizes and model parameters in the context of behavior modeling for autonomous driving... We quantitatively and qualitatively compare models from our scaling experiments to validate their effectiveness in real-world driving scenarios. We present real-world deployment of our model through closed-loop driving in challenging conditions. |
| Researcher Affiliation | Industry | 1Cruise LLC, San Francisco, CA 2Meta, Menlo Park, CA. Correspondence to: Xin Huang, Eric M. Wolff <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and training process in detail using text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a link to a code repository for the DriveGPT methodology. It mentions using an "open-source MTR (Shi et al., 2022) encoder" for external evaluation, but this refers to a third-party component, not the authors' own implementation. |
| Open Datasets | Yes | The smallest dataset of 2.2M samples mimics the size of Waymo Open Motion Dataset (WOMD) (Ettinger et al., 2021), a large open-source dataset for behavior modeling... To directly compare with published results, we evaluate DriveGPT on the WOMD motion prediction task. |
| Dataset Splits | Yes | We evaluate model performance using validation loss, computed on a comprehensive validation set of 10 million samples drawn from the same distribution as the training data, with no overlap. This set remains fixed across all scaling experiments to ensure consistency... The smallest dataset of 2.2M samples mimics the size of Waymo Open Motion Dataset (WOMD)... We pretrain DriveGPT by training on our internal research dataset for one epoch... We load the pretrained checkpoint and finetune the model using the same training setup as in the MTR codebase... Each metric is measured on the test set and computed over three different time horizons. |
| Hardware Specification | Yes | All models were trained on 16 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions using a "standard Adam optimizer" and "AdamW optimizer" but does not specify version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages. |
| Experiment Setup | Yes | We use a single cross-entropy classification loss over the action space... We train our models on the internal research dataset for 1 epoch... Our models follow the implementations described in (Nayakanti et al., 2023; Seff et al., 2023), and are trained using a batch size of 2048 and a standard Adam optimizer... We follow the optimal learning rate schedule discovered in (Hoffmann et al., 2022), which applies a cosine decay with a cycle length equivalent to the total number of training steps... We use a batch size of 80 and an AdamW optimizer with a learning rate of 0.0001. The models are trained for 30 epochs, where the learning rate is decayed by a factor of 0.5 every 2 epochs, starting from epoch 20. |
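The Experiment Setup row quotes two learning-rate schedules: a cosine decay over the full training run for the scaling experiments, and a stepped decay for WOMD finetuning. Since the paper releases no code, the sketch below is a minimal reimplementation of those two schedules from the quoted description alone; function names and the interpretation that the first halving occurs at epoch 20 are assumptions.

```python
import math

def cosine_decay_lr(step, total_steps, peak_lr):
    """Cosine decay with cycle length equal to the total number of
    training steps, as quoted (following Hoffmann et al., 2022)."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))

def finetune_lr(epoch, base_lr=1e-4, decay_start=20, decay_every=2, factor=0.5):
    """Stepped schedule quoted for WOMD finetuning: lr 0.0001, decayed by
    0.5 every 2 epochs starting from epoch 20 (30 epochs total). Whether
    the first decay applies at epoch 20 itself is an assumption here."""
    if epoch < decay_start:
        return base_lr
    n_decays = (epoch - decay_start) // decay_every + 1
    return base_lr * factor ** n_decays

# Cosine schedule: full peak at step 0, decaying to ~0 at the final step.
print(cosine_decay_lr(0, 1000, 3e-4))    # 0.0003
print(cosine_decay_lr(1000, 1000, 3e-4))

# Finetuning schedule: flat for 20 epochs, then halved every 2 epochs.
print(finetune_lr(0))   # 0.0001
print(finetune_lr(20))  # 5e-05
print(finetune_lr(29))
```

The peak learning rate for the cosine schedule (3e-4 above) is a placeholder; the paper's quoted setup does not specify it per model size.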