EMMA: End-to-End Multimodal Model for Autonomous Driving

Authors: Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, Mingxing Tan

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications.
Researcher Affiliation | Industry | Contact emails: Mingxing Tan <EMAIL>, Jyh-Jing Hwang <EMAIL>.
Pseudocode | No | The paper describes its methodology using descriptive text and mathematical equations (e.g., O = G(T, V)), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
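For readers unfamiliar with the formulation cited above, O = G(T, V) says a single multimodal model G maps a task prompt T and visual input V to all outputs O (waypoints, detections, etc.) as text. The following is a minimal interface sketch under that assumption; the `generate` helper and the placeholder model are hypothetical, not the authors' code.

```python
def generate(model, text_prompt, camera_frames):
    # O = G(T, V): the MLLM consumes a task prompt T plus camera input V
    # and decodes every output O as plain text, with no task-specific heads.
    return model(text_prompt, camera_frames)

# Placeholder standing in for a Gemini-style MLLM (illustration only).
dummy = lambda t, v: f"waypoints for '{t}' from {len(v)} frames"
out = generate(dummy, "drive straight", ["frame0", "frame1"])
```

The key design point the paper claims is exactly this uniformity: because every task shares the single text-in/text-out interface, co-training planning, detection, and road-graph tasks needs no architectural changes.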
Open Source Code | No | The paper mentions using an "open-sourced MLLM, PaLI-X (Chen et al., 2024d)" for experiments, which refers to a third-party model. However, it contains no statements or links indicating that the authors' own EMMA implementation is open-source or publicly available.
Open Datasets | Yes | Overall, we leverage three public datasets, nuScenes (Caesar et al., 2020), Waymo Open Motion Dataset (WOMD) (Chen et al., 2024a) and Waymo Open Dataset (WOD) (Sun et al., 2020).
Dataset Splits | No | The paper describes how individual samples are structured (for WOMD, "1 second is used as input context, and the remaining 8 seconds serve as the prediction target"; for nuScenes, "predict the next 3 seconds of future driving actions based on 2 seconds of historical data") and refers to the "standard protocol" of the public benchmarks. However, it does not explicitly report train/validation/test split ratios or sample counts for the datasets used to train the model, nor does it specify custom split files.
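The per-sample windowing quoted above (e.g., WOMD's 1 s of input context followed by an 8 s prediction target) can be sketched as a simple trajectory split; the helper name and the 10 Hz sampling rate are illustrative assumptions, not details from the paper.

```python
def split_trajectory(traj, context_sec, target_sec, hz=10):
    """Split a trajectory into (input context, prediction target) windows.

    `hz` is an assumed sampling rate used only for this illustration.
    """
    n_ctx = int(context_sec * hz)
    n_tgt = int(target_sec * hz)
    assert len(traj) >= n_ctx + n_tgt, "trajectory too short for the split"
    return traj[:n_ctx], traj[n_ctx:n_ctx + n_tgt]

# WOMD-style sample: 1 s context, 8 s target, at the assumed 10 Hz.
traj = list(range(90))
ctx, tgt = split_trajectory(traj, 1, 8)
```

Note this describes how each sample is windowed in time, which is distinct from the missing information the row flags: how the pool of samples is partitioned into train/validation/test sets.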
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other computing resources used for running its experiments.
Software Dependencies | No | The paper mentions models such as "Gemini 1.0 Nano-1" and "PaLI-X" but does not list any specific software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages with version numbers that would be necessary to replicate the experimental setup.
Experiment Setup | No | The paper describes some aspects of the training strategy, such as batch sampling for generalist training and top-K decoding for inference. However, it does not provide specific hyperparameters such as learning rate, exact batch sizes, number of training epochs, or optimizer details needed to reproduce the experimental setup.