X-Drive: Cross-modality Consistent Multi-Sensor Data Synthesis for Driving Scenarios

Authors: Yichen Xie, Chenfeng Xu, Chensheng Peng, Shuqi Zhao, Nhat Ho, Alexander Pham, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to input conditions while ensuring reliable cross-modality consistency. Our code will be made publicly available at https://github.com/yichen928/X-Drive." "Extensive experiments demonstrate the great ability of X-DRIVE in generating realistic multimodality sensor data. It notably outperforms previous specialized single-modality algorithms in the quality of both synthetic point clouds and multi-view images."
Researcher Affiliation | Collaboration | Yichen Xie¹, Chenfeng Xu¹, Chensheng Peng¹, Shuqi Zhao¹, Nhat Ho², Alexander T. Pham³, Mingyu Ding¹, Masayoshi Tomizuka¹, Wei Zhan¹ (¹UC Berkeley, ²UT Austin, ³Toyota North America)
Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code will be made publicly available at https://github.com/yichen928/X-Drive."
Open Datasets | Yes | "Dataset. We evaluate our method using nuScenes dataset (Caesar et al., 2020)."
Dataset Splits | Yes | "We follow the official setting to employ 700 driving scenes for training and 150 scenes for validation."
Hardware Specification | Yes | "In all the stages, our model is trained using NVIDIA RTX A6000 GPUs."
Software Dependencies | Yes | "We utilize the Stable Diffusion pretrained weight to initialize the multi-view image branch with other newly added parameters randomly initialized. We follow MagicDrive (Gao et al., 2023) and RangeLDM (Hu et al., 2024a) to synthesize 224×400 multi-view camera images and 32×1024 point cloud range images."
Experiment Setup | Yes | "In the first stage, VAE for LiDAR range image is trained using batch size 96 and learning rate 4e-4 for 200 epochs. The discriminator takes effect after 1000 iterations. In the second stage, we train the LiDAR LDM from scratch using batch size 96 and learning rate 1e-4 for 2000 epochs. The model includes the text prompt and 3D range-view bounding box condition modules with drop-rate 0.25 for either condition during training. The entire model is trained for 250 epochs with learning rate 8e-5 and batch size 24 in our main experiments. For ablation studies, we reduce the epoch number to 80 for efficiency."
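The quoted three-stage training schedule can be summarized as a configuration sketch. This is purely illustrative: the stage and field names below are hypothetical (not taken from the released code), while the numeric values come from the paper's quoted setup.

```python
# Illustrative summary of X-Drive's reported three-stage training schedule.
# Stage/field names are hypothetical; values are from the quoted experiment setup.
TRAINING_STAGES = {
    "stage1_lidar_range_vae": {
        "batch_size": 96,
        "learning_rate": 4e-4,
        "epochs": 200,
        "discriminator_start_iteration": 1000,  # GAN loss enabled after 1000 iters
    },
    "stage2_lidar_ldm": {
        "batch_size": 96,
        "learning_rate": 1e-4,
        "epochs": 2000,
        "condition_drop_rate": 0.25,  # applied to text prompt and 3D box conditions
    },
    "stage3_full_model": {
        "batch_size": 24,
        "learning_rate": 8e-5,
        "epochs": 250,           # main experiments
        "epochs_ablation": 80,   # reduced for ablation studies
    },
}

def describe(stages: dict) -> str:
    """Render the schedule as one line per stage for a quick sanity check."""
    return "\n".join(
        f"{name}: bs={cfg['batch_size']}, lr={cfg['learning_rate']}, "
        f"epochs={cfg['epochs']}"
        for name, cfg in stages.items()
    )

print(describe(TRAINING_STAGES))
```

A dictionary like this makes it easy to verify that a reproduction run matches the reported hyperparameters stage by stage.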