X-Drive: Cross-modality Consistent Multi-Sensor Data Synthesis for Driving Scenarios
Authors: Yichen Xie, Chenfeng Xu, Chensheng Peng, Shuqi Zhao, Nhat Ho, Alexander Pham, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to input conditions while ensuring reliable cross-modality consistency. Our code will be made publicly available at https://github.com/yichen928/X-Drive. Extensive experiments demonstrate the great ability of X-DRIVE in generating realistic multimodality sensor data. It notably outperforms previous specialized single-modality algorithms in the quality of both synthetic point clouds and multi-view images. |
| Researcher Affiliation | Collaboration | Yichen Xie1, Chenfeng Xu1, Chensheng Peng1, Shuqi Zhao1, Nhat Ho2, Alexander T. Pham3, Mingyu Ding1, Masayoshi Tomizuka1, Wei Zhan1 — 1UC Berkeley, 2UT Austin, 3Toyota North America |
| Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be made publicly available at https://github.com/yichen928/X-Drive. |
| Open Datasets | Yes | Dataset. We evaluate our method using nuScenes dataset (Caesar et al., 2020). |
| Dataset Splits | Yes | We follow the official setting to employ 700 driving scenes for training and 150 scenes for validation. |
| Hardware Specification | Yes | In all the stages, our model is trained using NVIDIA RTX A6000 GPUs. |
| Software Dependencies | Yes | We utilize the Stable-Diffusion pretrained weight to initialize the multi-view image branch with other newly added parameters randomly initialized. We follow MagicDrive (Gao et al., 2023) and RangeLDM (Hu et al., 2024a) to synthesize 224×400 multi-view camera images and 32×1024 point cloud range images. |
| Experiment Setup | Yes | In the first stage, VAE for LiDAR range image is trained using batch size 96 and learning rate 4e-4 for 200 epochs. The discriminator takes effect after 1000 iterations. In the second stage, we train the LiDAR LDM from scratch using batch size 96 and learning rate 1e-4 for 2000 epochs. The model includes the text prompt and 3D range-view bounding box condition modules with drop-rate 0.25 for either condition during training. The entire model is trained for 250 epochs with learning rate 8e-5 and batch size 24 in our main experiments. For ablation studies, we reduce the epoch number to 80 for efficiency. |
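The staged training schedule quoted in the "Experiment Setup" row can be summarized as a small config sketch. This is a hypothetical summary for readers attempting reproduction, not code from the X-Drive repository; the stage names, the `Stage` dataclass, and the `SCHEDULE` list are illustrative, while the numeric hyperparameters are taken verbatim from the quote above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    """One training stage as described in the paper's experiment setup."""
    name: str
    batch_size: int
    learning_rate: float
    epochs: int

# Stage names are our own labels; hyperparameters match the quoted setup.
SCHEDULE = [
    Stage("vae_lidar_range_image", batch_size=96, learning_rate=4e-4, epochs=200),
    Stage("lidar_ldm_from_scratch", batch_size=96, learning_rate=1e-4, epochs=2000),
    Stage("joint_cross_modality", batch_size=24, learning_rate=8e-5, epochs=250),
]

# Drop rate applied independently to the text-prompt and 3D range-view
# bounding-box conditions during LiDAR LDM training (classifier-free style).
CONDITION_DROP_RATE = 0.25

# In ablation studies the final stage runs 80 epochs instead of 250.
ABLATION_FINAL_STAGE_EPOCHS = 80
```

A checklist like this makes it easy to verify that a reimplementation matches the reported schedule stage by stage before comparing synthesis quality.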