Glad: A Streaming Scene Generator for Autonomous Driving

Authors: Bin Xie, Yingfei Liu, Tiancai Wang, Jiale Cao, Xiangyu Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments are performed on the widely-used nuScenes dataset. Experimental results demonstrate that our proposed Glad achieves promising performance, serving as a strong baseline for online video generation." "We perform the experiments on the public autonomous driving dataset nuScenes, which demonstrates the efficacy of our Glad." "The experiments are performed on the widely-used dataset nuScenes."
Researcher Affiliation | Collaboration | Bin Xie1, Yingfei Liu2, Tiancai Wang2, Jiale Cao1, Xiangyu Zhang2,3; 1Tianjin University, 2MEGVII Technology, 3StepFun
Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | "We will release the source code and models publicly."
Open Datasets | Yes | "Extensive experiments are performed on the widely-used nuScenes dataset." nuScenes (Caesar et al., 2019), CARLA (Dosovitskiy et al., 2017), Waymo (Ettinger et al., 2021), and ONCE (Mao et al., 2021).
Dataset Splits | Yes | The nuScenes dataset was collected from 1000 different driving scenes in Boston and Singapore. These scenes are split into training, validation, and test sets: the training set contains 700 scenes, the validation set contains 150 scenes, and the test set contains 150 scenes. Every scene has 6 camera views, and each view records roughly 20 seconds of driving video. "We split each video into 2 clips to balance video length and data diversity."
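The clip-splitting rule quoted above ("split each video into 2 clips") can be sketched as follows. The paper's code is not released, so the function name and the 2 Hz keyframe assumption are illustrative, not taken from the authors' implementation:

```python
# Hypothetical sketch of the clip-splitting step: each ~20 s nuScenes
# scene video is divided into 2 contiguous clips to balance video
# length and data diversity. Not the authors' released code.

def split_scene_into_clips(frames, num_clips=2):
    """Split a scene's ordered frame list into num_clips contiguous clips."""
    clip_len = (len(frames) + num_clips - 1) // num_clips  # ceiling division
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

# A 20 s scene sampled at the nuScenes 2 Hz keyframe rate has ~40 keyframes.
frames = list(range(40))
clips = split_scene_into_clips(frames)
print(len(clips), [len(c) for c in clips])  # 2 [20, 20]
```

Contiguous (rather than interleaved) splitting keeps each clip temporally coherent, which matters for a streaming video generator.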
Hardware Specification | Yes | "We train our models on 8 NVIDIA A100 GPUs with a mini-batch of 2 images." The inference time of the complete denoising process is reported on a single NVIDIA A100 GPU.
Software Dependencies | Yes | "Our Glad is implemented based on Stable Diffusion 2.1 (Rombach et al., 2022)."
Experiment Setup | Yes | We train our models on 8 NVIDIA A100 GPUs with a mini-batch of 2 images. During training, we first perform image-level pre-training with a constant learning rate of 4×10⁻⁵ for 1.25M iterations in total. Afterwards, we fine-tune our Glad on the nuScenes dataset with the same settings for 48 epochs. We split each video into 2 clips to balance video length and data diversity. During inference, we use the DDIM (Song et al., 2020) sampler with 25 sampling steps and a CFG scale of 5.0. Each image is generated at a spatial resolution of 256×3072 pixels covering 6 different views, then split into 6 images of 256×512 pixels for evaluation.
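Two inference details in the setup above are easy to make concrete: the standard classifier-free guidance rule with the reported scale of 5.0, and the split of the 256×3072 panorama into 6 views of 256×512. This is a minimal sketch under those stated numbers, not the authors' code; the function names are assumptions:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale=5.0):
    # Standard classifier-free guidance: eps = eps_u + s * (eps_c - eps_u).
    # scale=5.0 matches the CFG scale reported in the paper.
    return eps_uncond + scale * (eps_cond - eps_uncond)

def split_panorama(panorama, num_views=6):
    # (H, W, C) panorama -> list of num_views arrays of shape (H, W // num_views, C).
    return np.split(panorama, num_views, axis=1)

# The 256x3072 multi-view canvas splits into 6 views of 256x512 for evaluation.
panorama = np.zeros((256, 3072, 3), dtype=np.uint8)
views = split_panorama(panorama)
print(len(views), views[0].shape)  # 6 (256, 512, 3)
```

Generating all 6 views as one wide image lets the denoiser share context across adjacent cameras; the per-view crop is only applied afterwards for metric computation.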