FreeVS: Generative View Synthesis on Free Driving Trajectory

Authors: Qitai Wang, Lue Fan, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the Waymo Open Dataset show that FreeVS has a strong image synthesis performance on both the recorded trajectories and novel trajectories.
Researcher Affiliation | Academia | 1 School of Future Technology, University of Chinese Academy of Sciences (UCAS); 2 NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA); 3 CUHK; 4 Center for Artificial Intelligence and Robotics, HKISI, CAS
Pseudocode | No | The paper describes the method and training process verbally and through equations, but does not include a dedicated pseudocode or algorithm block.
Open Source Code | Yes | Project Page & Code: https://freevs24.github.io/
Open Datasets | Yes | Experiments on the Waymo Open Dataset show that FreeVS has a strong image synthesis performance on both the recorded trajectories and novel trajectories.
Dataset Splits | Yes | For the front-view or multi-view novel frame synthesis benchmark (Fig. 3(a) and (b)), we sample every fourth frame in driving sequences as test frames. All the remaining frames are used for training NVS counterparts, or as input frames for FreeVS. Under the novel camera synthesis benchmark, we reserve all the front-side camera views as test views and use the front and side camera views as train views throughout each sequence.
Hardware Specification | Yes | All experiments are conducted on NVIDIA L20 GPUs. The training costs of generalizable reconstruction methods are measured on 2 RTX 3090 GPUs, while the training cost of FreeVS is measured on 8 NVIDIA L20 GPUs. Similarly, the inference efficiency of previous methods / FreeVS is measured on RTX 3090 / L20 GPUs.
Software Dependencies | No | The paper mentions using Stable Video Diffusion and Stable Diffusion checkpoints, the AdamW optimizer, and model backbones (ConvNeXt-T, the CLIP vision model), but does not provide version numbers for programming languages, libraries, or frameworks such as PyTorch or TensorFlow, which are essential for reproducibility.
Experiment Setup | Yes | We train the model for 40,000 iterations with a batch size of 8 and video frame length n = 8. We use the AdamW optimizer (Kingma & Ba, 2014) with a learning rate of 1×10⁻⁴. During training, we randomly drop the pseudo-image condition latent as well as the CLIP text description latent with a probability of 20%. We enable the viewpoint transformation simulation with a probability of 50%. During inference, we set the number of sampling steps to 25 and stochasticity η = 1.0. When synthesizing images on the existing trajectory, we set the classifier-free guidance (CFG) (Ho & Salimans, 2022) scale to 1.0. For synthesizing images on novel cameras and new trajectories, we enlarge the CFG scale to 2.0 to strengthen the control of 3D prior conditions over the generated results.
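The split rule quoted under Dataset Splits (every fourth frame held out as a test frame) can be sketched as follows. This is an illustration only: the helper name `split_frames` and the zero starting offset are assumptions, not taken from the FreeVS codebase.

```python
def split_frames(num_frames: int, test_every: int = 4):
    """Hold out every fourth frame as a test frame; the remaining frames
    serve as training frames (for NVS baselines) or as input frames (for
    FreeVS). The starting offset of 0 is an assumption; the paper only
    states 'every fourth frame'."""
    test_ids = [i for i in range(num_frames) if i % test_every == 0]
    train_ids = [i for i in range(num_frames) if i % test_every != 0]
    return train_ids, test_ids

# For a 12-frame sequence, frames 0, 4, 8 become test frames.
train_ids, test_ids = split_frames(12)
```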
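Collecting the numbers quoted under Experiment Setup into a small config makes the two inference regimes explicit. The key names and the `cfg_scale` helper below are illustrative assumptions, not the authors' released code.

```python
# Hyperparameters as reported in the paper; dictionary keys are made up.
TRAIN_CFG = {
    "iterations": 40_000,
    "batch_size": 8,
    "video_frames": 8,          # n = 8
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "cond_drop_prob": 0.20,     # drop pseudo-image / CLIP text latents
    "viewpoint_sim_prob": 0.50, # viewpoint transformation simulation
    "sampling_steps": 25,       # inference-time diffusion steps
    "eta": 1.0,                 # sampling stochasticity
}

def cfg_scale(novel_view: bool) -> float:
    """Classifier-free guidance scale per the paper: 1.0 on the recorded
    trajectory, 2.0 for novel cameras or new trajectories (stronger
    control from the 3D prior conditions)."""
    return 2.0 if novel_view else 1.0
```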