FreeVS: Generative View Synthesis on Free Driving Trajectory
Authors: Qitai Wang, Lue Fan, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the Waymo Open Dataset show that FreeVS has strong image synthesis performance on both the recorded trajectories and novel trajectories. |
| Researcher Affiliation | Academia | 1 School of Future Technology, University of Chinese Academy of Sciences (UCAS); 2 NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA); 3 CUHK; 4 Center for Artificial Intelligence and Robotics, HKISI, CAS. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the method and training process verbally and through equations, but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | Project Page & Code: https://freevs24.github.io/ |
| Open Datasets | Yes | Experiments on the Waymo Open Dataset show that FreeVS has strong image synthesis performance on both the recorded trajectories and novel trajectories. |
| Dataset Splits | Yes | For the front-view or multi-view novel frame synthesis benchmark (Fig. 3(a) and (b)), we sample every fourth frame in driving sequences as test frames. All the remaining frames are used for training NVS counterparts, or as input frames for FreeVS. Under the novel camera synthesis benchmark, we reserve all the front-side camera views as test views and use the front and side camera views as train views throughout each sequence. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA L20 GPUs. The training costs of generalizable reconstruction methods are measured on 2 RTX 3090 GPUs, while the training cost of FreeVS is measured on 8 NVIDIA L20 GPUs. Similarly, the inference efficiency of previous methods / FreeVS is measured on an RTX 3090 / L20 GPU, respectively. |
| Software Dependencies | No | The paper mentions using Stable Video Diffusion, Stable Diffusion checkpoints, and specific optimizer (AdamW) and model backbones (ConVNext-T, CLIP-vision model), but does not provide version numbers for programming languages, libraries, or frameworks like PyTorch or TensorFlow, which are essential for reproducibility. |
| Experiment Setup | Yes | We train the model for 40,000 iterations with a batch size of 8 and video frame length n = 8. We use the AdamW optimizer (Kingma & Ba, 2014) with a learning rate of 1×10⁻⁴. During training, we randomly drop the pseudo-image condition latent as well as the CLIP text description latent with a probability of 20%. We enable the viewpoint transformation simulation with a probability of 50%. During inference, we set the number of sampling steps to 25 and stochasticity η = 1.0. When synthesizing images on the existing trajectory, we set the classifier-free guidance (CFG) (Ho & Salimans, 2022) scale to 1.0. For synthesizing images on novel cameras and new trajectories, we enlarge the CFG scale to 2.0 to strengthen the control of 3D prior conditions over the generated results. |
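The "Dataset Splits" row describes a deterministic per-sequence split: every fourth frame is held out for testing and the rest are used for training or as input frames. A minimal sketch of that rule, assuming the split starts at the first frame of each sequence (the offset is not specified in the paper):

```python
def split_sequence(frame_ids):
    """Hold out every fourth frame as a test frame (offset 0 assumed);
    all remaining frames are train/input frames."""
    test = [f for i, f in enumerate(frame_ids) if i % 4 == 0]
    train = [f for i, f in enumerate(frame_ids) if i % 4 != 0]
    return train, test

train, test = split_sequence(list(range(12)))
print(test)   # [0, 4, 8]
print(train)  # [1, 2, 3, 5, 6, 7, 9, 10, 11]
```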
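The "Experiment Setup" row varies the classifier-free guidance scale (1.0 on recorded trajectories, 2.0 on novel cameras/trajectories). A minimal sketch of the standard CFG combination from Ho & Salimans (2022), which the paper's inference presumably follows; the function name and list-based inputs are illustrative, not from the paper:

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_u + s * (eps_c - eps_u), elementwise.
    scale=1.0 reduces to the conditional prediction; scale>1.0
    extrapolates away from the unconditional one, strengthening
    the conditioning signal."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 0.0]   # unconditional noise prediction
eps_c = [1.0, -1.0]  # conditional noise prediction

print(cfg_combine(eps_u, eps_c, 1.0))  # [1.0, -1.0]
print(cfg_combine(eps_u, eps_c, 2.0))  # [2.0, -2.0]
```

At scale 1.0 the pseudo-image condition contributes no extra push, which matches the paper's choice for synthesis on the recorded trajectory, where the input frames already constrain the output strongly.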