OmniRe: Omni Urban Scene Reconstruction

Authors: Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, Li Song, Yue Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations on the Waymo dataset show that our approach outperforms prior state-of-the-art methods quantitatively and qualitatively by a large margin. We further extend our results to 5 additional popular driving datasets to demonstrate its generalizability on common urban scenes. Code and results are available at omnire. We perform extensive experiments and ablations to demonstrate the benefits of our holistic framework. OmniRe achieves state-of-the-art performance in scene reconstruction and novel view synthesis (NVS), significantly outperforming previous methods in terms of full image metrics (+1.88 PSNR for reconstruction and +2.38 PSNR for NVS). The differences are pronounced for dynamic actors, such as vehicles (+1.18 PSNR) and humans (+4.09 PSNR for reconstruction and +3.06 PSNR for NVS) (Tab. 1).
Researcher Affiliation Collaboration Ziyu Chen1,6 Jiawei Yang6 Jiahui Huang5 Riccardo de Lutio5 Janick Martinez Esturo5 Boris Ivanovic5 Or Litany2,5 Zan Gojcic5 Sanja Fidler3,5 Marco Pavone4,5 Li Song1 Yue Wang5,6 1Shanghai Jiao Tong University 2Technion 3University of Toronto 4Stanford University 5NVIDIA Research 6University of Southern California
Pseudocode No The paper describes methods and processes in detail, particularly in Section 4 and its subsections, but it does not present these as structured pseudocode or algorithm blocks. The description of the human body pose processing in 4.2 uses flowcharts, but no formal pseudocode.
Open Source Code Yes Code and results are available at omnire. To ensure reproducibility, the code is available at link.
Open Datasets Yes Dataset. We conduct experiments on the Waymo Open Dataset (Sun et al., 2020), which comprises real-world driving logs. We tested up to 32 dynamic scenes in Waymo, including eight highly complex dynamic scenes that, in addition to typical vehicles, also contain diverse dynamic classes such as pedestrians and cyclists. Each selected segment contains approximately 150 frames. The segment IDs are listed in Tab. 12 and Tab. 6. To further demonstrate our effectiveness on common driving scenes, we extend our results to 5 additional popular driving datasets: NuScenes (Caesar et al., 2020), Argoverse2 (Wilson et al., 2023), PandaSet (Xiao et al., 2021), KITTI (Geiger et al., 2012), and nuPlan (Caesar et al., 2021).
Dataset Splits Yes Appearance. We evaluate our method on scene reconstruction and novel view synthesis (NVS) tasks, using every 10th frame as the held-out test set for NVS.
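The every-10th-frame hold-out protocol can be sketched as follows. This is a minimal illustration, assuming frame indices start at 0 and that index 0 itself is held out; the helper name `split_frames` is hypothetical, and the ~150-frame segment length comes from the paper's dataset description.

```python
def split_frames(num_frames, test_every=10):
    """Hold out every `test_every`-th frame for NVS evaluation;
    the remaining frames form the reconstruction training set.
    (Illustrative sketch, not the authors' released code.)"""
    test = [i for i in range(num_frames) if i % test_every == 0]
    train = [i for i in range(num_frames) if i % test_every != 0]
    return train, test

# A ~150-frame Waymo segment yields 15 held-out NVS test frames.
train_ids, test_ids = split_frames(150)
```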
Hardware Specification Yes Our method runs on a single NVIDIA RTX 4090 GPU, with training for each scene taking about 1 hour. Training time varies with different training settings.
Software Dependencies No The paper mentions using Segformer (Xie et al., 2021) and 4D-Humans (Goel et al., 2023), as well as a specific GPT model (GPT-4o (Achiam et al., 2023)), but does not provide specific version numbers for these or other software libraries (e.g., PyTorch, CUDA, Python versions) that would be needed for replication.
Experiment Setup Yes Training: Our method trains for 30,000 iterations with all scene nodes optimized jointly. The learning rate for Gaussian properties aligns with the default settings of 3DGS (Kerbl et al., 2023), but varies slightly across different node types. Specifically, we set the learning rate for the rotation of Gaussians to 5×10^-5 for non-rigid SMPL nodes and 1×10^-5 for other nodes. The degrees of spherical harmonics are set to 3 for background nodes, rigid nodes, and non-rigid deformable nodes, while it is set to 1 for non-rigid SMPL nodes. The learning rate for the rotation of instance boxes is 1×10^-5, decreasing exponentially to 5×10^-6. The learning rate for the translation of instance boxes is 5×10^-4, decreasing exponentially to 1×10^-4. The learning rate for human body poses of non-rigid SMPL nodes is 5×10^-5, decreasing exponentially to 1×10^-5. For the Gaussian densification strategy, we utilize the absolute gradient of Gaussians introduced in Ye et al. (2024) to control memory usage. We set the densification threshold of the position gradient to 3×10^-4. This use of the absolute gradient has minimal impact on performance, as discussed in detail in Appendix D.4. The densification threshold for scaling is 3×10^-3. Our method runs on a single NVIDIA RTX 4090 GPU, with training for each scene taking about 1 hour; training time varies with different training settings. Optimization: We utilize the loss function introduced in Eq. (7) to jointly optimize all learnable parameters. The image loss is computed as: L_image = (1 - λ_r) L_1 + λ_r L_SSIM (8). Due to the sparse temporal-spatial observation of the dynamic part, its supervision signal is insufficient. To address this, we apply a higher image loss weight to the dynamic regions identified by the rendered dynamic mask; this weight is set to 5. The depth map loss is computed as: L_depth = (1/hw) Σ ||D_s - D̂||_1 (9), where D_s is the inverse of the sparse depth map.
We project LiDAR points onto the image plane to generate the sparse LiDAR depth map, and D̂ is the inverse of the predicted depth map. The mask loss L_opacity is computed as: L_opacity = -(1/hw) Σ O_G log O_G - (1/hw) Σ M_sky log(1 - O_G) (10), where M_sky is the sky mask and O_G is the rendered opacity map. In addition to the reconstruction losses, we introduce various regularization terms for different Gaussian representations to improve quality. Among these, an important regularization term is L_pose, designed to ensure smooth human body poses θ(t). This term is defined as: L_pose = ||θ(t - δ) + θ(t + δ) - 2θ(t)||_1 (11), where δ is a randomly chosen integer from {1, 2, 3, 4, 5}. We set the weight of the SSIM loss, λ_r, to 0.2; the depth loss weight, λ_depth, to 0.1; the opacity loss weight, λ_opacity, to 0.05; and the pose smoothness loss weight, λ_pose, to 0.01.
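The depth term of Eq. (9) and the pose-smoothness term of Eq. (11) can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the depth loss is averaged over all h×w pixels with LiDAR coverage expressed via a validity mask, and the pose term is the plain second-difference L1 penalty read off the equation.

```python
import numpy as np

def depth_loss(d_sparse_inv, d_pred_inv, valid):
    """Eq. (9) sketch: L1 between inverse sparse LiDAR depth and inverse
    predicted depth, restricted to pixels with LiDAR hits, divided by hw."""
    h, w = d_pred_inv.shape
    return np.abs((d_sparse_inv - d_pred_inv) * valid).sum() / (h * w)

def pose_smoothness(theta, t, delta):
    """Eq. (11) sketch: second-difference L1 penalty on the SMPL body-pose
    sequence theta, encouraging temporally smooth motion."""
    return np.abs(theta[t - delta] + theta[t + delta] - 2 * theta[t]).sum()
```

A linearly varying pose sequence has zero second difference, so the smoothness penalty vanishes there, which matches the intent of the regularizer.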